Plan: GH-0027 Hybrid Ingestion and Canonicalization

Context

This issue upgrades ingestion so one run produces both semantic and lexical retrieval artifacts while preserving source-language provenance.

Problem

The backend needs ingestion output that serves both semantic and lexical search. Retrieval quality and traceability suffer when canonicalization is applied inconsistently or when the boundary between what Postgres stores and what Pinecone stores is unclear.

Current state in repo

app/services/hybrid_ingestion.py prepares canonical-English semantic and lexical payloads.
app/services/ingestion.py coordinates ingestion jobs.
db/migrations/20260301000030_hybrid_ingestion_canonicalization.sql adds canonicalization and hybrid fields.
tests/services/test_hybrid_ingestion.py covers canonicalization and payload generation.

Target state

Each ingestion run produces aligned semantic and lexical artifacts.
Canonical English text is stored for retrieval while source-language provenance remains available.
Hybrid readiness is explicit at the document and chunk level.
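
To make the target state concrete, a chunk record could carry both the canonical text and its provenance, with readiness derived from explicit flags. This is a minimal sketch only: ChunkRecord, its field names, and document_hybrid_ready are hypothetical and are not taken from the migration or from app/services/hybrid_ingestion.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRecord:
    """Hypothetical shape of one ingested chunk (names are assumptions)."""

    chunk_id: str
    canonical_text: str   # English text used for retrieval
    source_text: str      # original text, kept for provenance
    source_language: str  # language code of the original, e.g. "de"
    semantic_ready: bool = False  # semantic payload has been written
    lexical_ready: bool = False   # lexical row has been written

    @property
    def hybrid_ready(self) -> bool:
        # A chunk is hybrid-ready only when both artifacts exist.
        return self.semantic_ready and self.lexical_ready


def document_hybrid_ready(chunks: list[ChunkRecord]) -> bool:
    # A document with no chunks is treated as not ready.
    return bool(chunks) and all(chunk.hybrid_ready for chunk in chunks)
```

Deriving document-level readiness from the chunks avoids a second flag that could drift out of sync, though a stored column may still be preferable for query speed.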

Constraints

Backend-only scope.
Postgres remains the canonical store for chunk text and provenance.
Pinecone metadata should stay lightweight.
The ingestion path must remain idempotent and rollback-safe on partial failures.
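
One way to support the idempotency and lightweight-metadata constraints is to derive chunk identifiers deterministically and keep chunk text out of the vector metadata. The sketch below is an assumption, not the repo's scheme: chunk_id, pinecone_metadata, and the metadata keys are hypothetical.

```python
import hashlib


def chunk_id(document_id: str, chunk_index: int, source_text: str) -> str:
    """Derive a stable ID so a rerun overwrites rather than duplicates."""
    raw = f"{document_id}:{chunk_index}:{source_text}".encode("utf-8")
    digest = hashlib.sha256(raw).hexdigest()
    return f"{document_id}-{chunk_index}-{digest[:16]}"


def pinecone_metadata(document_id: str, chunk_index: int, source_language: str) -> dict:
    """Keep metadata to lookup keys only; chunk text stays in Postgres."""
    return {
        "document_id": document_id,
        "chunk_index": chunk_index,
        "source_language": source_language,
    }
```

Hashing the source text rather than the canonical text keeps the ID stable even if the canonicalization output changes between runs; whether that is the desired behavior is a design choice this plan leaves open.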

Proposed approach

Canonicalize chunk content into English while preserving original text metadata.
Build Pinecone semantic payloads and Postgres lexical rows from the same chunk set.
Verify payload count alignment before finalizing the ingestion run.
Roll back semantic writes if lexical persistence fails later in the pipeline.
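
The four steps above can be sketched as follows. This is an illustration under stated assumptions, not the code in app/services/hybrid_ingestion.py: build_payloads, persist_hybrid, PayloadCountMismatch, the dict keys, and the store objects (anything with upsert/delete and insert_rows methods) are all hypothetical, and the real Pinecone and Postgres clients are not called here.

```python
from typing import Callable


class PayloadCountMismatch(Exception):
    """Raised when semantic and lexical outputs differ in count."""


def build_payloads(chunks: list[dict], canonicalize: Callable[[str, str], str]):
    """Build both outputs from the same chunk set, keeping the original text."""
    semantic, lexical = [], []
    for chunk in chunks:
        english = canonicalize(chunk["text"], chunk["language"])
        semantic.append({"id": chunk["id"], "text": english})
        lexical.append({
            "chunk_id": chunk["id"],
            "canonical_text": english,
            "source_text": chunk["text"],          # provenance
            "source_language": chunk["language"],  # provenance
        })
    return semantic, lexical


def persist_hybrid(semantic: list[dict], lexical: list[dict],
                   semantic_store, lexical_store) -> int:
    """Check alignment, write both stores, undo semantic writes on failure."""
    if len(semantic) != len(lexical):
        raise PayloadCountMismatch(
            f"{len(semantic)} semantic payloads vs {len(lexical)} lexical rows"
        )
    ids = [payload["id"] for payload in semantic]
    semantic_store.upsert(semantic)
    try:
        lexical_store.insert_rows(lexical)
    except Exception:
        # Lexical persistence failed: remove the semantic writes, then re-raise.
        semantic_store.delete(ids)
        raise
    return len(ids)
```

Note that the rollback here deletes by ID, so on a rerun it would also remove vectors written by an earlier successful run; a production path would likely need to restore the previous state or write lexical rows first inside a transaction.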

Risks

Translation fallback can reduce retrieval quality if canonicalization is poor.
Partial failures can leave hybrid state inconsistent if rollback paths are incomplete.
Extra ingestion work can increase latency and operational cost.

Open questions

Should canonicalization always target English, or should this be configurable later?
Which provider should be used, and what confidence threshold should trigger fallback behavior?

Acceptance criteria

A plan doc exists for #27 under docs/plans/.
The doc identifies dual semantic and lexical outputs as required.
The doc states provenance preservation and idempotent reruns as constraints.
The current migration and test touchpoints are named.

Files likely to change

docs/plans/gh-0027-hybrid-ingestion-canonicalization.md
app/services/hybrid_ingestion.py
app/services/ingestion.py
db/migrations/20260301000030_hybrid_ingestion_canonicalization.sql
db/bootstrap.sql
tests/services/test_hybrid_ingestion.py

#27 - [Backend][Ingestion] Hybrid index + English canonicalization in ingest endpoint

Status

Backfilled planning stub