Ingestion Hybrid Index + Canonical English (Issue #27)
What this adds
The ingestion pipeline now prepares both hybrid retrieval artifacts from the same ingest run:
- Semantic payload (Pinecone-ready vectors + lightweight metadata)
- Lexical payload (Postgres chunks rows, FTS/BM25-ready)
It also canonicalizes chunk content to English while preserving source-language provenance.
Canonicalization contract
For each chunk we now persist:
- original_content
- content (canonical English text)
- source_language
- canonical_language (en)
- canonicalization_provider (gpt-mini, fallback-original, none)
- canonicalization_confidence
- canonicalization_fallback
This keeps retrieval canonicalized while preserving traceability back to source text.
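As a rough illustration of the contract above, the per-chunk fields could be modeled as a small record type. This is a sketch, not the pipeline's actual model: the class name CanonicalChunk, the ALLOWED_PROVIDERS constant, and the field types (float confidence, boolean fallback flag) are assumptions; only the field names and provider values come from the list above.

```python
from dataclasses import dataclass
from typing import Optional

# Provider values listed in the contract above.
ALLOWED_PROVIDERS = {"gpt-mini", "fallback-original", "none"}


@dataclass(frozen=True)
class CanonicalChunk:
    """Hypothetical record holding the persisted canonicalization fields."""

    original_content: str            # source text as ingested
    content: str                     # canonical English text
    source_language: str             # e.g. "es"
    canonical_language: str = "en"
    canonicalization_provider: str = "none"
    canonicalization_confidence: Optional[float] = None  # type is an assumption
    canonicalization_fallback: bool = False              # type is an assumption

    def __post_init__(self) -> None:
        # Reject provider values outside the documented set.
        if self.canonicalization_provider not in ALLOWED_PROVIDERS:
            raise ValueError(
                f"unknown canonicalization_provider: {self.canonicalization_provider!r}"
            )


translated = CanonicalChunk(
    original_content="Hola mundo",
    content="Hello world",
    source_language="es",
    canonicalization_provider="gpt-mini",
    canonicalization_confidence=0.9,
)
```

Keeping original_content next to content on the same record is what lets a retrieval hit be traced back to the source-language text.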
Hybrid readiness checks
Before upsert, ingestion verifies semantic and lexical payload counts are aligned:
len(semantic_payload) == len(lexical_rows)
If a mismatch occurs, the job fails fast with an explicit error.
If the semantic upsert succeeds but a later step (lexical upsert or finalization) fails, ingestion performs a semantic rollback (delete_by_chunk_ids) to avoid leaving a partial hybrid state.
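The check-then-rollback flow can be sketched roughly as follows. This is a minimal illustration under stated assumptions: HybridConsistencyError, check_hybrid_alignment, and upsert_hybrid are hypothetical names, the upsert functions are passed in as plain callables, and each semantic payload item is assumed to be a mapping with a chunk_id key. Only the length comparison and the delete_by_chunk_ids rollback come from the description above.

```python
from typing import Any, Callable, Mapping, Sequence

Item = Mapping[str, Any]


class HybridConsistencyError(RuntimeError):
    """Raised when semantic and lexical payload counts differ (hypothetical name)."""


def check_hybrid_alignment(semantic_payload: Sequence[Item],
                           lexical_rows: Sequence[Item]) -> None:
    # Fail fast before any write happens.
    if len(semantic_payload) != len(lexical_rows):
        raise HybridConsistencyError(
            f"semantic={len(semantic_payload)} lexical={len(lexical_rows)}"
        )


def upsert_hybrid(
    semantic_payload: Sequence[Item],
    lexical_rows: Sequence[Item],
    semantic_upsert: Callable[[Sequence[Item]], None],
    lexical_upsert: Callable[[Sequence[Item]], None],
    delete_by_chunk_ids: Callable[[Sequence[str]], None],
) -> None:
    check_hybrid_alignment(semantic_payload, lexical_rows)
    semantic_upsert(semantic_payload)
    try:
        lexical_upsert(lexical_rows)
    except Exception:
        # Semantic write already landed: undo it so no partial hybrid state remains.
        delete_by_chunk_ids([item["chunk_id"] for item in semantic_payload])
        raise
```

Re-raising after the rollback keeps the original failure visible to the job runner instead of masking it behind the cleanup.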
Stage telemetry
INGESTION_STAGE structured logs are emitted for:
- reference_load
- parse_and_chunk
- canonicalization
- embedding_generation
- consistency_check
- semantic_upsert
- lexical_upsert
- finalize
Each stage log includes timing (duration_ms) and input/output counts, plus stage-specific extras (namespace, lexical stats, stale deletions, etc.).
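One way such stage logs could be produced is a context manager that times the stage and emits a single JSON line on exit. This is a sketch only: the helper name ingestion_stage, the logger name, and the event, input_count, and output_count keys are assumptions; duration_ms and the INGESTION_STAGE label are taken from the description above.

```python
import json
import logging
import time
from contextlib import contextmanager
from typing import Any, Dict, Iterator

logger = logging.getLogger("ingestion")  # logger name is an assumption


@contextmanager
def ingestion_stage(stage: str, input_count: int) -> Iterator[Dict[str, Any]]:
    """Time one stage and emit a structured log record when it ends."""
    record: Dict[str, Any] = {
        "event": "INGESTION_STAGE",
        "stage": stage,
        "input_count": input_count,
        "output_count": None,
    }
    started = time.perf_counter()
    try:
        # The caller fills in output_count and any stage-specific extras.
        yield record
    finally:
        record["duration_ms"] = round((time.perf_counter() - started) * 1000, 3)
        logger.info(json.dumps(record))


logging.basicConfig(level=logging.INFO)
with ingestion_stage("parse_and_chunk", input_count=1) as stage:
    chunks = ["chunk-a", "chunk-b", "chunk-c"]  # stand-in for real chunking
    stage["output_count"] = len(chunks)
    stage["namespace"] = "demo"                 # example of a stage-specific extra
```

Emitting in a finally block means a failing stage still logs its duration, which helps when diagnosing where a run stopped.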
Idempotent reruns
Chunk IDs are deterministic (sha256(reference_id:page_number:chunk_index)), and lexical rows are upserted by chunk_id.
On rerun, stale chunk rows no longer present in the current payload are deleted, keeping lexical index state consistent with the latest ingest.
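The rerun behavior above can be illustrated with a short sketch. The function names make_chunk_id and stale_chunk_ids are hypothetical, and the UTF-8 encoding and hex-digest output format are assumptions; the reference_id:page_number:chunk_index key layout is the one stated above.

```python
import hashlib
from typing import Iterable, Set


def make_chunk_id(reference_id: str, page_number: int, chunk_index: int) -> str:
    """Derive a stable ID: same inputs always give the same digest."""
    key = f"{reference_id}:{page_number}:{chunk_index}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def stale_chunk_ids(existing_ids: Iterable[str],
                    current_ids: Iterable[str]) -> Set[str]:
    """IDs stored by an earlier run that the current payload no longer contains."""
    return set(existing_ids) - set(current_ids)


# First run produced three chunks on page 1; the rerun produces only two.
previous = [make_chunk_id("ref-1", 1, i) for i in range(3)]
current = [make_chunk_id("ref-1", 1, i) for i in range(2)]
to_delete = stale_chunk_ids(previous, current)
```

Because the IDs depend only on position within the reference, unchanged chunks are overwritten in place by the upsert and only the leftover rows need deleting.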
Schema migration
Migration: db/migrations/20260301000030_hybrid_ingestion_canonicalization.sql
Adds canonicalization/provenance + BM25-readiness columns on chunks, and hybrid-readiness fields on documents.
Rollback guidance
If rollback is required:
- Pause ingestion workers/jobs that rely on hybrid/canonicalization fields.
- Drop migration-created indexes (idx_chunks_source_language, idx_chunks_canonical_language, idx_chunks_bm25_term_count, idx_documents_hybrid_index_ready).
- Remove migration-added columns from chunks and documents only after confirming no runtime code path depends on them.
This rollback is destructive for hybrid/canonical telemetry history, so capture any needed analytics snapshots before column removal.
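For the index-drop step, the statements could be generated from the index names listed above, as in this small sketch. The helper drop_index_statements is hypothetical, and it assumes the indexes live in the default schema. Column removal is deliberately left out, since the exact column list is defined in the migration file.

```python
from typing import List, Sequence

# Index names as listed in the rollback guidance above.
MIGRATION_INDEXES = (
    "idx_chunks_source_language",
    "idx_chunks_canonical_language",
    "idx_chunks_bm25_term_count",
    "idx_documents_hybrid_index_ready",
)


def drop_index_statements(index_names: Sequence[str] = MIGRATION_INDEXES) -> List[str]:
    """Build one idempotent DROP INDEX statement per index."""
    return [f"DROP INDEX IF EXISTS {name};" for name in index_names]


for statement in drop_index_statements():
    print(statement)
```

IF EXISTS makes the statements safe to rerun if a rollback is interrupted partway through.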