Ingestion Hybrid Index + Canonical English (Issue #27)

What this adds

The ingestion pipeline now prepares both hybrid retrieval artifacts from the same ingest run:

Semantic payload (Pinecone-ready vectors + lightweight metadata)
Lexical payload (Postgres chunks rows, FTS/BM25-ready)

It also canonicalizes chunk content to English while preserving source-language provenance.

Canonicalization contract

For each chunk we now persist:

original_content
content (canonical English text)
source_language
canonical_language (en)
canonicalization_provider (gpt-mini, fallback-original, none)
canonicalization_confidence
canonicalization_fallback

This keeps retrieval canonicalized while preserving traceability back to source text.
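As a rough illustration of the contract above, the per-chunk record could be modeled like this. The field names come from the list above; the Python types, the defaults, and the validation of the provider value are assumptions made for the sketch, not the pipeline's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

# Provider values listed in the contract above.
ALLOWED_PROVIDERS = {"gpt-mini", "fallback-original", "none"}


@dataclass(frozen=True)
class CanonicalizedChunk:
    """Hypothetical in-memory shape of one canonicalized chunk."""

    original_content: str           # text as it appeared in the source
    content: str                    # canonical English text used for retrieval
    source_language: str            # e.g. "es"; the code format is an assumption
    canonical_language: str = "en"
    canonicalization_provider: str = "none"
    canonicalization_confidence: Optional[float] = None  # type is an assumption
    canonicalization_fallback: bool = False              # type is an assumption

    def __post_init__(self) -> None:
        # Reject provider values outside the documented set.
        if self.canonicalization_provider not in ALLOWED_PROVIDERS:
            raise ValueError(
                f"unknown canonicalization_provider: "
                f"{self.canonicalization_provider!r}"
            )
```

Keeping original_content next to content is what allows a retrieved English chunk to be traced back to its source text.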

Hybrid readiness checks

Before upsert, ingestion verifies semantic and lexical payload counts are aligned:

len(semantic_payload) == len(lexical_rows)

If the counts differ, the job fails fast with an explicit error.
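A minimal sketch of such a check, assuming both payloads are plain sequences. HybridConsistencyError and check_hybrid_alignment are hypothetical names chosen for the example; the real pipeline may use a different exception type and message.

```python
from typing import Sequence


class HybridConsistencyError(RuntimeError):
    """Hypothetical error raised when the two payloads are out of step."""


def check_hybrid_alignment(
    semantic_payload: Sequence[object],
    lexical_rows: Sequence[object],
) -> None:
    """Fail fast if the semantic and lexical payload counts differ."""
    if len(semantic_payload) != len(lexical_rows):
        raise HybridConsistencyError(
            "hybrid payload mismatch: "
            f"{len(semantic_payload)} semantic vectors vs "
            f"{len(lexical_rows)} lexical rows"
        )
```

Running the check before either upsert means a mismatch never reaches the vector store or Postgres.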

If the semantic upsert succeeds but a later lexical upsert or finalization step fails, ingestion rolls back the semantic side (delete_by_chunk_ids) to avoid leaving a partial hybrid state.
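The rollback ordering can be sketched as follows. Only the name delete_by_chunk_ids comes from this document; its signature, the upsert_with_rollback helper, and the callables passed in are assumptions standing in for the real clients.

```python
from typing import Callable, List, Sequence


def upsert_with_rollback(
    chunk_ids: Sequence[str],
    semantic_upsert: Callable[[], None],
    lexical_upsert: Callable[[], None],
    finalize: Callable[[], None],
    delete_by_chunk_ids: Callable[[List[str]], None],
) -> None:
    """Run the three write stages; undo the semantic write if a later one fails."""
    # If this first call raises, nothing has been written yet, so no rollback.
    semantic_upsert()
    try:
        lexical_upsert()
        finalize()
    except Exception:
        # Semantic vectors are already stored: remove them, then re-raise
        # so the job still surfaces the original failure.
        delete_by_chunk_ids(list(chunk_ids))
        raise
```

Re-raising after the delete keeps the original error visible to whatever supervises the job.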

Stage telemetry

INGESTION_STAGE structured logs are emitted for:

reference_load
parse_and_chunk
canonicalization
embedding_generation
consistency_check
semantic_upsert
lexical_upsert
finalize

Each stage log includes timing (duration_ms) and input/output counts, plus stage-specific extras (namespace, lexical stats, stale deletions, etc.).
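One possible way to emit a stage log of this shape is a small context manager. The INGESTION_STAGE marker, the stage names, and duration_ms come from this document; the JSON layout, the event, input_count and output_count keys, and the logger name are assumptions for illustration.

```python
import json
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("ingestion")  # logger name is an assumption


@contextmanager
def ingestion_stage(stage: str, input_count: int, **extras):
    """Time one stage and emit a single structured log line when it ends."""
    record = {
        "event": "INGESTION_STAGE",
        "stage": stage,
        "input_count": input_count,
        **extras,  # stage-specific extras, e.g. namespace
    }
    start = time.perf_counter()
    try:
        # The caller can add fields such as output_count to the record.
        yield record
    finally:
        record["duration_ms"] = round((time.perf_counter() - start) * 1000, 3)
        logger.info(json.dumps(record))
```

A stage would then wrap its work in "with ingestion_stage("parse_and_chunk", input_count=1) as rec:" and set rec["output_count"] before leaving the block. Because the log is written in a finally clause, a stage that raises still reports its duration.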

Idempotent reruns

Chunk IDs are deterministic (sha256(reference_id:page_number:chunk_index)), and lexical rows are upserted by chunk_id.

On rerun, chunk rows that are no longer present in the current payload are treated as stale and deleted, keeping the lexical index consistent with the latest ingest.
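The ID scheme and the stale-row computation can be sketched as below. The colon-joined key follows the formula above; the UTF-8 encoding, the hex digest output, and the helper names chunk_id and stale_chunk_ids are assumptions for the example.

```python
import hashlib
from typing import Iterable, Set


def chunk_id(reference_id: str, page_number: int, chunk_index: int) -> str:
    """Deterministic ID: the same inputs always produce the same digest."""
    key = f"{reference_id}:{page_number}:{chunk_index}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def stale_chunk_ids(existing_ids: Iterable[str], current_ids: Iterable[str]) -> Set[str]:
    """IDs stored by an earlier run that the current payload no longer contains."""
    return set(existing_ids) - set(current_ids)
```

Since the ID depends only on reference_id, page_number, and chunk_index, re-ingesting an unchanged document upserts onto the same rows, and only chunks that dropped out of the payload are removed.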

Schema migration

Migration: db/migrations/20260301000030_hybrid_ingestion_canonicalization.sql

Adds canonicalization/provenance + BM25-readiness columns on chunks, and hybrid-readiness fields on documents.

Rollback guidance

If rollback is required:

Pause ingestion workers/jobs that rely on hybrid/canonicalization fields.
Drop migration-created indexes (idx_chunks_source_language, idx_chunks_canonical_language, idx_chunks_bm25_term_count, idx_documents_hybrid_index_ready).
Remove migration-added columns from chunks and documents only after confirming no runtime code path depends on them.

This rollback destroys hybrid/canonicalization telemetry history, so capture any needed analytics snapshots before removing the columns.