Ingestion Hybrid Index + Canonical English (Issue #27)

What this adds

The ingestion pipeline now prepares both hybrid retrieval artifacts from the same ingest run:

  1. Semantic payload (Pinecone-ready vectors + lightweight metadata)
  2. Lexical payload (rows for the Postgres chunks table, FTS/BM25-ready)

It also canonicalizes chunk content to English while preserving source-language provenance.

Canonicalization contract

For each chunk we now persist:

  • original_content
  • content (canonical English text)
  • source_language
  • canonical_language (en)
  • canonicalization_provider (one of gpt-mini, fallback-original, none)
  • canonicalization_confidence
  • canonicalization_fallback

This keeps retrieval canonicalized while preserving traceability back to source text.
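
As a rough illustration of the contract above, the per-chunk record could be modeled as follows. This is a minimal sketch, not the pipeline's actual code: the CanonicalChunk class, the canonicalize_chunk helper, the translate callable, the field types, and the reading of when each provider value applies are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical translator: (text, source_language) -> (english_text, confidence).
Translator = Callable[[str, str], Tuple[str, float]]


@dataclass(frozen=True)
class CanonicalChunk:
    original_content: str   # source text, kept untouched for traceability
    content: str            # canonical English text used for retrieval
    source_language: str
    canonical_language: str = "en"
    canonicalization_provider: str = "none"  # gpt-mini | fallback-original | none
    canonicalization_confidence: Optional[float] = None  # float type is an assumption
    canonicalization_fallback: bool = False               # bool type is an assumption


def canonicalize_chunk(text: str, source_language: str,
                       translate: Translator) -> CanonicalChunk:
    # Assumption: English input needs no translation, so the provider is "none".
    if source_language == "en":
        return CanonicalChunk(text, text, "en")
    try:
        english, confidence = translate(text, source_language)
    except Exception:
        # Assumption: on failure the original text is kept and flagged as a fallback.
        return CanonicalChunk(
            original_content=text,
            content=text,
            source_language=source_language,
            canonicalization_provider="fallback-original",
            canonicalization_fallback=True,
        )
    return CanonicalChunk(
        original_content=text,
        content=english,
        source_language=source_language,
        canonicalization_provider="gpt-mini",
        canonicalization_confidence=confidence,
    )
```

In this sketch original_content is never modified, so a retrieval hit on content can always be traced back to the source-language text.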

Hybrid readiness checks

Before upsert, ingestion verifies semantic and lexical payload counts are aligned:

  • len(semantic_payload) == len(lexical_rows)

If the counts differ, the job fails fast with an explicit error.

If the semantic upsert succeeds but a later lexical upsert or finalization step fails, ingestion rolls back the semantic writes (delete_by_chunk_ids) to avoid a partial hybrid state.
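
The check-then-rollback flow described in this section can be sketched as below. This is illustrative only: HybridConsistencyError, check_hybrid_alignment, upsert_hybrid, the injected callables, and the assumption that each semantic payload item is a dict with a "chunk_id" key are hypothetical. Only the count comparison and the delete_by_chunk_ids name come from the description above, and its real signature may differ.

```python
from typing import Callable, Mapping, Sequence

Item = Mapping[str, object]


class HybridConsistencyError(RuntimeError):
    """Raised when semantic and lexical payload counts differ (hypothetical name)."""


def check_hybrid_alignment(semantic_payload: Sequence[Item],
                           lexical_rows: Sequence[Item]) -> None:
    # Fail fast before any write if the two payloads are not the same size.
    if len(semantic_payload) != len(lexical_rows):
        raise HybridConsistencyError(
            f"semantic={len(semantic_payload)} lexical={len(lexical_rows)}"
        )


def upsert_hybrid(semantic_payload: Sequence[Item],
                  lexical_rows: Sequence[Item],
                  semantic_upsert: Callable[[Sequence[Item]], None],
                  lexical_upsert: Callable[[Sequence[Item]], None],
                  delete_by_chunk_ids: Callable[[Sequence[str]], None]) -> None:
    check_hybrid_alignment(semantic_payload, lexical_rows)
    semantic_upsert(semantic_payload)
    try:
        lexical_upsert(lexical_rows)
    except Exception:
        # Semantic write already landed: remove it so no partial hybrid state remains.
        delete_by_chunk_ids([str(item["chunk_id"]) for item in semantic_payload])
        raise
```

Re-raising after the rollback keeps the original failure visible to the job runner instead of masking it behind a successful cleanup.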

Stage telemetry

INGESTION_STAGE structured logs are emitted for:

  • reference_load
  • parse_and_chunk
  • canonicalization
  • embedding_generation
  • consistency_check
  • semantic_upsert
  • lexical_upsert
  • finalize

Each stage log includes timing (duration_ms) and input/output counts, plus stage-specific extras (namespace, lexical stats, stale deletions, etc.).
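
One way to emit such a stage log is a small context manager that times the stage and writes a single structured record. This is a sketch under assumptions: the ingestion_stage helper, the "ingestion" logger name, the JSON encoding, and the event, input_count, and output_count key names are hypothetical. Only the INGESTION_STAGE label, the stage names, and duration_ms come from the description above.

```python
import json
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("ingestion")  # hypothetical logger name


@contextmanager
def ingestion_stage(stage: str, input_count: int, **extras):
    # The caller can add output_count and stage-specific extras to the record.
    record = {
        "event": "INGESTION_STAGE",
        "stage": stage,
        "input_count": input_count,
        **extras,
    }
    start = time.perf_counter()
    try:
        yield record
    finally:
        # Emit the log even if the stage raised, so failed stages are timed too.
        record["duration_ms"] = round((time.perf_counter() - start) * 1000, 3)
        logger.info(json.dumps(record, sort_keys=True, default=str))


# Usage: wrap a stage and fill in its output count.
with ingestion_stage("parse_and_chunk", input_count=1, namespace="demo") as rec:
    rec["output_count"] = 12
```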

Idempotent reruns

Chunk IDs are deterministic (sha256(reference_id:page_number:chunk_index)), and lexical rows are upserted by chunk_id.

On rerun, stale chunk rows that are no longer present in the current payload are deleted, keeping the lexical index consistent with the latest ingest.
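
The ID scheme and stale-row detection can be sketched as follows. The key format reference_id:page_number:chunk_index comes from the description above. The UTF-8 encoding, the hex digest form, and the chunk_id and stale_chunk_ids helper names are assumptions for illustration.

```python
import hashlib
from typing import Iterable, Set


def chunk_id(reference_id: str, page_number: int, chunk_index: int) -> str:
    # Same inputs always hash to the same ID, so reruns overwrite rather than duplicate.
    key = f"{reference_id}:{page_number}:{chunk_index}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def stale_chunk_ids(existing_ids: Iterable[str], current_ids: Iterable[str]) -> Set[str]:
    # IDs stored by an earlier run that the current payload no longer produces.
    return set(existing_ids) - set(current_ids)


# Usage: a rerun that yields one chunk fewer on page 1 than the previous run.
previous = [chunk_id("ref-1", 1, i) for i in range(3)]
current = [chunk_id("ref-1", 1, i) for i in range(2)]
to_delete = stale_chunk_ids(previous, current)
```

Here to_delete contains only the ID for chunk index 2, which is the row a rerun would remove from the lexical index.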

Schema migration

Migration: db/migrations/20260301000030_hybrid_ingestion_canonicalization.sql

Adds canonicalization, provenance, and BM25-readiness columns to chunks, and hybrid-readiness fields to documents.

Rollback guidance

If rollback is required:

  1. Pause ingestion workers/jobs that rely on hybrid/canonicalization fields.
  2. Drop migration-created indexes (idx_chunks_source_language, idx_chunks_canonical_language, idx_chunks_bm25_term_count, idx_documents_hybrid_index_ready).
  3. Remove migration-added columns from chunks and documents only after confirming no runtime code path depends on them.

This rollback is destructive for hybrid/canonical telemetry history, so capture any needed analytics snapshots before column removal.
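
For step 2, the drop statements for the four indexes named above could be generated with a small helper. This is a sketch only: the drop_index_statements helper is hypothetical, it renders SQL text without executing it, and it assumes the indexes are reachable without schema qualification. Column removal (step 3) is deliberately left out because the column list lives in the migration file.

```python
from typing import List, Sequence

# Index names as listed in the rollback steps above.
MIGRATION_INDEXES = (
    "idx_chunks_source_language",
    "idx_chunks_canonical_language",
    "idx_chunks_bm25_term_count",
    "idx_documents_hybrid_index_ready",
)


def drop_index_statements(indexes: Sequence[str] = MIGRATION_INDEXES) -> List[str]:
    # IF EXISTS keeps the rollback safe to re-run after a partial attempt.
    return [f"DROP INDEX IF EXISTS {name};" for name in indexes]


for statement in drop_index_statements():
    print(statement)
```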