Plan: GH-0027 Hybrid Ingestion and Canonicalization
Context
This issue upgrades ingestion so one run produces both semantic and lexical retrieval artifacts while preserving source-language provenance.
Problem
The backend needs ingestion output that supports hybrid retrieval, but retrieval quality and traceability suffer if canonicalization and storage boundaries are inconsistent.
Current state in repo
- app/services/hybrid_ingestion.py prepares canonical-English semantic and lexical payloads.
- app/services/ingestion.py coordinates ingestion jobs.
- db/migrations/20260301000030_hybrid_ingestion_canonicalization.sql adds canonicalization and hybrid fields.
- tests/services/test_hybrid_ingestion.py covers canonicalization and payload generation.
Target state
- Each ingestion run produces aligned semantic and lexical artifacts.
- Canonical English text is stored for retrieval while source-language provenance remains available.
- Hybrid readiness is explicit at the document and chunk level.
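One way to make readiness explicit at both levels is to derive the document-level flag from per-chunk flags. The sketch below is illustrative only: `ChunkRecord`, `document_hybrid_ready`, and every field name are assumptions, not the schema added by the migration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRecord:
    """Hypothetical chunk shape; field names are illustrative, not the real schema."""

    chunk_id: str
    source_language: str  # provenance: language of the original text
    source_text: str      # provenance: original text as ingested
    canonical_text: str   # canonical English text used for retrieval
    semantic_ready: bool  # semantic (vector) payload has been written
    lexical_ready: bool   # lexical row has been written

    @property
    def hybrid_ready(self) -> bool:
        # A chunk is hybrid-ready only when both artifacts exist.
        return self.semantic_ready and self.lexical_ready


def document_hybrid_ready(chunks: list[ChunkRecord]) -> bool:
    """A document is hybrid-ready only if it has chunks and every chunk is ready."""
    return bool(chunks) and all(chunk.hybrid_ready for chunk in chunks)
```

Treating an empty document as not ready is a design choice in this sketch; it avoids reporting readiness for a run that produced no chunks.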
Constraints
- Backend-only scope.
- Postgres remains the canonical store for chunk text and provenance.
- Pinecone metadata should stay lightweight.
- The ingestion path must remain idempotent and rollback-safe on partial failures.
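A common way to get idempotent reruns is to derive chunk IDs deterministically, so that re-ingesting the same content overwrites existing records instead of duplicating them. The following is a minimal sketch under that assumption; `chunk_id`, `vector_metadata`, and the metadata keys are hypothetical names, not the repo's actual helpers.

```python
import hashlib


def chunk_id(document_id: str, chunk_index: int, canonical_text: str) -> str:
    """Derive a stable ID: the same inputs always yield the same ID."""
    digest = hashlib.sha256(canonical_text.encode("utf-8")).hexdigest()[:16]
    return f"{document_id}:{chunk_index}:{digest}"


def vector_metadata(document_id: str, chunk_index: int, source_language: str) -> dict:
    """Keep vector metadata to small lookup keys.

    Full chunk text and provenance are assumed to live in Postgres,
    keyed by the same chunk ID.
    """
    return {
        "document_id": document_id,
        "chunk_index": chunk_index,
        "source_language": source_language,
    }
```

Including a content digest in the ID means an edited chunk gets a new ID; whether stale IDs are cleaned up on rerun is left open here.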
Proposed approach
- Canonicalize chunk content into English while preserving original text metadata.
- Build Pinecone semantic payloads and Postgres lexical rows from the same chunk set.
- Verify payload count alignment before finalizing the ingestion run.
- Roll back semantic writes if lexical persistence fails later in the pipeline.
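The steps above can be sketched as a single function that writes both artifact sets or neither. This is a simplified illustration, not the code in app/services/hybrid_ingestion.py: `InMemoryStore` stands in for both Pinecone and Postgres, and `ingest` and the `canonicalize` callable are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryStore:
    """Stand-in for a real store; fail_on_write simulates a persistence failure."""

    rows: dict = field(default_factory=dict)
    fail_on_write: bool = False

    def upsert(self, items: dict) -> None:
        if self.fail_on_write:
            raise RuntimeError("simulated write failure")
        self.rows.update(items)

    def delete(self, keys) -> None:
        for key in keys:
            self.rows.pop(key, None)


def ingest(chunks: dict, canonicalize, semantic_store, lexical_store) -> int:
    """Ingest one run; chunks maps chunk ID to original (source-language) text."""
    # Canonicalize once so both payload sets come from the same chunk set.
    canonical = {cid: canonicalize(text) for cid, text in chunks.items()}

    semantic = {cid: {"chunk_id": cid} for cid in canonical}
    lexical = {
        cid: {"canonical_text": text, "source_text": chunks[cid]}
        for cid, text in canonical.items()
    }

    # Cheap guard against a payload builder silently dropping chunks.
    if len(semantic) != len(lexical):
        raise ValueError("semantic and lexical payload counts differ")

    semantic_store.upsert(semantic)
    try:
        lexical_store.upsert(lexical)
    except Exception:
        # Compensate: remove the semantic writes made by this run, then re-raise.
        semantic_store.delete(semantic.keys())
        raise
    return len(semantic)
```

Deleting by ID is the simplest compensation. A rerun over chunks that were already ingested would need to restore the previous semantic payloads rather than delete them, which this sketch does not attempt.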
Risks
- Translation fallback can reduce retrieval quality if canonicalization is poor.
- Partial failures can leave hybrid state inconsistent if rollback paths are incomplete.
- Extra ingestion work can increase latency and operational cost.
Open questions
- Should canonicalization always target English, or should this be configurable later?
- Which provider and confidence thresholds justify fallback behavior?
Acceptance criteria
- A plan doc exists for #27 under docs/plans/.
- The doc identifies dual semantic and lexical outputs as required.
- The doc states provenance preservation and idempotent reruns as constraints.
- The current migration and test touchpoints are named.
Files likely to change
- docs/plans/gh-0027-hybrid-ingestion-canonicalization.md
- app/services/hybrid_ingestion.py
- app/services/ingestion.py
- db/migrations/20260301000030_hybrid_ingestion_canonicalization.sql
- db/bootstrap.sql
- tests/services/test_hybrid_ingestion.py
Related issue
#27 - [Backend][Ingestion] Hybrid index + English canonicalization in ingest endpoint
Status
Backfilled planning stub