Plan: GH-0027 Hybrid Ingestion and Canonicalization

Context

This issue upgrades ingestion so one run produces both semantic and lexical retrieval artifacts while preserving source-language provenance.

Problem

The backend needs ingestion output that supports hybrid retrieval. Retrieval quality and traceability suffer when canonicalization is applied inconsistently or when it is unclear which store owns which data.

Current state in repo

  • app/services/hybrid_ingestion.py prepares canonical-English semantic and lexical payloads.
  • app/services/ingestion.py coordinates ingestion jobs.
  • db/migrations/20260301000030_hybrid_ingestion_canonicalization.sql adds canonicalization and hybrid fields.
  • tests/services/test_hybrid_ingestion.py covers canonicalization and payload generation.

Target state

  • Each ingestion run produces aligned semantic and lexical artifacts.
  • Canonical English text is stored for retrieval while source-language provenance remains available.
  • Hybrid readiness is explicit at the document and chunk level.
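One way to picture the target state is a per-chunk record that carries both the canonical English text and the source-language provenance, with explicit readiness flags. The sketch below is illustrative only: `ChunkRecord`, its field names, and `document_hybrid_ready` are hypothetical and are not taken from the repo schema or the migration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ChunkRecord:
    """Hypothetical per-chunk row; field names are illustrative only."""

    chunk_id: str
    canonical_text: str    # English text used for retrieval
    source_text: str       # original text, kept for provenance
    source_language: str   # language code of the original, e.g. "de"
    semantic_ready: bool = False  # semantic artifact has been written
    lexical_ready: bool = False   # lexical artifact has been written

    @property
    def hybrid_ready(self) -> bool:
        # A chunk is hybrid-ready only when both artifacts exist.
        return self.semantic_ready and self.lexical_ready


def document_hybrid_ready(chunks: List[ChunkRecord]) -> bool:
    # A document is hybrid-ready when it has at least one chunk
    # and every one of its chunks is hybrid-ready.
    return bool(chunks) and all(c.hybrid_ready for c in chunks)
```

Deriving document-level readiness from chunk-level flags keeps the two levels from drifting apart, at the cost of a scan over the chunks; a stored document flag would be the alternative.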

Constraints

  • Backend-only scope.
  • Postgres remains the canonical store for chunk text and provenance.
  • Pinecone metadata should stay lightweight.
  • The ingestion path must remain idempotent and rollback-safe on partial failures.
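The idempotency and lightweight-metadata constraints could be approached with deterministic chunk IDs and a metadata payload that omits chunk text. This is a minimal sketch under assumptions: the ID scheme, `chunk_id`, `lightweight_metadata`, and the metadata keys are hypothetical, not the repo's actual format.

```python
import hashlib
from typing import Dict, Union


def chunk_id(document_id: str, chunk_index: int, canonical_text: str) -> str:
    # Hypothetical ID scheme: the same document, position, and content
    # always produce the same ID, so a rerun overwrites instead of duplicating.
    digest = hashlib.sha256(canonical_text.encode("utf-8")).hexdigest()[:16]
    return f"{document_id}:{chunk_index}:{digest}"


def lightweight_metadata(
    document_id: str, chunk_index: int, source_language: str
) -> Dict[str, Union[str, int]]:
    # Only small identifying fields go to the vector store; chunk text and
    # full provenance stay in Postgres and are looked up by ID.
    return {
        "document_id": document_id,
        "chunk_index": chunk_index,
        "source_language": source_language,
    }
```

Including a content hash in the ID means an edited chunk gets a new ID, so stale entries would need separate cleanup; dropping the hash would make reruns overwrite in place instead.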

Proposed approach

  1. Canonicalize chunk content into English while preserving original text metadata.
  2. Build Pinecone semantic payloads and Postgres lexical rows from the same chunk set.
  3. Verify payload count alignment before finalizing the ingestion run.
  4. Roll back semantic writes if lexical persistence fails later in the pipeline.
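The four steps above can be sketched end to end with in-memory stand-ins. Everything named here is an assumption for illustration: `run_ingestion`, `AlignmentError`, the chunk dictionary keys, the injected `translate` callable, and the use of a plain dict in place of Pinecone. The sketch also assumes `write_lexical` is transactional on the Postgres side, so only the semantic writes need an explicit rollback.

```python
from typing import Callable, Dict, List, Optional


class AlignmentError(Exception):
    """Raised when semantic and lexical payload counts differ."""


def run_ingestion(
    chunks: List[Dict[str, str]],
    translate: Callable[[str, str], str],
    semantic_store: Dict[str, dict],
    write_lexical: Callable[[dict], None],
) -> int:
    # Step 1: canonicalize to English, keeping the original text and language.
    canonical = []
    for chunk in chunks:
        lang = chunk["language"]
        text = chunk["text"] if lang == "en" else translate(chunk["text"], lang)
        canonical.append({
            "id": chunk["id"],
            "canonical_text": text,
            "source_text": chunk["text"],
            "source_language": lang,
        })

    # Step 2: build both payload sets from the same chunk set.
    semantic_payloads = [
        {"id": c["id"], "metadata": {"source_language": c["source_language"]}}
        for c in canonical
    ]
    lexical_rows = [dict(c) for c in canonical]

    # Step 3: verify alignment. Trivially true in this sketch because both
    # lists derive from one source; a real builder might drop chunks.
    if len(semantic_payloads) != len(lexical_rows):
        raise AlignmentError("semantic and lexical payload counts differ")

    # Step 4: write semantic first, then lexical; on any failure restore the
    # semantic store to its state before this run and re-raise.
    previous: Dict[str, Optional[dict]] = {}
    try:
        for payload in semantic_payloads:
            previous[payload["id"]] = semantic_store.get(payload["id"])
            semantic_store[payload["id"]] = payload
        for row in lexical_rows:
            write_lexical(row)
    except Exception:
        for cid, old in previous.items():
            if old is None:
                semantic_store.pop(cid, None)
            else:
                semantic_store[cid] = old
        raise
    return len(lexical_rows)
```

Restoring the previous value rather than simply deleting the key matters on reruns: a failed second run should not wipe out the semantic entries left by a successful first run.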

Risks

  • Poor canonicalization, including cases that hit the translation fallback, can reduce retrieval quality.
  • Partial failures can leave hybrid state inconsistent if rollback paths are incomplete.
  • Extra ingestion work can increase latency and operational cost.

Open questions

  • Should canonicalization always target English, or should this be configurable later?
  • Which provider and confidence thresholds justify fallback behavior?
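If both questions were answered with configuration, the shape might look like the sketch below. It does not settle either question: `CanonicalizationConfig`, `select_canonical_text`, the default target language, and the `0.8` threshold are all hypothetical placeholders, and the fallback shown (keeping the source text) is only one possible behavior.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CanonicalizationConfig:
    """Hypothetical settings; neither value is decided in this plan."""

    target_language: str = "en"
    min_confidence: float = 0.8  # placeholder threshold, not a recommendation


def select_canonical_text(
    source_text: str,
    translated_text: str,
    confidence: float,
    config: CanonicalizationConfig,
) -> Tuple[str, bool]:
    """Return (text_to_store, used_fallback)."""
    if confidence >= config.min_confidence:
        return translated_text, False
    # Fallback assumed here: keep the untranslated source text and flag it,
    # so low-confidence chunks can be found and reprocessed later.
    return source_text, True
```

Returning the fallback flag alongside the text would let the ingestion run record which chunks were not canonicalized, which ties back to the retrieval-quality risk noted earlier.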

Acceptance criteria

  • A plan doc exists for #27 under docs/plans/.
  • The doc identifies dual semantic and lexical outputs as required.
  • The doc states provenance preservation and idempotent reruns as constraints.
  • The current migration and test touchpoints are named.

Files likely to change

  • docs/plans/gh-0027-hybrid-ingestion-canonicalization.md
  • app/services/hybrid_ingestion.py
  • app/services/ingestion.py
  • db/migrations/20260301000030_hybrid_ingestion_canonicalization.sql
  • db/bootstrap.sql
  • tests/services/test_hybrid_ingestion.py

Related issue

  • #27 - [Backend][Ingestion] Hybrid index + English canonicalization in ingest endpoint

Status

Backfilled planning stub