ADR-0003: Hybrid Retrieval + Canonical English Pipeline for Multilingual Chat

  • Status: Proposed
  • Date: 2026-03-01
  • Owners: Mido (Architect), Abdou (Lead Review)

Context

The backend must support multilingual prompts while maintaining high retrieval precision and predictable ranking behavior. The current workflow needs an explicit contract at two handoffs: from scraping to ingestion, and from retrieval to the final reasoning response.

Decision

  1. Use reference_id as the canonical handoff contract from persisted scrape data to ingestion.
  2. Ingest each reference into both:
       • a semantic index (Pinecone)
       • a lexical search index (Supabase/Postgres FTS)
  3. Maintain a canonical English representation for retrieval and LLM synthesis while preserving the source/original language as metadata.
  4. Use a weighted hybrid retrieval target of 75% semantic and 25% lexical.
  5. Apply Reciprocal Rank Fusion (RRF) with BM25-aware fusion, followed by user-context reranking, before final synthesis.
  6. Generate the final response with an OpenAI reasoning model, then translate the output back to the user's language and format.
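
The weighted fusion step can be illustrated with a minimal sketch. It assumes each index returns an ordered list of reference_id values, best match first, and applies the 75/25 target as per-list weights on a standard RRF score. The function name weighted_rrf, the smoothing constant k=60, and the sample IDs are illustrative assumptions, not part of this ADR. BM25-aware fusion and user-context reranking are out of scope here.

```python
from collections import defaultdict


def weighted_rrf(semantic_ids, lexical_ids, w_semantic=0.75, w_lexical=0.25, k=60):
    """Fuse two ranked lists of reference_id values with weighted RRF.

    Each list contributes weight / (k + rank) per document, where rank is
    1-based. k=60 is a commonly used default, assumed here.
    """
    scores = defaultdict(float)
    for weight, ranking in ((w_semantic, semantic_ids), (w_lexical, lexical_ids)):
        for rank, ref_id in enumerate(ranking, start=1):
            scores[ref_id] += weight / (k + rank)
    # Sort by descending score; break ties by ID so the order is deterministic.
    return sorted(scores, key=lambda ref_id: (-scores[ref_id], ref_id))


# Hypothetical result lists from the semantic and lexical indexes.
semantic_hits = ["ref-a", "ref-b", "ref-c"]
lexical_hits = ["ref-c", "ref-a", "ref-d"]
fused = weighted_rrf(semantic_hits, lexical_hits)
```

Because RRF works on ranks rather than raw scores, this sketch does not need to normalize similarity scores from the two indexes. The weights only scale how much each list's rank position contributes.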

Consequences

Positive

  • Better multilingual consistency with single canonical retrieval language.
  • Stronger retrieval robustness by combining semantic + lexical signals.
  • Clear observability via reference_id lineage.

Trade-offs

  • Additional indexing/storage overhead.
  • More moving parts in ranking and translation pipeline.
  • Requires strict idempotency controls to prevent duplicate ingestion artifacts.
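
One possible shape for the idempotency control is sketched below, assuming a deduplication key built from reference_id plus a hash of the canonical English text. The names ingestion_key and IngestionLedger are hypothetical, and the in-memory set stands in for whatever persistent uniqueness check (for example, a unique constraint) the real pipeline would use.

```python
import hashlib


def ingestion_key(reference_id: str, canonical_text: str) -> str:
    """Build a deterministic key from reference_id and a content hash."""
    digest = hashlib.sha256(canonical_text.encode("utf-8")).hexdigest()[:16]
    return f"{reference_id}:{digest}"


class IngestionLedger:
    """In-memory stand-in for a persistent record of ingested keys."""

    def __init__(self):
        self._seen = set()

    def should_ingest(self, reference_id: str, canonical_text: str) -> bool:
        """Return True the first time a (reference, content) pair is seen."""
        key = ingestion_key(reference_id, canonical_text)
        if key in self._seen:
            return False  # Same reference and content: skip both index writes.
        self._seen.add(key)
        return True


ledger = IngestionLedger()
first = ledger.should_ingest("ref-1", "hello world")
repeat = ledger.should_ingest("ref-1", "hello world")
changed = ledger.should_ingest("ref-1", "hello world, updated")
```

With this key, a retried ingestion of unchanged content is a no-op, while changed content under the same reference_id is treated as a new version. Whether a new version should replace or coexist with the old index entries is left open here.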

Follow-up

  • Implement through issues #24-#32.
  • Revisit the 75/25 weight tuning once an evaluation dataset is available.