ADR-0003: Hybrid Retrieval + Canonical English Pipeline for Multilingual Chat
- Status: Proposed
- Date: 2026-03-01
- Owners: Mido (Architect), Abdou (Lead Review)
Context
The backend must support multilingual prompts while maintaining high retrieval precision and predictable ranking behavior. The current workflow needs an explicit contract from scraping to ingestion and from retrieval to the final reasoning response.
Decision
- Use reference_id as the canonical handoff contract from persisted scrape data to ingestion.
- Ingest each reference into both:
  - semantic index (Pinecone)
  - lexical search index (Supabase/Postgres FTS)
- Maintain canonical English representation for retrieval/LLM synthesis while preserving source/original language metadata.
- Use a weighted hybrid retrieval target of 75% semantic and 25% lexical.
- Apply RRF (Reciprocal Rank Fusion) plus BM25-aware fusion, then user-context reranking, before final synthesis.
- Generate the final response with an OpenAI reasoning model, then translate the output back to the user's language and format.
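
One possible reading of the 75/25 target combined with RRF is sketched below. This ADR does not say how the weights enter the fusion, so the sketch assumes each list's reciprocal-rank contribution is scaled by its weight. The function name `weighted_rrf`, its parameters, and the smoothing constant `k=60` (a commonly used RRF default) are illustrative assumptions, not part of the decision.

```python
from collections import defaultdict


def weighted_rrf(semantic_ids, lexical_ids, w_semantic=0.75, w_lexical=0.25, k=60):
    """Fuse two ranked lists of reference_id values with weighted RRF.

    Assumption: each list contributes weight / (k + rank) per item,
    where rank starts at 1 for the top hit.
    """
    scores = defaultdict(float)
    for weight, ranked in ((w_semantic, semantic_ids), (w_lexical, lexical_ids)):
        for rank, ref_id in enumerate(ranked, start=1):
            scores[ref_id] += weight / (k + rank)
    # Highest fused score first; ties broken by reference_id so the
    # ordering stays deterministic across runs.
    return sorted(scores, key=lambda ref_id: (-scores[ref_id], ref_id))


# Hypothetical hits from the semantic and lexical indexes.
semantic_hits = ["a", "b", "c"]
lexical_hits = ["c", "a", "d"]
fused = weighted_rrf(semantic_hits, lexical_hits)
print(fused)  # ['a', 'c', 'b', 'd']
```

User-context reranking would then run over the fused list. It is left out here because the ADR does not describe its inputs.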
Consequences
Positive
- Better multilingual consistency with single canonical retrieval language.
- Stronger retrieval robustness by combining semantic + lexical signals.
- Clear observability via reference_id lineage.
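
As a rough illustration of how a single record could carry both the canonical English text and the source-language metadata while staying traceable by reference_id, consider the sketch below. The class name `CanonicalReference`, its fields other than `reference_id`, and the `lineage` helper are hypothetical. The actual schema is not defined in this ADR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanonicalReference:
    """Hypothetical record shape; only reference_id comes from the ADR."""

    reference_id: str      # canonical handoff key from scrape to ingestion
    source_language: str   # assumed language tag, e.g. "fr"
    original_text: str     # text as scraped, in the source language
    canonical_text: str    # English representation used for retrieval

    def lineage(self, stage: str) -> dict:
        # Minimal structured event a pipeline stage could log so the
        # reference can be traced end to end by reference_id.
        return {
            "reference_id": self.reference_id,
            "stage": stage,
            "source_language": self.source_language,
        }


ref = CanonicalReference("ref-42", "fr", "Bonjour", "Hello")
print(ref.lineage("ingest"))
```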
Trade-offs
- Additional indexing/storage overhead.
- More moving parts in ranking and translation pipeline.
- Requires strict idempotency controls to prevent duplicate ingestion artifacts.
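
The idempotency requirement could be met in several ways. The sketch below shows one of them: writes are keyed by reference_id and skipped when the content hash is unchanged. Plain dictionaries stand in for the Pinecone and Postgres indexes, and the class `IdempotentIngestor` and its method names are illustrative assumptions rather than an agreed design.

```python
import hashlib


class IdempotentIngestor:
    """In-memory stand-in for the dual-index ingestion step."""

    def __init__(self):
        self.semantic_index = {}  # reference_id -> text (stand-in for the vector index)
        self.lexical_index = {}   # reference_id -> text (stand-in for the FTS index)
        self._seen = {}           # reference_id -> SHA-256 of last ingested text

    def ingest(self, reference_id: str, canonical_text: str) -> bool:
        """Write to both indexes; return False if nothing changed."""
        digest = hashlib.sha256(canonical_text.encode("utf-8")).hexdigest()
        if self._seen.get(reference_id) == digest:
            return False  # same content already ingested, so skip the write
        # Keyed overwrite (upsert) so a retry never creates a second entry.
        self.semantic_index[reference_id] = canonical_text
        self.lexical_index[reference_id] = canonical_text
        self._seen[reference_id] = digest
        return True


ingestor = IdempotentIngestor()
print(ingestor.ingest("ref-1", "hello"))     # True
print(ingestor.ingest("ref-1", "hello"))     # False
print(ingestor.ingest("ref-1", "hello v2"))  # True
```

In a real deployment the two writes would not be atomic across stores, so a partial failure between them would still need a retry or reconciliation step.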
Follow-up
- Implement through issues #24-#32.
- Revisit weight tuning after evaluation dataset is available.