ADR-0003: Hybrid Retrieval + Canonical English Pipeline for Multilingual Chat

Status: Proposed
Date: 2026-03-01
Owners: Mido (Architect), Abdou (Lead Review)

Context

The backend must support multilingual prompts while maintaining high retrieval precision and predictable ranking behavior. The current workflow needs an explicit contract from scraping to ingestion, and from retrieval to the final reasoning response.

Decision

Use reference_id as the canonical handoff contract from persisted scrape data to ingestion.
Ingest each reference into both a semantic index (Pinecone) and a lexical search index (Supabase/Postgres FTS).
Maintain a canonical English representation of each reference for retrieval and LLM synthesis, while preserving source/original language metadata.
Use a weighted hybrid retrieval target of 75% semantic and 25% lexical.
Apply reciprocal rank fusion (RRF) with BM25-aware lexical scoring, then user-context reranking, before final synthesis.
Generate the final response with an OpenAI reasoning model, then translate the output back to the user's language and format.
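
One way the 75/25 target could combine with RRF is to scale each retriever's reciprocal-rank contribution by its weight. The sketch below is illustrative only: the function name `weighted_rrf`, the constant `k=60` (a commonly used RRF default), and the sample reference ids are assumptions, not part of this decision. It expects each retriever to return reference_id values already ordered best-first, and it does not cover the user-context reranking step.

```python
from collections import defaultdict


def weighted_rrf(semantic_ids, lexical_ids, w_semantic=0.75, w_lexical=0.25, k=60):
    """Fuse two best-first rankings of reference_id values.

    Each list contributes weight / (k + rank) per item, with rank starting at 1.
    Items missing from a list simply receive no contribution from it.
    """
    scores = defaultdict(float)
    for weight, ranking in ((w_semantic, semantic_ids), (w_lexical, lexical_ids)):
        for rank, reference_id in enumerate(ranking, start=1):
            scores[reference_id] += weight / (k + rank)
    # Sort by fused score, highest first; break ties by id for stable output.
    return sorted(scores, key=lambda rid: (-scores[rid], rid))


# Hypothetical result lists from the semantic and lexical indexes.
semantic_hits = ["ref-a", "ref-b", "ref-c"]
lexical_hits = ["ref-c", "ref-a", "ref-d"]
fused = weighted_rrf(semantic_hits, lexical_hits)
print(fused)  # ['ref-a', 'ref-c', 'ref-b', 'ref-d']
```

Because RRF works on ranks rather than raw scores, this form avoids having to normalize vector similarity against lexical scores; the weights only shift how much each ranking counts.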

Consequences

Positive

Better multilingual consistency with a single canonical retrieval language.
Stronger retrieval robustness by combining semantic + lexical signals.
Clear observability via reference_id lineage.

Trade-offs

Additional indexing/storage overhead.
More moving parts in ranking and translation pipeline.
Requires strict idempotency controls to prevent duplicate ingestion artifacts.
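
As a rough illustration of the idempotency control mentioned above, ingestion could key every write on reference_id and skip the write when the stored content hash is unchanged. Everything in this sketch is hypothetical: `InMemoryIndex` stands in for the Pinecone and Postgres clients, and `ingest`, the payload fields, and the sample ids are invented for the example.

```python
import hashlib


class InMemoryIndex:
    """Hypothetical stand-in for an index client that upserts by id."""

    def __init__(self):
        self.rows = {}
        self.writes = 0

    def upsert(self, doc_id, payload):
        self.rows[doc_id] = payload
        self.writes += 1


def ingest(reference_id, canonical_text, source_language, indexes):
    """Write one reference to every index; return how many writes happened.

    Re-running with identical content performs no writes, and changed
    content overwrites the existing row instead of creating a duplicate.
    """
    content_hash = hashlib.sha256(canonical_text.encode("utf-8")).hexdigest()
    payload = {
        "reference_id": reference_id,
        "canonical_text": canonical_text,
        "source_language": source_language,
        "content_hash": content_hash,
    }
    written = 0
    for index in indexes:
        existing = index.rows.get(reference_id)
        if existing is not None and existing["content_hash"] == content_hash:
            continue  # Already ingested with identical content.
        index.upsert(reference_id, payload)
        written += 1
    return written


semantic_index, lexical_index = InMemoryIndex(), InMemoryIndex()
indexes = [semantic_index, lexical_index]
print(ingest("ref-001", "Opening hours are 9 to 5.", "fr", indexes))  # 2
print(ingest("ref-001", "Opening hours are 9 to 5.", "fr", indexes))  # 0
```

Checking each index independently means a retry after a partial failure fills in only the index that missed the write.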

Follow-up

Implement through issues #24-#32.
Revisit the 75/25 semantic/lexical weighting once an evaluation dataset is available.