Hybrid Retrieval Fusion (Issue #29)
Summary
The retrieval pipeline now builds two candidate sets for each query:
- Semantic candidates from Pinecone (dense vectors)
- Lexical candidates from Supabase/Postgres full-text search (search_chunks_lexical RPC)
Candidates are blended with a 75/25 semantic/lexical target and then ordered with weighted Reciprocal Rank Fusion (RRF).
Scoring model
1) Weighted blend score (observability + confidence)
For each candidate document d:
semantic_norm(d) = semantic_raw(d) / max_semantic_raw
lexical_norm(d) = lexical_raw(d) / max_lexical_raw
Then:
weighted_blend_score(d) = 0.75 * semantic_norm(d) + 0.25 * lexical_norm(d)
This is surfaced as result.score for downstream compatibility.
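The blend above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: the function names are hypothetical, and two behaviors are assumptions not stated here, namely that a candidate missing from one list contributes 0 for that term and that a non-positive list maximum normalizes to 0.

```python
from typing import Dict

SEMANTIC_WEIGHT = 0.75
LEXICAL_WEIGHT = 0.25


def normalize(raw: Dict[str, float]) -> Dict[str, float]:
    """Divide each raw score by the list maximum.

    Assumption: an empty list or a non-positive maximum maps every score to 0.0.
    """
    top = max(raw.values(), default=0.0)
    if top <= 0.0:
        return {doc_id: 0.0 for doc_id in raw}
    return {doc_id: score / top for doc_id, score in raw.items()}


def weighted_blend_scores(
    semantic_raw: Dict[str, float], lexical_raw: Dict[str, float]
) -> Dict[str, float]:
    """Return 0.75 * semantic_norm + 0.25 * lexical_norm per candidate id."""
    semantic_norm = normalize(semantic_raw)
    lexical_norm = normalize(lexical_raw)
    blended = {}
    for doc_id in set(semantic_norm) | set(lexical_norm):
        # Assumption: a candidate absent from one list scores 0.0 for that term.
        blended[doc_id] = (
            SEMANTIC_WEIGHT * semantic_norm.get(doc_id, 0.0)
            + LEXICAL_WEIGHT * lexical_norm.get(doc_id, 0.0)
        )
    return blended
```

For example, a candidate that tops the semantic list but is absent from the lexical list gets a blend score of 0.75 under these assumptions.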
2) Weighted RRF score (final ordering)
With k = 60, ranks starting at 1:
rrf_score(d) = 0.75 / (k + rank_semantic(d)) + 0.25 / (k + rank_lexical(d))
If a candidate appears in only one list, the missing term is 0.
Final ordering sorts by:
- rrf_score (descending)
- weighted_blend_score (descending)
- raw semantic/lexical scores (descending tie-break)
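A rough sketch of the weighted RRF formula and the sort order follows. The function names are hypothetical, candidate ids are assumed unique within each list, and the final tie-break is assumed to compare the raw semantic score before the raw lexical score, since the exact combination is not specified here.

```python
from typing import Dict, List

RRF_K = 60
SEMANTIC_WEIGHT = 0.75
LEXICAL_WEIGHT = 0.25


def weighted_rrf(semantic_ids: List[str], lexical_ids: List[str]) -> Dict[str, float]:
    """Weighted RRF over two ranked id lists (best first, ranks start at 1)."""
    sem_rank = {doc_id: rank for rank, doc_id in enumerate(semantic_ids, start=1)}
    lex_rank = {doc_id: rank for rank, doc_id in enumerate(lexical_ids, start=1)}
    scores = {}
    for doc_id in set(sem_rank) | set(lex_rank):
        score = 0.0
        # A candidate missing from a list contributes 0 for that term.
        if doc_id in sem_rank:
            score += SEMANTIC_WEIGHT / (RRF_K + sem_rank[doc_id])
        if doc_id in lex_rank:
            score += LEXICAL_WEIGHT / (RRF_K + lex_rank[doc_id])
        scores[doc_id] = score
    return scores


def order_candidates(
    rrf: Dict[str, float],
    blend: Dict[str, float],
    semantic_raw: Dict[str, float],
    lexical_raw: Dict[str, float],
) -> List[str]:
    """Sort ids by rrf_score, then blend score, then raw scores, all descending."""
    return sorted(
        rrf,
        key=lambda d: (
            rrf[d],
            blend.get(d, 0.0),
            # Assumption: semantic raw score is compared before lexical raw score.
            semantic_raw.get(d, 0.0),
            lexical_raw.get(d, 0.0),
        ),
        reverse=True,
    )
```

With k = 60 the rank terms are small and close together, so a candidate present in both lists usually outranks one that leads only a single list.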
Retrieval traceability logs
Each retrieval call logs:
- candidate counts (semantic, lexical, fused) and fusion-stage latency (fusion_latency_ms)
- top raw semantic and lexical scores
- top fused outputs with id, blend score, rrf score, and source contributions
Log markers:
- HYBRID_RETRIEVAL candidates ...
- HYBRID_RETRIEVAL semantic_scores=... lexical_scores=...
- HYBRID_RETRIEVAL fusion_top=...
Each returned chunk also includes:
- metadata.retrieval_sources (["semantic"], ["lexical"], or both)
- metadata.hybrid_fusion with rank/score breakdown and RRF components
Latency regression gate (CI)
To keep hybrid fusion release-stable, we enforce a deterministic p95 latency gate in tests/services/test_retrieval_pipeline.py:
- Test: test_hybrid_fusion_latency_regression_gate_p95_within_budget
- Workload: 100 semantic candidates + 80 lexical candidates (with overlap), top_k=25
- Iterations: RetrievalPipeline.HYBRID_FUSION_LATENCY_BENCHMARK_ITERATIONS (60)
- Threshold: RetrievalPipeline.HYBRID_FUSION_LATENCY_BUDGET_MS (20.0 ms p95)
Measurement method:
- Run warm-up fusion passes to reduce interpreter cold-start jitter.
- Measure each _fuse_hybrid_candidates(...) call using perf_counter().
- Compute p95 over the sampled durations and fail CI if p95 exceeds the budget.
This gate is intentionally scoped to the fusion stage (not network I/O) so regressions in ranking complexity are caught reliably in unit-test conditions.
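The measurement method can be approximated with the standalone sketch below. It is not the test in tests/services/test_retrieval_pipeline.py: the helper names, the warm-up count of 5, and the nearest-rank percentile definition are all assumptions made for illustration.

```python
from time import perf_counter
from typing import Callable, List


def p95(samples: List[float]) -> float:
    """Nearest-rank 95th percentile (assumed definition; the real test may differ)."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer form of ceil(0.95 * n) - 1, avoiding float rounding.
    index = (95 * len(ordered) + 99) // 100 - 1
    return ordered[index]


def measure_p95_ms(
    fn: Callable[[], object], iterations: int = 60, warmup: int = 5
) -> float:
    """Time fn() repeatedly and return the p95 duration in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up passes are not recorded
    durations_ms = []
    for _ in range(iterations):
        start = perf_counter()
        fn()
        durations_ms.append((perf_counter() - start) * 1000.0)
    return p95(durations_ms)
```

A gate built on this would wrap the fusion call in a zero-argument callable and assert that measure_p95_ms(...) stays at or below the 20.0 ms budget.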
DB function
Migration: db/migrations/20260301000031_hybrid_retrieval_lexical_rpc.sql
Adds RPC function:
search_chunks_lexical(p_query_text, p_match_count, p_grade, p_subject, p_language)
Returns per-hit:
- chunk_id
- bm25_score (ts_rank_cd-based lexical score)
- metadata payload (grade, subject, source URL, page, canonicalization/BM25 fields, preview)
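As a hedged illustration of calling the RPC from Python, the sketch below assumes a supabase-py style client whose rpc(name, params).execute() returns a response with a data attribute. The wrapper and helper names are hypothetical, and passing None for an unused filter is an assumption about how the SQL function treats NULL arguments.

```python
from typing import Any, Dict, List, Optional


def build_lexical_params(
    query_text: str,
    match_count: int,
    grade: Optional[str] = None,
    subject: Optional[str] = None,
    language: Optional[str] = None,
) -> Dict[str, Any]:
    """Map Python arguments onto the RPC's named parameters."""
    return {
        "p_query_text": query_text,
        "p_match_count": match_count,
        # Assumption: None (SQL NULL) means "do not filter on this field".
        "p_grade": grade,
        "p_subject": subject,
        "p_language": language,
    }


def search_chunks_lexical(
    client: Any, query_text: str, match_count: int = 25, **filters: Optional[str]
) -> List[Dict[str, Any]]:
    """Call the search_chunks_lexical RPC and return its rows (possibly empty)."""
    params = build_lexical_params(query_text, match_count, **filters)
    response = client.rpc("search_chunks_lexical", params).execute()
    return response.data or []
```

Each returned row is expected to carry chunk_id, bm25_score, and the metadata payload listed above.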