Hybrid Retrieval Fusion (Issue #29)

Summary

The retrieval pipeline now builds two candidate sets for each query:

Semantic candidates from Pinecone (dense vectors)
Lexical candidates from Supabase/Postgres full-text search (search_chunks_lexical RPC)

Candidates are blended with a 75/25 semantic/lexical target and then ordered with weighted Reciprocal Rank Fusion (RRF).

Scoring model

1) Weighted blend score (observability + confidence)

For each candidate document d:

semantic_norm(d) = semantic_raw(d) / max_semantic_raw
lexical_norm(d) = lexical_raw(d) / max_lexical_raw

Then:

weighted_blend_score(d) = 0.75 * semantic_norm(d) + 0.25 * lexical_norm(d)

This is surfaced as result.score for downstream compatibility.
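
The normalization and blend above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: the function names are hypothetical, and two behaviors are assumptions the text does not state (a candidate missing from one list contributes 0 to the blend, and a non-positive list maximum normalizes to 0 rather than dividing by zero).

```python
from typing import Dict

SEMANTIC_WEIGHT = 0.75
LEXICAL_WEIGHT = 0.25


def normalize_by_max(raw: Dict[str, float]) -> Dict[str, float]:
    """Divide every raw score by the largest raw score in the list."""
    top = max(raw.values(), default=0.0)
    if top <= 0.0:
        # Assumption: avoid division by zero by treating all scores as 0.
        return {doc_id: 0.0 for doc_id in raw}
    return {doc_id: score / top for doc_id, score in raw.items()}


def weighted_blend_scores(
    semantic_raw: Dict[str, float], lexical_raw: Dict[str, float]
) -> Dict[str, float]:
    """Return 0.75 * semantic_norm + 0.25 * lexical_norm per candidate id."""
    semantic_norm = normalize_by_max(semantic_raw)
    lexical_norm = normalize_by_max(lexical_raw)
    return {
        # Assumption: a candidate absent from one list gets 0 for that term.
        doc_id: SEMANTIC_WEIGHT * semantic_norm.get(doc_id, 0.0)
        + LEXICAL_WEIGHT * lexical_norm.get(doc_id, 0.0)
        for doc_id in semantic_norm.keys() | lexical_norm.keys()
    }
```

Because each list is divided by its own maximum, the top semantic hit always contributes 0.75 and the top lexical hit always contributes 0.25, regardless of the raw score scales.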

2) Weighted RRF score (final ordering)

With k = 60, ranks starting at 1:

rrf_score(d) = 0.75 / (k + rank_semantic(d)) + 0.25 / (k + rank_lexical(d))

If a candidate appears in only one list, the missing term is 0.

Final ordering sorts by the following keys, in priority order:

rrf_score (descending)
weighted_blend_score (descending)
raw semantic/lexical scores (descending tie-break)
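
As a rough sketch of the RRF formula and the sort keys above, assuming ranks are derived by sorting each list on its raw score. The function names are hypothetical, and comparing the raw semantic score before the raw lexical score in the last tie-break is an assumption; the text only says both are used.

```python
from typing import Dict, List

RRF_K = 60


def ranks_from_scores(raw: Dict[str, float]) -> Dict[str, int]:
    """Assign 1-based ranks, best raw score first."""
    ordered = sorted(raw, key=lambda doc_id: raw[doc_id], reverse=True)
    return {doc_id: rank for rank, doc_id in enumerate(ordered, start=1)}


def weighted_rrf_scores(
    semantic_raw: Dict[str, float], lexical_raw: Dict[str, float], k: int = RRF_K
) -> Dict[str, float]:
    """Compute 0.75 / (k + rank_semantic) + 0.25 / (k + rank_lexical)."""
    semantic_rank = ranks_from_scores(semantic_raw)
    lexical_rank = ranks_from_scores(lexical_raw)
    scores: Dict[str, float] = {}
    for doc_id in semantic_rank.keys() | lexical_rank.keys():
        # A candidate missing from a list contributes 0 for that term.
        sem = 0.75 / (k + semantic_rank[doc_id]) if doc_id in semantic_rank else 0.0
        lex = 0.25 / (k + lexical_rank[doc_id]) if doc_id in lexical_rank else 0.0
        scores[doc_id] = sem + lex
    return scores


def order_candidates(
    rrf: Dict[str, float],
    blend: Dict[str, float],
    semantic_raw: Dict[str, float],
    lexical_raw: Dict[str, float],
) -> List[str]:
    """Sort ids by rrf score, then blend score, then raw scores (all descending)."""
    return sorted(
        rrf,
        key=lambda doc_id: (
            -rrf[doc_id],
            -blend.get(doc_id, 0.0),
            -semantic_raw.get(doc_id, 0.0),  # assumed order of raw tie-breaks
            -lexical_raw.get(doc_id, 0.0),
        ),
    )
```

With k = 60 the denominators differ only slightly between adjacent ranks, so a candidate present in both lists will usually outrank one that appears in a single list.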

Retrieval traceability logs

Each retrieval call logs:

candidate counts (semantic, lexical, fused) and fusion-stage latency (fusion_latency_ms)
top raw semantic and lexical scores
top fused outputs with id, blend score, rrf score, and source contributions

Log markers:

HYBRID_RETRIEVAL candidates ...
HYBRID_RETRIEVAL semantic_scores=... lexical_scores=...
HYBRID_RETRIEVAL fusion_top=...

Each returned chunk also includes:

metadata.retrieval_sources (["semantic"], ["lexical"], or both)
metadata.hybrid_fusion with rank/score breakdown and RRF components

Latency regression gate (CI)

To keep hybrid fusion release-stable, we enforce a p95 latency gate over a fixed, deterministic workload in tests/services/test_retrieval_pipeline.py:

Test: test_hybrid_fusion_latency_regression_gate_p95_within_budget
Workload: 100 semantic candidates + 80 lexical candidates (with overlap), top_k=25
Iterations: RetrievalPipeline.HYBRID_FUSION_LATENCY_BENCHMARK_ITERATIONS (60)
Threshold: RetrievalPipeline.HYBRID_FUSION_LATENCY_BUDGET_MS (20.0 ms p95)

Measurement method:

Run warm-up fusion passes to reduce interpreter cold-start jitter.
Measure each _fuse_hybrid_candidates(...) call using perf_counter().
Compute p95 over the sampled durations and fail CI if p95 exceeds the budget.

This gate is intentionally scoped to the fusion stage (not network I/O) so regressions in ranking complexity are caught reliably in unit-test conditions.
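
The measurement method can be approximated as follows. This is a sketch under assumptions, not the code in the test: the helper names, the warm-up count, and the nearest-rank percentile convention are all hypothetical choices, since the text does not say how p95 is computed.

```python
from time import perf_counter
from typing import Callable, List


def p95(samples: List[float]) -> float:
    """Nearest-rank 95th percentile (assumed convention)."""
    if not samples:
        raise ValueError("p95 requires at least one sample")
    ordered = sorted(samples)
    # Integer form of ceil(0.95 * n) to avoid float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def measure_p95_ms(
    fn: Callable[[], object], iterations: int = 60, warmup: int = 5
) -> float:
    """Time fn() with perf_counter and return the p95 duration in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up passes are not recorded
    durations_ms: List[float] = []
    for _ in range(iterations):
        start = perf_counter()
        fn()
        durations_ms.append((perf_counter() - start) * 1000.0)
    return p95(durations_ms)
```

A gate built on this would call measure_p95_ms with a closure around the fusion call on the fixed workload and assert that the result stays at or below the 20.0 ms budget.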

DB function

Migration: db/migrations/20260301000031_hybrid_retrieval_lexical_rpc.sql

Adds RPC function:

search_chunks_lexical(p_query_text, p_match_count, p_grade, p_subject, p_language)

Returns per-hit:

chunk_id
bm25_score (lexical score computed with ts_rank_cd, i.e. Postgres cover-density ranking rather than true BM25)
metadata payload (grade, subject, source URL, page, canonicalization/BM25 fields, preview)
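
For illustration, the RPC could be called from Python roughly like this, assuming a supabase-py style client whose rpc(name, params).execute() returns a response with a data attribute. The wrapper name, the default match count, and passing None for unused filters are assumptions; whether the SQL function accepts NULL filters is not stated here.

```python
from typing import Any, Dict, List, Optional


def fetch_lexical_candidates(
    client: Any,
    query_text: str,
    match_count: int = 80,  # hypothetical default
    grade: Optional[str] = None,
    subject: Optional[str] = None,
    language: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Call the search_chunks_lexical RPC and return its rows as a list."""
    params = {
        "p_query_text": query_text,
        "p_match_count": match_count,
        # Assumption: None is sent as NULL and means "no filter".
        "p_grade": grade,
        "p_subject": subject,
        "p_language": language,
    }
    response = client.rpc("search_chunks_lexical", params).execute()
    return list(response.data or [])
```

Each returned row is expected to carry chunk_id, bm25_score, and the metadata payload listed above.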