Backend Blueprint v1 — Scrape → Ingest → Hybrid Retrieval → Multilingual Chat
Scope
Backend only (app/, db/, tests/, docs/).
Admin dashboard is intentionally excluded for this release cycle.
North Star
Reliable curriculum pipeline that:
1. Scrapes curriculum sources robustly.
2. Stores complete metadata + content in Supabase.
3. Ingests by reference_id into both semantic and lexical indexes.
4. Serves multilingual chat with weighted hybrid retrieval and grounded reasoning.
System Flow
A) Scraping & Metadata Capture
- Trigger scrape job per source/curriculum.
- For each discovered curriculum page/document, capture:
- source_url, source_name, curriculum_key, grade, subject, language
- content hash/version, timestamps, scrape run id, extraction status
- traceability fields (parent_url, section, page index, parser version)
- Persist records in Supabase (references, scrape_runs, supporting metadata tables).
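The captured fields can be sketched as a single record type. This is a minimal illustration, not the project's schema: the class name ScrapedReference, the helpers hash_content and to_row, the default values, and the choice of SHA-256 for the content hash are all assumptions.

```python
import hashlib
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


def hash_content(text: str) -> str:
    """SHA-256 of the extracted text (the hash algorithm is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ScrapedReference:
    # Classification fields listed in the capture step.
    source_url: str
    source_name: str
    curriculum_key: str
    grade: str
    subject: str
    language: str
    content: str
    scrape_run_id: str
    # Traceability fields; defaults here are hypothetical.
    parent_url: Optional[str] = None
    section: Optional[str] = None
    page_index: Optional[int] = None
    parser_version: str = "unknown"
    extraction_status: str = "ok"
    scraped_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def to_row(self) -> dict:
        """Flatten to a JSON-friendly dict suitable for a table insert."""
        row = asdict(self)
        row["scraped_at"] = self.scraped_at.isoformat()
        row["content_hash"] = hash_content(self.content)
        return row
```

Deriving the hash from the content at write time, rather than accepting it as input, keeps the stored hash from drifting away from the stored text.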
B) Orchestration to Ingestion
- Save record first.
- Use saved reference_id as the only ingestion handoff contract.
- Enqueue ingestion job with retry-safe lifecycle: queued -> running -> completed|failed.
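The job lifecycle above can be enforced with a small transition table. The sketch below assumes that a failed job may re-enter the queue for a retry (failed -> queued); that edge is not stated in the lifecycle itself, and the names JobState, ALLOWED, and transition are hypothetical.

```python
from enum import Enum


class JobState(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


# Allowed transitions. FAILED -> QUEUED is an assumption that models a retry.
ALLOWED = {
    JobState.QUEUED: {JobState.RUNNING},
    JobState.RUNNING: {JobState.COMPLETED, JobState.FAILED},
    JobState.FAILED: {JobState.QUEUED},
    JobState.COMPLETED: set(),
}


def transition(current: JobState, target: JobState) -> JobState:
    """Return the new state, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Rejecting illegal moves (for example completed -> running) is what makes a duplicate delivery of the same job harmless.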
C) Ingestion Endpoint (Hybrid + Canonical English)
- Load reference by reference_id.
- Normalize/standardize content to canonical English while preserving original text metadata.
- Write semantic artifacts (embeddings/Pinecone).
- Write lexical artifacts (FTS-ready text in Postgres/Supabase).
- Guarantee idempotency on repeated reference_id ingestion.
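One way to get idempotent ingestion is to record the hash of the last ingested text per reference_id and skip the write when it has not changed. The sketch below uses in-memory dictionaries as stand-ins for the semantic store (Pinecone) and the lexical store (Postgres/Supabase); InMemoryIndexes, ingest, and the embed callable are hypothetical names, and no real client API is shown.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class InMemoryIndexes:
    """Stand-in for the semantic and lexical stores (assumption, not a real client)."""
    semantic: Dict[str, List[float]] = field(default_factory=dict)
    lexical: Dict[str, str] = field(default_factory=dict)
    ingested_hash: Dict[str, str] = field(default_factory=dict)


def ingest(
    reference_id: str,
    canonical_text: str,
    indexes: InMemoryIndexes,
    embed: Callable[[str], List[float]],
) -> str:
    """Write both artifacts once per (reference_id, content hash) pair."""
    digest = hashlib.sha256(canonical_text.encode("utf-8")).hexdigest()
    if indexes.ingested_hash.get(reference_id) == digest:
        return "skipped"  # same content already ingested; nothing to do
    # Upserts keyed by reference_id: a re-ingest overwrites instead of duplicating.
    indexes.semantic[reference_id] = embed(canonical_text)
    indexes.lexical[reference_id] = canonical_text.lower()
    # Record the hash last, so a crash before this line leads to a clean retry.
    indexes.ingested_hash[reference_id] = digest
    return "ingested"
```

Because both writes are upserts on the same key, a retry after a partial failure converges to the same final state.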
D) Chat Request Pipeline
- Detect user prompt language.
- If not English, translate prompt to English canonical query.
- Run hybrid retrieval:
- semantic retrieval target weight: 75%
- lexical retrieval target weight: 25%
- Combine ranks/signals using Reciprocal Rank Fusion (RRF) + BM25-aware scoring.
- Pull user profile context (grade/subject/tier) for reranking.
- Use OpenAI reasoning model for grounded answer synthesis.
- Translate final answer back to original prompt language/style.
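The request pipeline above can be sketched as a function over injected dependencies, which keeps language detection, translation, retrieval, and synthesis swappable in tests. Every name here (ChatDeps, answer, the callable signatures, the "en" language code) is an assumption; no OpenAI or translation API call is shown.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class ChatDeps:
    """Injected collaborators; all signatures are hypothetical."""
    detect_language: Callable[[str], str]
    translate: Callable[[str, str, str], str]  # (text, source_lang, target_lang)
    retrieve: Callable[[str, dict], Sequence[str]]  # (english_query, user_profile)
    synthesize: Callable[[str, Sequence[str]], str]  # (english_query, passages)


def answer(prompt: str, profile: dict, deps: ChatDeps) -> str:
    lang = deps.detect_language(prompt)
    # Canonical English query: translate only when the prompt is not English.
    query = prompt if lang == "en" else deps.translate(prompt, lang, "en")
    # Hybrid retrieval and profile-aware reranking sit behind `retrieve`.
    passages = deps.retrieve(query, profile)
    english_answer = deps.synthesize(query, passages)
    # Return the answer in the language the user wrote in.
    if lang == "en":
        return english_answer
    return deps.translate(english_answer, "en", lang)
```

With stub callables in place of the real services, the multilingual roundtrip can be exercised without network access.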
Ranking Strategy (Initial)
- Retrieval candidates:
- semantic top-k from Pinecone
- lexical top-k from Postgres FTS/BM25
- Fusion:
- RRF across ranked lists
- weighted score blend to preserve 75/25 behavior
- Rerank features:
- user grade alignment
- subject match
- source quality/recency
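One possible reading of "RRF plus a 75/25 blend" is weighted Reciprocal Rank Fusion, where each list contributes weight / (k + rank) per document. The sketch below assumes that interpretation, the commonly used constant k = 60, and simple multiplicative boosts for grade and subject match; the function names and boost sizes are hypothetical, and source quality/recency is left out.

```python
from typing import Dict, List, Sequence


def weighted_rrf(
    semantic_ids: Sequence[str],
    lexical_ids: Sequence[str],
    w_semantic: float = 0.75,
    w_lexical: float = 0.25,
    k: int = 60,
) -> Dict[str, float]:
    """Fuse two ranked lists: score(d) = sum over lists of weight / (k + rank)."""
    scores: Dict[str, float] = {}
    for weight, ranked in ((w_semantic, semantic_ids), (w_lexical, lexical_ids)):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return scores


def rerank(
    scores: Dict[str, float],
    metadata: Dict[str, dict],
    profile: dict,
    grade_boost: float = 0.1,
    subject_boost: float = 0.1,
) -> List[str]:
    """Order doc ids by fused score, boosted when grade or subject matches the user."""
    def adjusted(doc_id: str) -> float:
        meta = metadata.get(doc_id, {})
        factor = 1.0
        if "grade" in profile and meta.get("grade") == profile["grade"]:
            factor += grade_boost
        if "subject" in profile and meta.get("subject") == profile["subject"]:
            factor += subject_boost
        return scores[doc_id] * factor

    return sorted(scores, key=adjusted, reverse=True)
```

Because RRF works on ranks rather than raw scores, the semantic similarity and the lexical score never need to be normalized onto a common scale; the weights alone set the 75/25 balance.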
Reliability & Safety
- Idempotent processing keyed by reference_id + content hash.
- Retry with exponential backoff and capped attempts.
- Dead-letter queue/state for repeated failures.
- Structured logs with request_id, reference_id, scrape_run_id.
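The retry, dead-letter, and logging points above can be combined in one helper. This is a sketch under stated assumptions: the delay schedule (base * 2**n, capped), the "dead_letter" status string, the logger name, and the function names are all hypothetical, and jitter is omitted for brevity.

```python
import json
import logging
import time
from typing import Callable, List

logger = logging.getLogger("pipeline")


def backoff_delays(max_attempts: int, base: float = 1.0, cap: float = 30.0) -> List[float]:
    """Delays between attempts: base * 2**n, capped. One fewer delay than attempts."""
    return [min(cap, base * 2 ** n) for n in range(max_attempts - 1)]


def run_with_retry(
    task: Callable[[], object],
    context: dict,
    max_attempts: int = 5,
    base: float = 1.0,
    cap: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
):
    """Run task with capped retries; return (status, result)."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delays = backoff_delays(max_attempts, base, cap)
    for attempt in range(1, max_attempts + 1):
        try:
            return "completed", task()
        except Exception as exc:
            # One JSON object per line; context carries request_id,
            # reference_id, and scrape_run_id.
            logger.warning(json.dumps({**context, "attempt": attempt, "error": str(exc)}))
            if attempt == max_attempts:
                return "dead_letter", None
            sleep(delays[attempt - 1])
```

A call might pass context={"request_id": "...", "reference_id": "...", "scrape_run_id": "..."} so that every failure line can be joined back to its scrape run. Injecting sleep makes the backoff schedule testable without waiting.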
Verify Gate (Release)
- End-to-end smoke path: scrape sample curriculum -> ingest -> query chat.
- Metrics:
- scrape success rate
- ingestion latency/success
- retrieval latency and hit quality
- translation fallback frequency
- Regression tests for multilingual roundtrip and ranking behavior.
Issue Mapping
- Plan: #23
- Scrape/Persist/Orchestrate: #24, #25, #26
- Ingest hybrid + canonicalization: #27
- Chat language + hybrid retrieval + rerank + synthesis: #28, #29, #30, #31
- Verification: #32