Backend Blueprint v1 — Scrape → Ingest → Hybrid Retrieval → Multilingual Chat

Scope

Backend only (app/, db/, tests/, docs/). Admin dashboard is intentionally excluded for this release cycle.

North Star

Reliable curriculum pipeline that:
1. Scrapes curriculum sources robustly.
2. Stores complete metadata + content in Supabase.
3. Ingests by reference_id into both semantic and lexical indexes.
4. Serves multilingual chat with weighted hybrid retrieval and grounded reasoning.

System Flow

A) Scraping & Metadata Capture

Trigger scrape job per source/curriculum.
For each discovered curriculum page/document, capture:
source_url, source_name, curriculum_key, grade, subject, language
content hash/version, timestamps, scrape run id, extraction status
traceability fields (parent_url, section, page index, parser version)
Persist records in Supabase (references, scrape_runs, supporting metadata tables).
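
As an illustration of the capture list above, the record could be modeled roughly as follows. This is a minimal sketch, not the project's schema: the class name ScrapedReference, the helper content_hash, the default values, and the exact column names are assumptions made for the example.

```python
from __future__ import annotations

import hashlib
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def content_hash(text: str) -> str:
    """Stable SHA-256 hex digest of the extracted content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ScrapedReference:
    # Identity and classification fields from the capture list.
    source_url: str
    source_name: str
    curriculum_key: str
    grade: str
    subject: str
    language: str
    content: str
    scrape_run_id: str
    extraction_status: str = "ok"  # hypothetical status value
    # Traceability fields.
    parent_url: str | None = None
    section: str | None = None
    page_index: int | None = None
    parser_version: str = "0.0.0"  # placeholder, not a real version
    scraped_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_row(self) -> dict:
        """Flatten to a dict that could be inserted into a references table."""
        row = asdict(self)
        row["content_hash"] = content_hash(self.content)
        return row
```

Deriving content_hash from the content at write time, rather than accepting it as input, keeps the hash and the stored text from drifting apart.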

B) Orchestration to Ingestion

Save record first.
Use saved reference_id as the only ingestion handoff contract.
Enqueue ingestion job with retry-safe lifecycle: queued -> running -> completed|failed.
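
The lifecycle above can be sketched as a small state machine. The names JobState, ALLOWED, and transition are hypothetical, and the failed -> queued edge is an assumption that models a retry; the blueprint only names the four states.

```python
from enum import Enum


class JobState(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


# Allowed moves. The failed -> queued edge is an assumption that models a retry.
ALLOWED = {
    JobState.QUEUED: {JobState.RUNNING},
    JobState.RUNNING: {JobState.COMPLETED, JobState.FAILED},
    JobState.FAILED: {JobState.QUEUED},
    JobState.COMPLETED: set(),
}


def transition(current: JobState, target: JobState) -> JobState:
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Rejecting illegal moves in one place means a duplicate worker cannot, for example, restart a job that has already completed.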

C) Ingestion Endpoint (Hybrid + Canonical English)

Load reference by reference_id.
Normalize content to canonical English while preserving the original text and its metadata.
Write semantic artifacts (embeddings/Pinecone).
Write lexical artifacts (FTS-ready text in Postgres/Supabase).
Guarantee idempotency on repeated reference_id ingestion.
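
One way to picture the idempotency guarantee is a ledger keyed by reference_id plus content hash. The sketch below uses an in-memory set as a stand-in for a database table with a unique key; IngestionLedger and the two writer callables are hypothetical and do not call Pinecone or Supabase.

```python
from __future__ import annotations

import hashlib
from typing import Callable


class IngestionLedger:
    """In-memory stand-in for a table with a unique (reference_id, content_hash) key."""

    def __init__(self) -> None:
        self._done: set[tuple[str, str]] = set()

    def ingest(
        self,
        reference_id: str,
        content: str,
        write_semantic: Callable[[str, str], None],
        write_lexical: Callable[[str, str], None],
    ) -> bool:
        """Run both writers once per (reference_id, content hash); return True if work ran."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        key = (reference_id, digest)
        if key in self._done:
            return False  # same reference and same content: nothing to do
        write_semantic(reference_id, content)
        write_lexical(reference_id, content)
        # Record the key only after both writes succeed, so a crash leads to a retry.
        self._done.add(key)
        return True
```

Because the key is recorded last, a retry after a partial failure repeats both writes, so the writers themselves should be upserts rather than plain inserts.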

D) Chat Request Pipeline

Detect user prompt language.
If not English, translate prompt to English canonical query.
Run hybrid retrieval:
semantic retrieval target weight: 75%
lexical retrieval target weight: 25%
Combine ranks/signals using Reciprocal Rank Fusion (RRF) + BM25-aware scoring.
Pull user profile context (grade/subject/tier) for reranking.
Use OpenAI reasoning model for grounded answer synthesis.
Translate final answer back to original prompt language/style.
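
The request flow above can be sketched with every external service injected as a callable, which keeps the control flow testable without network access. ChatDeps, answer, and the callable signatures are assumptions for illustration; profile-based reranking is folded into retrieve here for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class ChatDeps:
    detect_language: Callable[[str], str]            # returns a code such as "en" or "es"
    translate: Callable[[str, str, str], str]        # (text, source_lang, target_lang)
    retrieve: Callable[[str], Sequence[str]]         # hybrid retrieval, returns passages
    synthesize: Callable[[str, Sequence[str]], str]  # grounded answer, in English


def answer(prompt: str, deps: ChatDeps) -> str:
    """Detect language, query in English, then answer in the prompt's language."""
    lang = deps.detect_language(prompt)
    query = prompt if lang == "en" else deps.translate(prompt, lang, "en")
    passages = deps.retrieve(query)
    english_answer = deps.synthesize(query, passages)
    if lang == "en":
        return english_answer
    return deps.translate(english_answer, "en", lang)
```

English prompts skip both translation calls entirely, so only non-English traffic pays the extra latency.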

Ranking Strategy (Initial)

Retrieval candidates:
semantic top-k from Pinecone
lexical top-k from Postgres FTS (built-in ts_rank is not BM25; BM25 scoring needs an extension or custom ranking)
Fusion:
RRF across ranked lists
weighted score blend to preserve the 75/25 semantic/lexical split
Rerank features:
user grade alignment
subject match
source quality/recency
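
One way to combine RRF with the 75/25 weighting is to scale each list's reciprocal-rank contribution by its weight. The sketch below assumes that approach; the function name weighted_rrf is hypothetical, k = 60 is the constant commonly used with RRF rather than a value from this blueprint, and the rerank features are not modeled.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Sequence


def weighted_rrf(
    semantic: Sequence[str],
    lexical: Sequence[str],
    w_semantic: float = 0.75,
    w_lexical: float = 0.25,
    k: int = 60,
) -> list[tuple[str, float]]:
    """Fuse two ranked lists of document ids, best first.

    Each document scores sum(weight / (k + rank)) over the lists that contain it,
    with ranks starting at 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for weight, ranked in ((w_semantic, semantic), (w_lexical, lexical)):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += weight / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


fused = weighted_rrf(["a", "b", "c"], ["c", "a", "d"])
print([doc_id for doc_id, _ in fused])  # ['a', 'c', 'b', 'd']
```

In this example "c" tops the lexical list but still ranks below "a", because the semantic list carries three times the weight.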

Reliability & Safety

Idempotent processing keyed by reference_id + content hash.
Retry with exponential backoff and capped attempts.
Dead-letter queue/state for repeated failures.
Structured logs with request_id, reference_id, scrape_run_id.
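
The retry and dead-letter rules above can be sketched as follows. The names backoff_delays and run_with_retry, the default limits, and the "dead_letter" state label are assumptions; the sleep function is injectable so the logic can be exercised without waiting.

```python
from __future__ import annotations

import time
from typing import Callable


def backoff_delays(max_attempts: int, base: float = 1.0, cap: float = 60.0) -> list[float]:
    """Delays between attempts: base, 2*base, 4*base, ... each capped at `cap`."""
    return [min(cap, base * 2 ** i) for i in range(max_attempts - 1)]


def run_with_retry(
    job: Callable[[], None],
    max_attempts: int = 5,
    base: float = 1.0,
    cap: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Run `job` up to max_attempts times; return the terminal state name."""
    delays = backoff_delays(max_attempts, base, cap)
    for attempt in range(max_attempts):
        try:
            job()
            return "completed"
        except Exception:
            if attempt == max_attempts - 1:
                break  # attempts exhausted
            sleep(delays[attempt])
    return "dead_letter"
```

With max_attempts=5, base=1.0, and cap=5.0, the waits between attempts are 1, 2, 4, and 5 seconds. Adding random jitter to each delay is a common refinement that this sketch leaves out.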

Verify Gate (Release)

End-to-end smoke path: scrape sample curriculum -> ingest -> query chat.
Metrics:
scrape success rate
ingestion latency/success
retrieval latency and hit quality
translation fallback frequency
Regression tests for multilingual roundtrip and ranking behavior.
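
A release gate over the metrics above could be expressed as a pure threshold check. This is a sketch only: gate_failures, the metric names, and any threshold values passed to it are hypothetical, since the blueprint does not define targets.

```python
from __future__ import annotations


def gate_failures(
    metrics: dict[str, float],
    minimums: dict[str, float],
    maximums: dict[str, float],
) -> list[str]:
    """Return one message per violated or missing metric; an empty list means pass."""
    failures = []
    for name, floor in minimums.items():
        value = metrics.get(name)
        if value is None or value < floor:
            failures.append(f"{name}: {value} is below minimum {floor}")
    for name, ceiling in maximums.items():
        value = metrics.get(name)
        if value is None or value > ceiling:
            failures.append(f"{name}: {value} is above maximum {ceiling}")
    return failures
```

Treating a missing metric as a failure keeps the gate from passing silently when instrumentation breaks.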

Issue Mapping

Plan: #23
Scrape/Persist/Orchestrate: #24, #25, #26
Ingest hybrid + canonicalization: #27
Chat language + hybrid retrieval + rerank + synthesis: #28, #29, #30, #31
Verification: #32