Backend Blueprint v1 — Scrape → Ingest → Hybrid Retrieval → Multilingual Chat

Scope

Backend only (app/, db/, tests/, docs/). Admin dashboard is intentionally excluded for this release cycle.

North Star

Reliable curriculum pipeline that:

  1. Scrapes curriculum sources robustly.
  2. Stores complete metadata + content in Supabase.
  3. Ingests by reference_id into both semantic and lexical indexes.
  4. Serves multilingual chat with weighted hybrid retrieval and grounded reasoning.

System Flow

A) Scraping & Metadata Capture

  • Trigger scrape job per source/curriculum.
  • For each discovered curriculum page/document, capture:
      • source_url, source_name, curriculum_key, grade, subject, language
      • content hash/version, timestamps, scrape run id, extraction status
      • traceability fields (parent_url, section, page index, parser version)
  • Persist records in Supabase (references, scrape_runs, supporting metadata tables).
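
As a rough illustration of the captured fields, the record below is a minimal sketch, not the actual Supabase schema. The class name ReferenceRecord, the helper compute_content_hash, the "pending" default status, and the choice to normalize whitespace before hashing are all assumptions made for this example.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


def compute_content_hash(text: str) -> str:
    """SHA-256 hex digest of the text after collapsing whitespace (assumed policy)."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ReferenceRecord:
    """Hypothetical in-memory shape of one row in the references table."""

    # Identification fields named in the blueprint.
    source_url: str
    source_name: str
    curriculum_key: str
    grade: str
    subject: str
    language: str
    content: str
    scrape_run_id: str
    # Status value is an assumption; the blueprint only names the field.
    extraction_status: str = "pending"
    # Traceability fields.
    parent_url: Optional[str] = None
    section: Optional[str] = None
    page_index: Optional[int] = None
    parser_version: str = "0.0.0"
    scraped_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    @property
    def content_hash(self) -> str:
        # Derived from content so the stored hash cannot drift from the text.
        return compute_content_hash(self.content)
```

Deriving the hash from the content, rather than storing it as an independent field, is one way to keep the version check honest; a real schema may still persist it as a column for indexing.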

B) Orchestration to Ingestion

  • Save record first.
  • Use saved reference_id as the only ingestion handoff contract.
  • Enqueue ingestion job with retry-safe lifecycle: queued -> running -> completed|failed.
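
The lifecycle above can be sketched as a small state machine. This is a minimal sketch: the names JobState, ALLOWED_TRANSITIONS, and transition are hypothetical, and the failed -> queued edge is an assumption about how a retry re-enters the queue, since the blueprint only lists the forward path.

```python
from enum import Enum


class JobState(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


# Forward path from the blueprint: queued -> running -> completed|failed.
# The failed -> queued edge is an assumption that models a retry.
ALLOWED_TRANSITIONS = {
    JobState.QUEUED: {JobState.RUNNING},
    JobState.RUNNING: {JobState.COMPLETED, JobState.FAILED},
    JobState.FAILED: {JobState.QUEUED},
    JobState.COMPLETED: set(),
}


def transition(current: JobState, target: JobState) -> JobState:
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Rejecting illegal moves (for example completed -> running) is what makes the lifecycle retry-safe: a duplicate worker picking up a finished job fails fast instead of re-running it.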

C) Ingestion Endpoint (Hybrid + Canonical English)

  • Load reference by reference_id.
  • Normalize/standardize content to canonical English while preserving original text metadata.
  • Write semantic artifacts (embeddings/Pinecone).
  • Write lexical artifacts (FTS-ready text in Postgres/Supabase).
  • Guarantee idempotency on repeated reference_id ingestion.
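
One common way to get idempotent re-ingestion is to derive every artifact ID deterministically from reference_id, so a repeated run overwrites rather than duplicates. The sketch below assumes this approach and uses plain dictionaries as stand-ins for the Pinecone index and the Postgres FTS table; chunk_text, chunk_id, and ingest are hypothetical names, and the fixed word-count chunking is only for illustration.

```python
from typing import Dict, List


def chunk_text(text: str, max_words: int = 50) -> List[str]:
    """Split text into fixed-size word windows (illustrative chunking only)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def chunk_id(reference_id: str, index: int) -> str:
    """Deterministic ID: the same reference and position always map to the same key."""
    return f"{reference_id}:{index}"


def ingest(
    reference_id: str,
    canonical_text: str,
    semantic_store: Dict[str, str],
    lexical_store: Dict[str, str],
) -> int:
    """Write chunks to both stores; repeated calls leave the same final state."""
    prefix = f"{reference_id}:"
    # Drop chunks from an earlier version so a shorter document leaves no orphans.
    for store in (semantic_store, lexical_store):
        for key in [k for k in store if k.startswith(prefix)]:
            del store[key]
    chunks = chunk_text(canonical_text)
    for index, chunk in enumerate(chunks):
        key = chunk_id(reference_id, index)
        semantic_store[key] = chunk  # stand-in for an embedding upsert
        lexical_store[key] = chunk   # stand-in for an FTS row upsert
    return len(chunks)
```

A production version would likely also compare the stored content hash first and skip the write entirely when nothing changed.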

D) Chat Request Pipeline

  1. Detect user prompt language.
  2. If not English, translate prompt to English canonical query.
  3. Run hybrid retrieval:
       • semantic retrieval target weight: 75%
       • lexical retrieval target weight: 25%
  4. Combine ranks/signals using RRF + BM25-aware scoring.
  5. Pull user profile context (grade/subject/tier) for reranking.
  6. Use OpenAI reasoning model for grounded answer synthesis.
  7. Translate final answer back to original prompt language/style.
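
The request flow above can be sketched as a thin orchestrator with its collaborators injected. Everything named here (ChatDeps, answer, and the four callables) is hypothetical; the real detection, translation, retrieval, and synthesis services are not specified in this blueprint, and profile-based reranking is assumed to live inside the retrieve callable.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ChatDeps:
    """Injected collaborators; each signature is an assumption for illustration."""

    detect_language: Callable[[str], str]        # returns a code such as "en"
    translate: Callable[[str, str, str], str]    # (text, source_lang, target_lang)
    retrieve: Callable[[str], List[str]]         # hybrid retrieval + rerank
    synthesize: Callable[[str, List[str]], str]  # grounded answer in English


def answer(prompt: str, deps: ChatDeps) -> str:
    """Translate in, retrieve and synthesize in English, translate back out."""
    lang = deps.detect_language(prompt)
    query = prompt if lang == "en" else deps.translate(prompt, lang, "en")
    passages = deps.retrieve(query)
    english_answer = deps.synthesize(query, passages)
    if lang == "en":
        return english_answer
    return deps.translate(english_answer, "en", lang)
```

Injecting the collaborators keeps the multilingual roundtrip testable with fakes, without calling any external model or index.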

Ranking Strategy (Initial)

  • Retrieval candidates:
      • semantic top-k from Pinecone
      • lexical top-k from Postgres FTS/BM25
  • Fusion:
      • RRF across ranked lists
      • weighted score blend to preserve 75/25 behavior
  • Rerank features:
      • user grade alignment
      • subject match
      • source quality/recency
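
One way to combine RRF with the 75/25 target is weighted reciprocal rank fusion, where each list contributes weight / (k + rank) per document. The sketch below assumes that formulation; k = 60 is a commonly used RRF constant rather than a value from this blueprint, and the rerank boost of 0.1 per matching feature is a hypothetical placeholder to be tuned.

```python
from typing import Dict, List, Tuple


def weighted_rrf(
    semantic: List[str],
    lexical: List[str],
    w_semantic: float = 0.75,
    w_lexical: float = 0.25,
    k: int = 60,
) -> List[Tuple[str, float]]:
    """Fuse two ranked ID lists; each contributes weight / (k + rank)."""
    scores: Dict[str, float] = {}
    for weight, ranking in ((w_semantic, semantic), (w_lexical, lexical)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Sort by score descending, then by ID for a stable order on ties.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def rerank(
    fused: List[Tuple[str, float]],
    metadata: Dict[str, Dict[str, str]],
    user_grade: str,
    user_subject: str,
    boost: float = 0.1,  # hypothetical per-feature boost
) -> List[Tuple[str, float]]:
    """Multiply fused scores by a small boost for grade and subject matches."""
    reranked = []
    for doc_id, score in fused:
        meta = metadata.get(doc_id, {})
        multiplier = 1.0
        if meta.get("grade") == user_grade:
            multiplier += boost
        if meta.get("subject") == user_subject:
            multiplier += boost
        reranked.append((doc_id, score * multiplier))
    return sorted(reranked, key=lambda item: (-item[1], item[0]))
```

Because RRF works on ranks rather than raw scores, the BM25 and cosine-similarity scales never need to be normalized against each other; source quality and recency could be added as further multipliers in the same way.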

Reliability & Safety

  • Idempotent processing keyed by reference_id + content hash.
  • Retry with exponential backoff and capped attempts.
  • Dead-letter queue/state for repeated failures.
  • Structured logs with request_id, reference_id, scrape_run_id.
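
The retry and dead-letter behavior can be sketched as below. This is a minimal sketch under assumptions: run_with_retry and its parameters are hypothetical names, the defaults (5 attempts, 1 s base delay, 60 s cap) are placeholders, and the dead-letter store is modeled as a plain list.

```python
import time
from typing import Callable, List, Optional, Tuple, TypeVar

T = TypeVar("T")


def run_with_retry(
    job: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    dead_letter: Optional[List[Tuple[int, str]]] = None,
) -> T:
    """Run job, retrying with exponential backoff up to max_attempts."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception as exc:
            if attempt == max_attempts:
                # Attempts exhausted: record the failure, then surface it.
                if dead_letter is not None:
                    dead_letter.append((attempt, repr(exc)))
                raise
            # Delay doubles each attempt: base, 2*base, 4*base, ... capped.
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
    raise AssertionError("unreachable")
```

Passing sleep as a parameter lets tests record the delays instead of waiting; a production version might also add jitter so that many failing jobs do not retry in lockstep.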

Verify Gate (Release)

  • End-to-end smoke path: scrape sample curriculum -> ingest -> query chat.
  • Metrics:
      • scrape success rate
      • ingestion latency/success
      • retrieval latency and hit quality
      • translation fallback frequency
  • Regression tests for multilingual roundtrip and ranking behavior.

Issue Mapping

  • Plan: #23
  • Scrape/Persist/Orchestrate: #24, #25, #26
  • Ingest hybrid + canonicalization: #27
  • Chat language + hybrid retrieval + rerank + synthesis: #28, #29, #30, #31
  • Verification: #32