Ingestion
Current ingestion topology
The branch has one primary ingestion path and one legacy path.
Primary path:
- /scraping/{source}/sync discovers source references
- references are persisted in references
- handoffs are created in ingestion_handoffs
- ingestion jobs are queued in ingestion_jobs
- /admin/jobs/dispatch starts the actual background processing
- _run_ingestion_job in app/api/routers/admin.py performs the pipeline
Legacy path:
- /admin/upload-curriculum uploads a PDF directly and writes vectors without the full hybrid ingestion contract
The primary path should be treated as the production path.
Scraping and handoff
The scraping stack is implemented in:
- app/api/routers/scraping.py
- app/services/scrapers/
- app/services/ingestion_handoff.py
Current supported source list:
- koutoubi
The sync flow:
- creates a scrape_runs row
- scrapes and normalizes source metadata
- validates required metadata contract fields
- upserts references by source, record_key
- creates idempotent ingestion_handoffs
- queues ingestion_jobs unless a compatible active or completed job already exists
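The upsert, idempotent handoff, and conditional queueing steps above can be sketched with an in-memory stand-in for the tables. This is a minimal illustration, not the code in app/services/ingestion_handoff.py: the InMemoryStore class, the sync_record function, and the reading of "compatible" as "same reference key" are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Assumption: every status except "failed" counts as "active or completed".
BLOCKING_STATUSES = {
    "queued", "parsing", "tokenizing",
    "embedding_request_sent", "embedding_upserted", "ready",
}


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for references, ingestion_handoffs, ingestion_jobs."""
    references: dict = field(default_factory=dict)  # (source, record_key) -> metadata
    handoffs: set = field(default_factory=set)      # at most one handoff per key
    jobs: list = field(default_factory=list)        # dicts: reference_key, status


def sync_record(store, source, record_key, metadata):
    """Upsert one reference and queue a job only if none is active or completed.

    Returns True when a new job was queued, False otherwise.
    """
    key = (source, record_key)
    store.references[key] = metadata  # upsert by (source, record_key)
    store.handoffs.add(key)           # adding to a set is idempotent
    already_covered = any(
        job["reference_key"] == key and job["status"] in BLOCKING_STATUSES
        for job in store.jobs
    )
    if already_covered:
        return False
    store.jobs.append({"reference_key": key, "status": "queued"})
    return True
```

Running the sync twice for the same record updates the stored metadata but leaves a single handoff and a single queued job. A job whose status is failed does not block a new one in this sketch.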
Ingestion job lifecycle
The intended job statuses are defined in app/services/ingestion.py and used by the admin route implementation:
- queued
- parsing
- tokenizing
- embedding_request_sent
- embedding_upserted
- ready
- failed
State transitions are audited in ingestion_audit.
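One way to model the statuses and the audit trail is an enum plus a transition helper that records every change. The sketch below assumes the happy path follows the listed order and that any non-terminal job may move to failed. Neither rule is confirmed by the source, and the JobStatus, HAPPY_PATH, and transition names are hypothetical.

```python
from enum import Enum


class JobStatus(str, Enum):
    QUEUED = "queued"
    PARSING = "parsing"
    TOKENIZING = "tokenizing"
    EMBEDDING_REQUEST_SENT = "embedding_request_sent"
    EMBEDDING_UPSERTED = "embedding_upserted"
    READY = "ready"
    FAILED = "failed"


# Assumption: the happy path walks the statuses in the documented order.
HAPPY_PATH = [
    JobStatus.QUEUED,
    JobStatus.PARSING,
    JobStatus.TOKENIZING,
    JobStatus.EMBEDDING_REQUEST_SENT,
    JobStatus.EMBEDDING_UPSERTED,
    JobStatus.READY,
]
TERMINAL = {JobStatus.READY, JobStatus.FAILED}


def transition(job, new_status, audit_log):
    """Move a job to new_status and append an audit entry, or raise ValueError."""
    current = JobStatus(job["status"])
    if current in TERMINAL:
        raise ValueError(f"job is already terminal: {current.value}")
    if new_status is not JobStatus.FAILED:
        expected = HAPPY_PATH[HAPPY_PATH.index(current) + 1]
        if new_status is not expected:
            raise ValueError(f"expected {expected.value}, got {new_status.value}")
    # audit_log is a plain list standing in for the ingestion_audit table.
    audit_log.append(
        {"job_id": job["id"], "from": current.value, "to": new_status.value}
    )
    job["status"] = new_status.value
```

Writing the audit entry in the same function that changes the status keeps the two from drifting apart. In a database-backed version both writes would normally share one transaction.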
Hybrid indexing flow
The active reference ingestion path in _run_ingestion_job does the following:
- load the reference row
- download the PDF from reference.pdf_source
- extract pages and raw chunks
- canonicalize chunk text to English with HybridIngestionService
- build semantic vectors for Pinecone
- build lexical rows for Postgres chunks
- upsert vectors into Pinecone namespace grade-<grade>-<subject>
- upsert lexical rows into chunks
- update documents, references, and ingestion_jobs
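The two "build" steps and the namespace convention can be illustrated as below. This is a sketch under stated assumptions rather than the logic in _run_ingestion_job: the function names, the chunk id scheme, the metadata fields, and the embed callable are hypothetical. The vector records use the id / values / metadata shape that Pinecone upserts accept, but no Pinecone or Postgres client is called here.

```python
def pinecone_namespace(grade, subject):
    """Format the namespace as grade-<grade>-<subject>."""
    return f"grade-{grade}-{subject}"


def build_index_records(reference_id, chunks, embed):
    """Build semantic vector records and lexical rows from canonicalized chunks.

    chunks: list of dicts with "page" and "canonical_text" (assumed shape).
    embed:  callable mapping text to a list of floats (assumed interface).
    """
    vectors, lexical_rows = [], []
    for index, chunk in enumerate(chunks):
        # Assumption: one shared id links a vector to its lexical row.
        chunk_id = f"{reference_id}-{index}"
        vectors.append({
            "id": chunk_id,
            "values": embed(chunk["canonical_text"]),
            "metadata": {"reference_id": reference_id, "page": chunk["page"]},
        })
        lexical_rows.append({
            "chunk_id": chunk_id,
            "reference_id": reference_id,
            "page": chunk["page"],
            "canonical_text": chunk["canonical_text"],
        })
    return vectors, lexical_rows
```

Sharing one id between the vector and the lexical row lets a retriever join semantic and lexical hits for the same chunk, which is the point of keeping both stores.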
The hybrid ingestion service keeps:
- original_text
- canonical_text
- source and canonical language
- canonicalization provider and confidence
- lexical term statistics
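A record carrying these fields might look like the dataclass below. Only original_text and canonical_text are names taken from the list above. The remaining field names, the HybridChunk class, and the simple word-count notion of "lexical term statistics" are assumptions for illustration, not the actual HybridIngestionService schema.

```python
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class HybridChunk:
    original_text: str
    canonical_text: str
    source_language: str                # hypothetical field name
    canonical_language: str             # hypothetical field name
    canonicalization_provider: str      # hypothetical field name
    canonicalization_confidence: float  # hypothetical field name
    term_frequencies: dict = field(default_factory=dict)  # hypothetical


def with_term_statistics(chunk):
    """Fill term_frequencies with lowercase word counts from canonical_text."""
    tokens = re.findall(r"\w+", chunk.canonical_text.lower())
    chunk.term_frequencies = dict(Counter(tokens))
    return chunk
```

Keeping the original text next to the canonical English text means the source wording can still be shown to users while search runs over one language.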
Legacy manual upload
/admin/upload-curriculum is still present for manual operations, but it differs from the primary path:
- it extracts chunks directly from the uploaded PDF
- it writes chunk text into Pinecone metadata
- it inserts a documents row without the same hybrid lexical workflow
Treat it as an operational escape hatch, not the preferred ingestion contract.