Ingestion

Current ingestion topology

The branch has one primary ingestion path and one legacy path.

Primary path:

  1. /scraping/{source}/sync discovers source references
  2. discovered references are persisted in the references table
  3. handoffs are created in ingestion_handoffs
  4. ingestion jobs are queued in ingestion_jobs
  5. /admin/jobs/dispatch starts the actual background processing
  6. _run_ingestion_job in app/api/routers/admin.py performs the pipeline

Legacy path:

  • /admin/upload-curriculum uploads a PDF directly and writes vectors without the full hybrid ingestion contract

The primary path should be treated as the production path.

Scraping and handoff

The scraping stack is implemented in:

  • app/api/routers/scraping.py
  • app/services/scrapers/
  • app/services/ingestion_handoff.py

Current supported source list:

  • koutoubi

The sync flow:

  • creates a scrape_runs row
  • scrapes and normalizes source metadata
  • validates required metadata contract fields
  • upserts references keyed by (source, record_key)
  • creates idempotent ingestion_handoffs
  • queues ingestion_jobs unless a compatible active or completed job already exists
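The last three steps of the sync flow are what make a repeated sync safe to run. The following is a minimal in-memory sketch of that behavior, not the code in app/services/ingestion_handoff.py. The InMemoryStore class, the sync_record function, and the choice of which statuses block a new job are assumptions made for illustration, and the sketch ignores what makes a job "compatible".

```python
from dataclasses import dataclass, field

# Statuses treated as "active or completed" for this sketch (an assumption).
BLOCKING_STATUSES = {
    "queued", "parsing", "tokenizing",
    "embedding_request_sent", "embedding_upserted", "ready",
}


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for the references, handoff, and job tables."""
    references: dict = field(default_factory=dict)  # (source, record_key) -> metadata
    handoffs: set = field(default_factory=set)      # set of (source, record_key)
    jobs: list = field(default_factory=list)        # dicts with "key" and "status"


def sync_record(store, source, record_key, metadata):
    """Upsert one reference, ensure its handoff, and queue a job if needed.

    Returns True when a new job was queued, False when an existing job
    already covers this reference.
    """
    key = (source, record_key)
    store.references[key] = metadata  # upsert: insert or overwrite
    store.handoffs.add(key)           # idempotent: adding twice is a no-op
    already_covered = any(
        job["key"] == key and job["status"] in BLOCKING_STATUSES
        for job in store.jobs
    )
    if already_covered:
        return False
    store.jobs.append({"key": key, "status": "queued"})
    return True
```

Running the same sync twice updates the reference metadata but leaves a single handoff and a single job. In this sketch a failed job does not block a new one, which is a design choice of the example rather than a documented rule.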

Ingestion job lifecycle

The intended job statuses are defined in app/services/ingestion.py and used by the admin route implementation:

  • queued
  • parsing
  • tokenizing
  • embedding_request_sent
  • embedding_upserted
  • ready
  • failed

State transitions are audited in ingestion_audit.
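As a rough illustration of audited transitions, the sketch below models the statuses as an enum and appends one audit record per change. The status values come from the list above; the JobStatus class, the transition helper, the audit record fields, and the strictly linear stage order are assumptions, not the contents of app/services/ingestion.py.

```python
from datetime import datetime, timezone
from enum import Enum


class JobStatus(str, Enum):
    QUEUED = "queued"
    PARSING = "parsing"
    TOKENIZING = "tokenizing"
    EMBEDDING_REQUEST_SENT = "embedding_request_sent"
    EMBEDDING_UPSERTED = "embedding_upserted"
    READY = "ready"
    FAILED = "failed"


# Assumed happy-path order; the real service may allow other transitions.
PIPELINE_ORDER = [
    JobStatus.QUEUED,
    JobStatus.PARSING,
    JobStatus.TOKENIZING,
    JobStatus.EMBEDDING_REQUEST_SENT,
    JobStatus.EMBEDDING_UPSERTED,
    JobStatus.READY,
]


def transition(job, new_status, audit_log):
    """Move a job to new_status and append one audit record.

    Allowed moves in this sketch: the next pipeline stage, or FAILED.
    READY and FAILED are treated as terminal.
    """
    old_status = job["status"]
    if old_status in (JobStatus.READY, JobStatus.FAILED):
        raise ValueError(f"{old_status.value} is terminal")
    next_stage = PIPELINE_ORDER[PIPELINE_ORDER.index(old_status) + 1]
    if new_status not in (next_stage, JobStatus.FAILED):
        raise ValueError(f"cannot go from {old_status.value} to {new_status.value}")
    job["status"] = new_status
    audit_log.append({
        "job_id": job["id"],
        "from_status": old_status.value,
        "to_status": new_status.value,
        "at": datetime.now(timezone.utc).isoformat(),
    })
```

Rejected transitions raise before anything is written, so the audit log only contains changes that actually happened.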

Hybrid indexing flow

The active reference ingestion path in _run_ingestion_job does the following:

  1. load the reference row
  2. download the PDF from reference.pdf_source
  3. extract pages and raw chunks
  4. canonicalize chunk text to English with HybridIngestionService
  5. build semantic vectors for Pinecone
  6. build lexical rows for the Postgres chunks table
  7. upsert vectors into Pinecone namespace grade-<grade>-<subject>
  8. upsert lexical rows into chunks
  9. update documents, references, and ingestion_jobs
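Steps 5 through 8 produce two records per chunk that must stay in step: a vector for Pinecone and a lexical row for Postgres. The sketch below shows one way to derive both from the same chunk so they share an identifier. Only the namespace pattern grade-<grade>-<subject> is taken from the text; the function names, the chunk id scheme, and the field names are hypothetical, and the embedding is supplied by the caller rather than computed.

```python
def pinecone_namespace(grade, subject):
    """Build the namespace string in the documented grade-<grade>-<subject> form."""
    return f"grade-{grade}-{subject}"


def build_records(reference_id, chunk_index, canonical_text, embedding):
    """Return (vector_record, lexical_row) for a single chunk.

    The "<reference_id>:<chunk_index>" id is an assumed scheme; the point is
    that both stores use the same id, so a retrieval hit in one can be joined
    to the other.
    """
    chunk_id = f"{reference_id}:{chunk_index}"
    vector_record = {
        "id": chunk_id,
        "values": list(embedding),
        "metadata": {"reference_id": reference_id, "chunk_index": chunk_index},
    }
    lexical_row = {
        "chunk_id": chunk_id,
        "reference_id": reference_id,
        "chunk_index": chunk_index,
        "canonical_text": canonical_text,
    }
    return vector_record, lexical_row
```

For example, build_records(42, 0, "Cells divide.", [0.1, 0.2]) yields a vector and a row that both carry the id "42:0", and pinecone_namespace(7, "biology") yields "grade-7-biology".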

The hybrid ingestion service keeps:

  • original_text
  • canonical_text
  • source and canonical language
  • canonicalization provider and confidence
  • lexical term statistics
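The retained fields could be modeled roughly as follows. This is a sketch, not the HybridIngestionService data model: apart from original_text and canonical_text, the field names are guesses at how the listed items might be stored, and the tokenizer (lowercased \w+ matches) is a simplifying assumption.

```python
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class HybridChunk:
    """Hypothetical record holding what the hybrid ingestion service keeps."""
    original_text: str
    canonical_text: str
    source_language: str
    canonical_language: str
    canonicalization_provider: str
    canonicalization_confidence: float
    term_frequencies: dict = field(default_factory=dict)
    token_count: int = 0


def with_term_stats(chunk):
    """Fill in lexical term statistics from the canonical text.

    Tokenization here is a plain lowercase word split, chosen for brevity.
    """
    tokens = re.findall(r"\w+", chunk.canonical_text.lower())
    chunk.term_frequencies = dict(Counter(tokens))
    chunk.token_count = len(tokens)
    return chunk
```

Keeping original_text next to canonical_text means the source wording can still be shown to users even though the lexical statistics are computed on the canonical English text.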

Legacy manual upload

/admin/upload-curriculum is still present for manual operations, but it differs from the primary path:

  • it extracts chunks directly from the uploaded PDF
  • it writes chunk text into Pinecone metadata
  • it inserts a documents row without the same hybrid lexical workflow

Treat it as an operational escape hatch, not the preferred ingestion contract.