Ingestion

Current ingestion topology

The branch has one primary ingestion path and one legacy path.

Primary path:

Legacy path:

/admin/upload-curriculum uploads a PDF directly and writes vectors without the full hybrid ingestion contract

The primary path should be treated as the production path.

The scraping stack is implemented in:

Current supported source list:

The sync flow:

creates a scrape_runs row
scrapes and normalizes source metadata
validates required metadata contract fields
upserts references by source,record_key
creates idempotent ingestion_handoffs
queues ingestion_jobs unless a compatible active or completed job already exists

The intended job statuses are defined in app/services/ingestion.py and used by the admin route implementation:

State transitions are audited in ingestion_audit.

The active reference ingestion path in _run_ingestion_job does the following:

The hybrid ingestion service keeps:

/admin/upload-curriculum is still present for manual operations, but it differs from the primary path:

Treat it as an operational escape hatch, not the preferred ingestion contract.