Ingestion Operations
Operating model
Production ingestion currently requires admin-triggered API calls. There is no autonomous worker deployment in this branch.
The normal operational sequence is:
- sync a source
- inspect discovered references and handoffs
- dispatch queued ingestion jobs
- inspect job status
- requeue failed jobs if needed
Required auth
Use an admin JWT accepted by require_admin in app/core/auth.py.
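As a minimal sketch of attaching the admin JWT to a request, the helper below builds (but does not send) an authenticated request. It assumes the API expects the token in a standard `Authorization: Bearer` header, which this document does not confirm. The base URL and the `admin_request` name are hypothetical.

```python
import urllib.request


def admin_request(base_url: str, path: str, token: str, method: str = "GET") -> urllib.request.Request:
    """Build a request carrying the admin JWT.

    Assumption: the token is sent as 'Authorization: Bearer <jwt>'.
    Check require_admin in app/core/auth.py for the scheme actually expected.
    """
    url = base_url.rstrip("/") + path
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/json",
    }
    return urllib.request.Request(url, method=method, headers=headers)


# Example: prepare a request for the job list (nothing is sent here).
req = admin_request("http://localhost:8000", "/admin/jobs", "example-token")
```

Passing the returned object to `urllib.request.urlopen` would perform the call.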
Source sync
Currently supported source:
- koutoubi
Trigger a sync:
Inspect results:
- GET /scraping/koutoubi/runs
- GET /scraping/koutoubi/references
- GET /scraping/koutoubi/handoffs
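The three inspection endpoints share one path pattern, so a small helper can generate them for a given source. This is only a sketch: the base URL and the `inspection_urls` name are hypothetical, and it assumes any future source would follow the same `/scraping/<source>/...` layout as koutoubi.

```python
from urllib.parse import quote

# Resource names taken from the inspection endpoints listed above.
INSPECTION_RESOURCES = ("runs", "references", "handoffs")


def inspection_urls(base_url: str, source: str = "koutoubi") -> list[str]:
    """Return the run, reference, and handoff inspection URLs for a source."""
    root = base_url.rstrip("/")
    # Percent-encode the source name so it is safe as a single path segment.
    segment = quote(source, safe="")
    return [f"{root}/scraping/{segment}/{name}" for name in INSPECTION_RESOURCES]
```

Each URL is meant to be fetched with GET using the admin JWT.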
Queue and dispatch jobs
Queue a specific reference:
Dispatch the oldest queued job:
Inspect jobs:
- GET /admin/jobs
- GET /admin/jobs/{job_id}
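A hedged sketch of calling the job inspection endpoints from a script follows. The `jobs_url` and `fetch_json` names are hypothetical, the Bearer header scheme is an assumption, and the code assumes the endpoints return JSON without assuming any particular response shape.

```python
import json
import urllib.request
from urllib.parse import quote


def jobs_url(base_url: str, job_id=None) -> str:
    """Return the job list URL, or a single job's URL when job_id is given."""
    root = base_url.rstrip("/")
    if job_id is None:
        return f"{root}/admin/jobs"
    return f"{root}/admin/jobs/{quote(str(job_id), safe='')}"


def fetch_json(url: str, token: str, timeout: float = 10.0):
    """GET a URL with the admin JWT and decode the JSON body.

    Assumption: the API accepts 'Authorization: Bearer <jwt>' and returns JSON.
    """
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

For example, `fetch_json(jobs_url(base, job_id), token)` would retrieve one job's record, whatever fields it contains.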
Requeue a failed job:
Expected status progression
Successful jobs move through:
- queued
- parsing
- tokenizing
- embedding_request_sent
- embedding_upserted
- ready
Inspect ingestion_audit when a job stalls or fails between stages.
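When deciding where a job stalled, it can help to encode the documented stage order. The sketch below is illustrative only: the helper names are hypothetical, and it treats any status outside the six documented stages (for example, whatever value the system uses for failures) as unknown.

```python
# Stage order for a successful job, as documented above.
STAGES = (
    "queued",
    "parsing",
    "tokenizing",
    "embedding_request_sent",
    "embedding_upserted",
    "ready",
)


def next_stage(status: str):
    """Return the stage expected after `status`, or None once the job is ready.

    Raises ValueError for a status that is not one of the documented stages.
    """
    idx = STAGES.index(status)
    return STAGES[idx + 1] if idx + 1 < len(STAGES) else None


def is_forward_only(history) -> bool:
    """True if the observed statuses never move backwards through STAGES.

    Repeats and skipped stages are allowed, since polling may observe the
    same stage twice or miss a short-lived one.
    """
    try:
        positions = [STAGES.index(s) for s in history]
    except ValueError:
        return False  # an unknown status, such as a failure state
    return all(a <= b for a, b in zip(positions, positions[1:]))
```

A job whose last observed status is `tokenizing` stalled before `next_stage("tokenizing")`, which narrows the time window to look at in ingestion_audit.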
Manual document cleanup
To remove a document and its vectors:
This removes chunk rows and attempts Pinecone deletion, then resets the linked reference status to discovered.
Legacy manual upload
POST /admin/upload-curriculum is available for manual PDF upload, but it bypasses the full hybrid ingestion contract. Use it only when the normal scrape-reference-job path is not viable.