Ingestion Operations

Operating model

Production ingestion currently requires admin-triggered API calls. There is no autonomous worker deployment in this branch.

The normal operational sequence is:

1. Sync a source.
2. Inspect discovered references and handoffs.
3. Dispatch queued ingestion jobs.
4. Inspect job status.
5. Requeue failed jobs if needed.
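
The sequence above can be sketched as one shell helper built from the endpoints documented in the sections below. This is a minimal sketch, not a script shipped with the repository: admin_api and run_ingestion_pass are hypothetical names, API_URL and ADMIN_JWT are assumed to be exported, and it is an assumption that the GET endpoints accept the same bearer token as the POST endpoints. Requeueing is left out because it only applies to jobs that have failed.

```shell
# Sketch only. Assumes API_URL and ADMIN_JWT are already exported.
# admin_api and run_ingestion_pass are hypothetical helper names.

# admin_api METHOD PATH: call one endpoint with the admin bearer token.
# --fail makes curl exit non-zero on HTTP errors (4xx/5xx).
admin_api() {
  curl --fail --silent --show-error -X "$1" "$API_URL$2" \
    -H "Authorization: Bearer $ADMIN_JWT"
}

# run_ingestion_pass: the first four steps of the sequence, in order.
# The && chain stops at the first call that fails.
run_ingestion_pass() {
  admin_api POST /scraping/koutoubi/sync &&
  admin_api GET /scraping/koutoubi/references &&
  admin_api GET /scraping/koutoubi/handoffs &&
  admin_api POST /admin/jobs/dispatch &&
  admin_api GET /admin/jobs
}
```

After sourcing the file, run_ingestion_pass performs one pass and prints each response body as returned by the API.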

Required auth

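Use an admin JWT accepted by require_admin in app/core/auth.py. The curl examples in this document read the API base URL from $API_URL and the token from $ADMIN_JWT.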

Source sync

Currently supported source:

koutoubi

Trigger a sync:

curl -X POST "$API_URL/scraping/koutoubi/sync" \
  -H "Authorization: Bearer $ADMIN_JWT"

Inspect results:

GET /scraping/koutoubi/runs
GET /scraping/koutoubi/references
GET /scraping/koutoubi/handoffs
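
The three GET endpoints above can be fetched in one go with a small loop. This is a sketch under assumptions: show_sync_results is a hypothetical name, API_URL and ADMIN_JWT are assumed to be exported, and the endpoints are assumed to require the same admin bearer token as the sync call. Response bodies are printed unmodified, since their shape is not documented here.

```shell
# Sketch only. show_sync_results is a hypothetical helper name.
# Prints a header line, then the raw response, for each inspection endpoint.
show_sync_results() {
  for resource in runs references handoffs; do
    echo "== $resource =="
    curl --fail --silent --show-error \
      -H "Authorization: Bearer $ADMIN_JWT" \
      "$API_URL/scraping/koutoubi/$resource" || return 1
    echo   # end the response body with a newline
  done
}
```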

Queue and dispatch jobs

Queue a specific reference:

curl -X POST "$API_URL/admin/ingest/$REFERENCE_ID" \
  -H "Authorization: Bearer $ADMIN_JWT"

Dispatch the oldest queued job:

curl -X POST "$API_URL/admin/jobs/dispatch" \
  -H "Authorization: Bearer $ADMIN_JWT"
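
The two calls above can be combined to queue several references and then dispatch them. This is a hedged sketch: queue_and_dispatch is a hypothetical name, and it assumes that each dispatch call handles exactly one job (the oldest queued one), so dispatch is called once per reference queued. If other jobs were already waiting, they would be dispatched first and some of the newly queued references would remain queued.

```shell
# Sketch only. queue_and_dispatch is a hypothetical helper name.
# Usage: queue_and_dispatch REFERENCE_ID [REFERENCE_ID ...]
queue_and_dispatch() {
  # Queue every reference passed on the command line.
  for reference_id in "$@"; do
    curl --fail --silent --show-error -X POST \
      "$API_URL/admin/ingest/$reference_id" \
      -H "Authorization: Bearer $ADMIN_JWT" || return 1
  done
  # Assumption: one dispatch call processes one job, so repeat per reference.
  for reference_id in "$@"; do
    curl --fail --silent --show-error -X POST \
      "$API_URL/admin/jobs/dispatch" \
      -H "Authorization: Bearer $ADMIN_JWT" || return 1
  done
}
```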

Inspect jobs:

GET /admin/jobs
GET /admin/jobs/{job_id}

Requeue a failed job:

curl -X POST "$API_URL/admin/jobs/$JOB_ID/requeue" \
  -H "Authorization: Bearer $ADMIN_JWT"

Expected status progression

Successful jobs move through:

queued
parsing
tokenizing
embedding_request_sent
embedding_upserted
ready

Inspect ingestion_audit when a job stalls or fails between stages.
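
The progression above can be watched with a small polling loop around GET /admin/jobs/{job_id}. This is a sketch under loud assumptions: job_status and wait_for_job are hypothetical names, the job response is assumed to be JSON with a "status" string field, and the sed extraction is deliberately crude (a JSON-aware tool would be more robust). Any status outside the documented progression, including an empty one, is treated as a reason to stop and inspect ingestion_audit.

```shell
# Sketch only. job_status and wait_for_job are hypothetical helper names.

# job_status JOB_ID: print the job's status string.
# Assumption: the response body is JSON containing "status": "<value>".
job_status() {
  curl --fail --silent --show-error \
    -H "Authorization: Bearer $ADMIN_JWT" \
    "$API_URL/admin/jobs/$1" |
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' |
    head -n 1
}

# wait_for_job JOB_ID [INTERVAL_SECONDS] [MAX_POLLS]
# Returns 0 when the job is ready, 1 on an unexpected status,
# 3 if the job is still in progress after MAX_POLLS polls.
wait_for_job() {
  job_id="$1"
  interval="${2:-5}"
  max_polls="${3:-60}"
  polls=0
  while [ "$polls" -lt "$max_polls" ]; do
    status="$(job_status "$job_id")"
    echo "job $job_id: ${status:-unknown}"
    case "$status" in
      ready) return 0 ;;
      queued|parsing|tokenizing|embedding_request_sent|embedding_upserted) ;;
      *) return 1 ;;  # not part of the documented progression
    esac
    polls=$((polls + 1))
    sleep "$interval"
  done
  return 3
}
```

For example, wait_for_job "$JOB_ID" 10 30 polls every 10 seconds for up to 30 polls and prints the observed status each time.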

Manual document cleanup

To remove a document and its vectors:

curl -X DELETE "$API_URL/admin/documents/$DOCUMENT_ID" \
  -H "Authorization: Bearer $ADMIN_JWT"

This removes chunk rows and attempts Pinecone deletion, then resets the linked reference status to discovered.

Legacy manual upload

POST /admin/upload-curriculum is available for manual PDF upload, but it bypasses the full hybrid ingestion contract. Use it only when the normal scrape-reference-job path is not viable.