Ingestion Operations

Operating model

Production ingestion currently requires admin-triggered API calls. There is no autonomous worker deployment in this branch.

The normal operational sequence is:

  1. sync a source
  2. inspect discovered references and handoffs
  3. dispatch queued ingestion jobs
  4. inspect job status
  5. requeue failed jobs if needed
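
The sequence above can be sketched as a small set of shell functions. This is a minimal sketch, not a supported tool: the function names (admin_api, sync_source, and so on) are hypothetical, it assumes API_URL and ADMIN_JWT are set as in the curl examples in this document, and it assumes the GET endpoints accept the same bearer token as the POST endpoints. Responses are printed as-is because their format is not described here.

```shell
# Hypothetical wrapper around curl; not part of the codebase.
# Usage: admin_api METHOD PATH
admin_api() {
  curl --fail --silent --show-error \
    -X "$1" "${API_URL}$2" \
    -H "Authorization: Bearer ${ADMIN_JWT}"
}

# One function per step of the operational sequence.
sync_source()     { admin_api POST "/scraping/$1/sync"; }        # 1. sync a source
list_references() { admin_api GET  "/scraping/$1/references"; }  # 2. inspect references
list_handoffs()   { admin_api GET  "/scraping/$1/handoffs"; }    # 2. inspect handoffs
dispatch_job()    { admin_api POST "/admin/jobs/dispatch"; }     # 3. dispatch the oldest queued job
list_jobs()       { admin_api GET  "/admin/jobs"; }              # 4. inspect job status
requeue_job()     { admin_api POST "/admin/jobs/$1/requeue"; }   # 5. requeue a failed job
```

For example, `sync_source koutoubi && list_references koutoubi` would run steps 1 and 2. The `--fail` flag makes curl exit non-zero on an HTTP error response, so a chained sequence stops at the first failed call.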

Required auth

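Use an admin JWT accepted by require_admin in app/core/auth.py, sent as a bearer token in the Authorization header. The examples in this document read the token from $ADMIN_JWT and the API base URL from $API_URL.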
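
Because the curl examples in this document depend on ADMIN_JWT and API_URL, a small guard can catch an unset variable before any request is sent. The function name check_admin_env is hypothetical. This is a sketch of one possible check, and it only tests that the variables are non-empty; it does not validate the token against require_admin.

```shell
# Hypothetical preflight check: fail early if either variable is unset or empty.
check_admin_env() {
  if [ -z "${API_URL:-}" ]; then
    echo "API_URL is not set" >&2
    return 1
  fi
  if [ -z "${ADMIN_JWT:-}" ]; then
    echo "ADMIN_JWT is not set" >&2
    return 1
  fi
  return 0
}
```

Running `check_admin_env || exit 1` at the top of an operations script would stop it before the first API call if either value is missing.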

Source sync

Currently supported source:

  • koutoubi

Trigger a sync:

curl -X POST "$API_URL/scraping/koutoubi/sync" \
  -H "Authorization: Bearer $ADMIN_JWT"

Inspect results:

  • GET /scraping/koutoubi/runs
  • GET /scraping/koutoubi/references
  • GET /scraping/koutoubi/handoffs

Queue and dispatch jobs

Queue a specific reference:

curl -X POST "$API_URL/admin/ingest/$REFERENCE_ID" \
  -H "Authorization: Bearer $ADMIN_JWT"

Dispatch the oldest queued job:

curl -X POST "$API_URL/admin/jobs/dispatch" \
  -H "Authorization: Bearer $ADMIN_JWT"

Inspect jobs:

  • GET /admin/jobs
  • GET /admin/jobs/{job_id}

Requeue a failed job:

curl -X POST "$API_URL/admin/jobs/$JOB_ID/requeue" \
  -H "Authorization: Bearer $ADMIN_JWT"

Expected status progression

A successful job moves through these statuses in order:

  • queued
  • parsing
  • tokenizing
  • embedding_request_sent
  • embedding_upserted
  • ready
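
When comparing a job's current status against the expected progression, it can help to turn the status into a position. The sketch below hard-codes the six statuses listed above. The function name stage_index is hypothetical, and any status not in the list (for example a failure status, which this document does not name) produces no output and a non-zero exit code.

```shell
# Hypothetical helper: print the 1-based position of a status in the
# documented progression, or return 1 if the status is not in the list.
stage_index() {
  i=0
  for s in queued parsing tokenizing embedding_request_sent embedding_upserted ready; do
    i=$((i + 1))
    if [ "$s" = "$1" ]; then
      echo "$i"
      return 0
    fi
  done
  return 1
}
```

For instance, `stage_index tokenizing` prints 3 and `stage_index ready` prints 6, so a job whose position has not changed between two checks is a candidate for a closer look.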

Inspect ingestion_audit when a job stalls or fails between stages.

Manual document cleanup

To remove a document and its vectors:

curl -X DELETE "$API_URL/admin/documents/$DOCUMENT_ID" \
  -H "Authorization: Bearer $ADMIN_JWT"

This removes the document's chunk rows, attempts to delete its vectors from Pinecone, and then resets the linked reference's status to discovered.
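
Since the delete is destructive, a wrapper that refuses an empty document ID avoids sending DELETE to $API_URL/admin/documents/ by accident when DOCUMENT_ID is unset. This is a sketch under that assumption: delete_document is a hypothetical name, and how the API responds to a path with no ID is not described in this document.

```shell
# Hypothetical guard around the cleanup call: refuse to run without an ID.
# Usage: delete_document DOCUMENT_ID
delete_document() {
  if [ -z "${1:-}" ]; then
    echo "usage: delete_document DOCUMENT_ID" >&2
    return 2
  fi
  curl --fail --silent --show-error \
    -X DELETE "${API_URL}/admin/documents/$1" \
    -H "Authorization: Bearer ${ADMIN_JWT}"
}
```

Calling `delete_document "$DOCUMENT_ID"` with an empty variable returns exit code 2 without contacting the API.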

Legacy manual upload

POST /admin/upload-curriculum is available for manual PDF upload, but it bypasses the full hybrid ingestion contract. Use it only when the normal path (source sync, then reference, then job) is not viable.