Plan: GH-0026 Reference ID Ingestion Handoff
Context
This issue formalizes the backend handoff from persisted scrape records into ingestion jobs using reference_id as the only supported queue contract.
Problem
Without a durable handoff layer, scrape persistence and ingestion can drift apart, making retries, dedupe, and operational tracing harder than they need to be.
Current state in repo
- app/services/ingestion_handoff.py exists and has tests in tests/services/test_ingestion_handoff.py.
- db/migrations/20260301000029_ingestion_handoffs.sql is present.
- app/services/ingestion.py manages ingestion jobs.
- docs/90_ops/ingestion_handoff_orchestration.md documents the implemented lifecycle.
Target state
- Scrape persistence can enqueue ingestion work through a durable reference_id contract.
- Handoff rows are idempotent, observable, and retry-safe.
- Operators can inspect handoff state without having to read raw job tables.
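As an illustration of the operator-facing view, the sketch below groups handoff rows by status and reason code. The table layout, the status and reason_code columns, and the handoff_counts helper are assumptions made for this example; the real schema is defined by the ingestion handoffs migration and may differ.

```python
import sqlite3

# Hypothetical, simplified schema for illustration only; the real columns
# come from db/migrations/20260301000029_ingestion_handoffs.sql.
SCHEMA = """
CREATE TABLE ingestion_handoffs (
    id INTEGER PRIMARY KEY,
    reference_id TEXT NOT NULL,
    status TEXT NOT NULL,
    reason_code TEXT
);
"""


def handoff_counts(conn: sqlite3.Connection) -> dict:
    """Return {(status, reason_code): count}; a missing reason code becomes ''."""
    rows = conn.execute(
        """
        SELECT status, COALESCE(reason_code, ''), COUNT(*)
        FROM ingestion_handoffs
        GROUP BY status, reason_code
        ORDER BY status, reason_code
        """
    ).fetchall()
    return {(status, reason): count for status, reason, count in rows}
```

A summary like this could back a small status endpoint or an ops query, so that operators do not need to join against job tables for a first look.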
Constraints
- Backend-only scope.
- Handoff must use persisted references rather than transient scraper payloads.
- The design must preserve replay safety and avoid duplicate active jobs.
- The plan should fit the current migrations and service boundaries.
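One way to enforce the "no duplicate active jobs" constraint at the storage level is a partial unique index. The sketch below uses SQLite for self-containment; the ingestion_jobs table, its columns, the status values, and the enqueue_job helper are all hypothetical and not taken from the existing migrations.

```python
import sqlite3

# Hypothetical job table; the unique index only covers rows that are still active,
# so finished jobs do not block a later replay of the same reference_id.
SCHEMA = """
CREATE TABLE ingestion_jobs (
    id INTEGER PRIMARY KEY,
    reference_id TEXT NOT NULL,
    status TEXT NOT NULL
);
CREATE UNIQUE INDEX one_active_job_per_reference
    ON ingestion_jobs (reference_id)
    WHERE status = 'queued' OR status = 'running';
"""


def enqueue_job(conn: sqlite3.Connection, reference_id: str) -> bool:
    """Insert a queued job; return False if an active job already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO ingestion_jobs (reference_id, status) VALUES (?, 'queued')",
                (reference_id,),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Pushing the check into the database means concurrent callers cannot both pass an application-level "does an active job exist?" test.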
Proposed approach
- Create or update a handoff record per (reference_id, payload_hash).
- Queue ingestion jobs from that handoff layer instead of directly from scraper memory state.
- Expose handoff lifecycle, counts, and reason codes for operational use.
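The create-or-update step can be sketched as an idempotent upsert keyed on (reference_id, payload_hash). This is a minimal illustration under assumptions: the table columns, the SHA-256-over-canonical-JSON hashing, and the upsert_handoff helper are hypothetical and do not describe what app/services/ingestion_handoff.py actually does.

```python
import hashlib
import json
import sqlite3

# Hypothetical schema; the UNIQUE constraint is what makes replays idempotent.
SCHEMA = """
CREATE TABLE ingestion_handoffs (
    id INTEGER PRIMARY KEY,
    reference_id TEXT NOT NULL,
    payload_hash TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',
    UNIQUE (reference_id, payload_hash)
);
"""


def payload_hash(payload: dict) -> str:
    """Hash a canonical JSON encoding so key order does not change the digest."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def upsert_handoff(conn: sqlite3.Connection, reference_id: str, payload: dict):
    """Return (handoff_id, created); replaying the same pair reuses the existing row."""
    digest = payload_hash(payload)
    cur = conn.execute(
        "INSERT OR IGNORE INTO ingestion_handoffs (reference_id, payload_hash) "
        "VALUES (?, ?)",
        (reference_id, digest),
    )
    created = cur.rowcount == 1  # 0 when the unique constraint suppressed the insert
    row = conn.execute(
        "SELECT id FROM ingestion_handoffs "
        "WHERE reference_id = ? AND payload_hash = ?",
        (reference_id, digest),
    ).fetchone()
    conn.commit()
    return row[0], created
```

With this shape, the caller would only queue an ingestion job when created is true or when the existing row is in a retryable state, which keeps job creation tied to the handoff row rather than to scraper memory.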
Risks
- Duplicate job creation if dedupe logic is incomplete.
- Ambiguous retry policy can produce noisy failed states.
- Handoff rows can become another stale layer if status transitions are not maintained.
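To keep status transitions from going stale or drifting, one option is to route every status change through a single guard. The states and allowed transitions below are placeholders chosen for illustration, not the lifecycle documented in docs/90_ops/ingestion_handoff_orchestration.md.

```python
from enum import Enum


class HandoffStatus(str, Enum):
    # Hypothetical states; the implemented lifecycle may use different names.
    PENDING = "pending"
    QUEUED = "queued"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


# Assumed transition table: failed rows may be retried, succeeded rows are terminal.
ALLOWED_TRANSITIONS = {
    HandoffStatus.PENDING: {HandoffStatus.QUEUED, HandoffStatus.FAILED},
    HandoffStatus.QUEUED: {HandoffStatus.SUCCEEDED, HandoffStatus.FAILED},
    HandoffStatus.FAILED: {HandoffStatus.PENDING},
    HandoffStatus.SUCCEEDED: set(),
}


class InvalidTransition(ValueError):
    """Raised when a status change is not in the allowed table."""


def transition(current: HandoffStatus, new: HandoffStatus) -> HandoffStatus:
    """Return the new status if the move is allowed, otherwise raise."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise InvalidTransition(f"{current.value} -> {new.value} is not allowed")
    return new
```

Centralizing the check makes an unexpected transition fail loudly at write time instead of leaving a row in a state that no worker will pick up.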
Open questions
- Should handoff orchestration remain synchronous with scrape sync or become asynchronous later?
- Which failure classes should retry automatically versus fail fast?
Acceptance criteria
- A plan doc exists for #26 under docs/plans/.
- The doc defines reference_id as the ingestion handoff boundary.
- The doc calls out idempotency, reason codes, and retry behavior.
- The doc lists the current service and migration touchpoints.
Files likely to change
- docs/plans/gh-0026-reference-id-ingestion-handoff.md
- app/services/ingestion_handoff.py
- app/services/ingestion.py
- app/api/routers/scraping.py
- db/migrations/20260301000029_ingestion_handoffs.sql
- tests/services/test_ingestion_handoff.py
Related issue
#26 - [Backend][Ingestion] Orchestrate reference_id handoff from scrape storage
Status
Backfilled planning stub