Plan: GH-0026 Reference ID Ingestion Handoff

Context

This issue formalizes the backend handoff from persisted scrape records into ingestion jobs using reference_id as the only supported queue contract.

Problem

Without a durable handoff layer, scrape persistence and ingestion can drift apart, making retries, dedupe, and operational tracing harder than they need to be.

Current state in repo

  • app/services/ingestion_handoff.py exists and has tests in tests/services/test_ingestion_handoff.py.
  • db/migrations/20260301000029_ingestion_handoffs.sql is present.
  • app/services/ingestion.py manages ingestion jobs.
  • docs/90_ops/ingestion_handoff_orchestration.md documents the implemented lifecycle.

Target state

  • Scrape persistence can enqueue ingestion work through a durable reference_id contract.
  • Handoff rows are idempotent, observable, and retry-safe.
  • Operators can inspect handoff state without having to read raw job tables.

Constraints

  • Backend-only scope.
  • Handoff must use persisted references rather than transient scraper payloads.
  • The design must preserve replay safety and avoid duplicate active jobs.
  • The plan should fit the current migrations and service boundaries.
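
One way to enforce the "no duplicate active jobs" constraint is at the database level rather than only in service code. The sketch below is illustrative: it uses an in-memory SQLite database as a stand-in for the real store, and the ingestion_jobs table, its columns, the status values, and the enqueue_job helper are all assumptions, not the schema from the existing migration.

```python
import sqlite3

# Hypothetical job table; the real schema lives in db/migrations/ and may differ.
JOBS_SCHEMA = """
CREATE TABLE ingestion_jobs (
    id           INTEGER PRIMARY KEY,
    reference_id TEXT NOT NULL,
    status       TEXT NOT NULL DEFAULT 'queued'
);
-- Partial unique index: at most one queued/running job per reference_id.
-- Finished jobs fall outside the index, so replays remain possible.
CREATE UNIQUE INDEX one_active_job_per_reference
    ON ingestion_jobs (reference_id)
    WHERE status = 'queued' OR status = 'running';
"""


def enqueue_job(conn: sqlite3.Connection, reference_id: str) -> bool:
    """Insert a queued job; return False if an active job already exists."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute(
                "INSERT INTO ingestion_jobs (reference_id) VALUES (?)",
                (reference_id,),
            )
        return True
    except sqlite3.IntegrityError:
        # The partial unique index rejected a second active job.
        return False
```

With a guard like this, a replayed enqueue for the same reference_id is a no-op while a job is still active, and becomes valid again once that job reaches a terminal status.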

Proposed approach

  1. Create or update a handoff record per (reference_id, payload_hash).
  2. Queue ingestion jobs from that handoff layer instead of directly from scraper memory state.
  3. Expose handoff lifecycle, counts, and reason codes for operational use.
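
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in app/services/ingestion_handoff.py: the table name is taken from the migration filename, but the columns, the 'pending' default status, and the helpers payload_hash, upsert_handoff, and status_counts are hypothetical, and SQLite stands in for the real database.

```python
import hashlib
import json
import sqlite3

# Assumed columns; only the (reference_id, payload_hash) key comes from the plan.
HANDOFF_SCHEMA = """
CREATE TABLE ingestion_handoffs (
    id           INTEGER PRIMARY KEY,
    reference_id TEXT NOT NULL,
    payload_hash TEXT NOT NULL,
    status       TEXT NOT NULL DEFAULT 'pending',
    reason_code  TEXT,
    UNIQUE (reference_id, payload_hash)
);
"""


def payload_hash(record: dict) -> str:
    """Hash a canonical JSON form so key order does not change the digest."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def upsert_handoff(conn: sqlite3.Connection, reference_id: str, record: dict):
    """Create the handoff row if missing; return (row id, created flag)."""
    digest = payload_hash(record)
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO ingestion_handoffs (reference_id, payload_hash) "
            "VALUES (?, ?)",
            (reference_id, digest),
        )
        created = cur.rowcount == 1  # 0 when the unique key already existed
        row = conn.execute(
            "SELECT id FROM ingestion_handoffs "
            "WHERE reference_id = ? AND payload_hash = ?",
            (reference_id, digest),
        ).fetchone()
    return row[0], created


def status_counts(conn: sqlite3.Connection) -> dict:
    """Per-status handoff counts, the kind of summary an operator view could expose."""
    rows = conn.execute(
        "SELECT status, COUNT(*) FROM ingestion_handoffs GROUP BY status"
    )
    return dict(rows.fetchall())
```

In this sketch, a caller would queue an ingestion job only when the created flag is true, so repeated scrape syncs of an unchanged record do not produce new work.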

Risks

  • Duplicate job creation if dedupe logic is incomplete.
  • Ambiguous retry policy can produce noisy failed states.
  • Handoff rows can become another stale layer if status transitions are not maintained.
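
To reduce the risk of stale or noisy states, status changes could go through one explicit transition table. The sketch below is an assumption-laden illustration: the status names, the allowed transitions, and the rule that a failure must carry a reason code are hypothetical and would need to match the lifecycle documented in docs/90_ops/ingestion_handoff_orchestration.md.

```python
from enum import Enum
from typing import Optional


class HandoffStatus(str, Enum):
    # Hypothetical status set; the implemented lifecycle may use other names.
    PENDING = "pending"
    QUEUED = "queued"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


# Each status maps to the statuses it may move to next.
ALLOWED_TRANSITIONS = {
    HandoffStatus.PENDING: {HandoffStatus.QUEUED, HandoffStatus.FAILED},
    HandoffStatus.QUEUED: {HandoffStatus.SUCCEEDED, HandoffStatus.FAILED},
    HandoffStatus.FAILED: {HandoffStatus.PENDING},  # explicit retry path
    HandoffStatus.SUCCEEDED: set(),                 # terminal
}


class InvalidTransition(ValueError):
    """Raised when a status change is not in ALLOWED_TRANSITIONS."""


def transition(
    current: HandoffStatus,
    target: HandoffStatus,
    reason_code: Optional[str] = None,
) -> HandoffStatus:
    """Validate a status change and return the new status."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise InvalidTransition(f"{current.value} -> {target.value} is not allowed")
    if target is HandoffStatus.FAILED and not reason_code:
        # Every failure stays traceable for operators.
        raise ValueError("a failed handoff needs a reason_code")
    return target
```

Keeping the table in one place means a new status or retry rule is a single reviewed change rather than scattered conditionals.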

Open questions

  • Should handoff orchestration remain synchronous with scrape sync or become asynchronous later?
  • Which failure classes should retry automatically versus fail fast?

Acceptance criteria

  • A plan doc exists for #26 under docs/plans/.
  • The doc defines reference_id as the ingestion handoff boundary.
  • The doc calls out idempotency, reason codes, and retry behavior.
  • The doc lists the current service and migration touchpoints.

Files likely to change

  • docs/plans/gh-0026-reference-id-ingestion-handoff.md
  • app/services/ingestion_handoff.py
  • app/services/ingestion.py
  • app/api/routers/scraping.py
  • db/migrations/20260301000029_ingestion_handoffs.sql
  • tests/services/test_ingestion_handoff.py

Related issue

  • #26 - [Backend][Ingestion] Orchestrate reference_id handoff from scrape storage

Status

Backfilled planning stub