Plan: GH-0025 Persist Scraped Curriculum Metadata to Supabase

Context

This issue turns scraper output into durable backend records so that later ingestion and retrieval work can treat the references table as the handoff boundary between scraping and the rest of the system.

Problem

Scraped curriculum metadata must be persisted idempotently, with enough traceability to support replay, dedupe, and downstream ingestion decisions.

Current state in repo

  • app/services/scraper_service.py writes scrape runs and references.
  • db/migrations/20260301000028_scrape_metadata_persistence.sql adds persistence fields such as record_key, version_hash, lineage, and scrape_run_id.
  • db/bootstrap.sql includes scrape_runs and references.
  • Source output already includes metadata contract fields in the scraper layer.

Target state

  • Every discovered curriculum record is persisted with stable identity and traceability fields.
  • Re-running the same scrape is idempotent at the reference layer: it adds no duplicate rows to references.
  • references can act as the authoritative queue input for ingestion work.
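As an illustration of what "stable identity" could mean in practice, here is a minimal sketch of deriving record_key and version_hash deterministically. The function names, the choice of SHA-256, and the url and payload inputs are assumptions made for this example; the actual contract fields live in the scraper layer and may differ.

```python
import hashlib
import json


def make_record_key(source: str, url: str) -> str:
    """Derive a stable identity from fields that should not change between scrapes.

    Hypothetical helper: `url` stands in for whatever natural identifier
    the scraper really uses.
    """
    normalized = f"{source.strip().lower()}|{url.strip()}"
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def make_version_hash(payload: dict) -> str:
    """Hash the record content so a changed record is detectable on re-scrape.

    Keys are sorted so that dict ordering does not affect the hash.
    """
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Keeping identity (record_key) separate from content (version_hash) lets a rerun recognize "same record, same content" and "same record, new content" as different cases.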

Constraints

  • Backend-only scope.
  • Persistence must work with Supabase/Postgres schema already present in db/.
  • The design must preserve deterministic keys and replay safety.
  • Schema and code paths must remain compatible with current tests and migrations.

Proposed approach

  1. Persist metadata contract fields into references.
  2. Tie persisted references back to scrape_runs for observability.
  3. Use stable keys and version hashes to prevent accidental duplicate inserts.
  4. Treat references as the durable handoff table for later ingestion jobs.
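The steps above can be sketched end to end with an in-memory SQLite database standing in for Supabase/Postgres. This is only a sketch under stated assumptions: the column set is reduced (lineage is omitted), the uniqueness rule is assumed to be (source, record_key), and persist_run is a hypothetical helper, not the code in app/services/scraper_service.py. Note that references is a reserved word in SQL, so the table name is quoted.

```python
import sqlite3

# Reduced, assumed schema; the real one lives in db/bootstrap.sql and db/migrations/.
SCHEMA = """
CREATE TABLE scrape_runs (
    id INTEGER PRIMARY KEY,
    source TEXT NOT NULL
);
CREATE TABLE "references" (
    source TEXT NOT NULL,
    record_key TEXT NOT NULL,
    version_hash TEXT NOT NULL,
    scrape_run_id INTEGER NOT NULL REFERENCES scrape_runs(id),
    UNIQUE (source, record_key)
);
"""

# Upsert keyed on the assumed uniqueness rule (source, record_key).
UPSERT = """
INSERT INTO "references" (source, record_key, version_hash, scrape_run_id)
VALUES (?, ?, ?, ?)
ON CONFLICT (source, record_key) DO UPDATE SET
    version_hash = excluded.version_hash,
    scrape_run_id = excluded.scrape_run_id
"""


def persist_run(conn: sqlite3.Connection, source: str, records: list) -> int:
    """Record one scrape run and upsert its (record_key, version_hash) pairs."""
    cur = conn.execute("INSERT INTO scrape_runs (source) VALUES (?)", (source,))
    run_id = cur.lastrowid
    conn.executemany(UPSERT, [(source, key, vh, run_id) for key, vh in records])
    conn.commit()
    return run_id


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

records = [("rec-a", "hash-1"), ("rec-b", "hash-2")]
first_run = persist_run(conn, "demo-source", records)
second_run = persist_run(conn, "demo-source", records)  # rerun of the same scrape

row_count = conn.execute('SELECT COUNT(*) FROM "references"').fetchone()[0]
run_count = conn.execute("SELECT COUNT(*) FROM scrape_runs").fetchone()[0]
```

In this sketch every run is logged in scrape_runs, while references keeps one row per (source, record_key) and points at the most recent run that saw it. Whether scrape_run_id should track the first or the latest run is a design choice this example does not settle.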

Risks

  • Idempotency bugs can create duplicate references.
  • Partial metadata backfills can leave mixed legacy and v1 rows.
  • Weak uniqueness rules can make reruns nondeterministic.

Open questions

  • Should metadata persistence upsert by (source, record_key) only, or also consider version_hash transitions?
  • How much legacy backfill should remain in scope for this issue versus follow-up work?
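To make the first question concrete, here is one possible shape for the version_hash option, offered as a sketch and not as a decision. UpsertAction and decide_action are hypothetical names; the existing hash is assumed to be looked up by (source, record_key) before the write.

```python
from enum import Enum
from typing import Optional


class UpsertAction(Enum):
    INSERT = "insert"  # no row exists yet for (source, record_key)
    SKIP = "skip"      # row exists and content is unchanged
    UPDATE = "update"  # row exists but version_hash differs


def decide_action(
    existing_version_hash: Optional[str], incoming_version_hash: str
) -> UpsertAction:
    """Classify a scraped record against the stored row, if any."""
    if existing_version_hash is None:
        return UpsertAction.INSERT
    if existing_version_hash == incoming_version_hash:
        return UpsertAction.SKIP
    return UpsertAction.UPDATE
```

Upserting by (source, record_key) only would collapse SKIP and UPDATE into a single overwrite; distinguishing them gives downstream ingestion a signal for which references actually changed.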

Acceptance criteria

  • A plan doc exists for #25 under docs/plans/.
  • The doc identifies references persistence and idempotency as the main goal.
  • The doc names the current migration and service touchpoints.
  • The plan stays within backend-only scope.

Files likely to change

  • docs/plans/gh-0025-persist-scraped-curriculum-metadata.md
  • app/services/scraper_service.py
  • db/migrations/20260301000028_scrape_metadata_persistence.sql
  • db/bootstrap.sql
  • tests/services/test_scraper_service.py
  • #25 - [Backend][Data] Persist scraped curriculum metadata to Supabase (idempotent)

Status

Backfilled planning stub