Plan: GH-0025 Persist Scraped Curriculum Metadata to Supabase
Context
This issue turns scraper output into durable backend records so later ingestion and retrieval work can rely on references as the system handoff boundary.
Problem
Scraped curriculum metadata must be persisted idempotently, with enough traceability to support replay, dedupe, and downstream ingestion decisions.
Current state in repo
- app/services/scraper_service.py writes scrape runs and references.
- db/migrations/20260301000028_scrape_metadata_persistence.sql adds persistence fields such as record_key, version_hash, lineage, and scrape_run_id.
- db/bootstrap.sql includes scrape_runs and references.
- Source output already includes metadata contract fields in the scraper layer.
Target state
- Every discovered curriculum record is persisted with stable identity and traceability fields.
- Re-running the same scrape is idempotent at the reference layer.
- references can act as the authoritative queue input for ingestion work.
Constraints
- Backend-only scope.
- Persistence must work with the Supabase/Postgres schema already present in db/.
- The design must preserve deterministic keys and replay safety.
- Schema and code paths must remain compatible with current tests and migrations.
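
The deterministic-key constraint can be illustrated with a minimal sketch. The plan does not say how record_key or version_hash are derived, so the derivation below is an assumption: record_key hashes the source plus a hypothetical source-side identifier (source_record_id), and version_hash fingerprints the canonicalized metadata. The function names are illustrative and are not taken from scraper_service.py.

```python
import hashlib
import json


def record_key(source: str, source_record_id: str) -> str:
    """Stable identity: depends only on where the record came from.

    Assumption: the scraper can supply some source-side identifier.
    The unit separator keeps ("a", "b:c") distinct from ("a:b", "c").
    """
    raw = f"{source}\x1f{source_record_id}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def version_hash(metadata: dict) -> str:
    """Content fingerprint: changes only when the metadata changes.

    Sorting keys and fixing separators makes the JSON canonical, so
    dict ordering differences between runs do not alter the hash.
    """
    canonical = json.dumps(
        metadata, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Keeping identity and content in separate hashes lets a rerun recognize the same record while still detecting that its metadata has changed.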
Proposed approach
- Persist metadata contract fields into references.
- Tie persisted references back to scrape_runs for observability.
- Use stable keys and version hashes to prevent accidental duplicate inserts.
- Treat references as the durable handoff table for later ingestion jobs.
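
The approach above can be sketched as an upsert keyed on (source, record_key). This is a rough illustration, not the repository's implementation: it uses an in-memory SQLite database as a stand-in for Supabase/Postgres, the table layout is a reduced assumption rather than the real schema in db/, and open_demo_db and persist are hypothetical helpers. Note that references is a reserved word in SQL, so the table name is quoted.

```python
import sqlite3

# Reduced, assumed schema; the real one lives in db/bootstrap.sql.
SCHEMA = """
CREATE TABLE scrape_runs (id INTEGER PRIMARY KEY, source TEXT NOT NULL);
CREATE TABLE "references" (
    source        TEXT NOT NULL,
    record_key    TEXT NOT NULL,
    version_hash  TEXT NOT NULL,
    scrape_run_id INTEGER NOT NULL REFERENCES scrape_runs(id),
    UNIQUE (source, record_key)
);
"""

# Insert new rows; touch an existing row only when its content changed.
UPSERT = """
INSERT INTO "references" (source, record_key, version_hash, scrape_run_id)
VALUES (?, ?, ?, ?)
ON CONFLICT (source, record_key) DO UPDATE SET
    version_hash  = excluded.version_hash,
    scrape_run_id = excluded.scrape_run_id
WHERE "references".version_hash <> excluded.version_hash
"""


def open_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with the assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def persist(conn, run_id, source, rows):
    """Upsert (record_key, version_hash) pairs for one scrape run."""
    conn.execute(
        "INSERT INTO scrape_runs (id, source) VALUES (?, ?)", (run_id, source)
    )
    conn.executemany(UPSERT, [(source, key, vh, run_id) for key, vh in rows])
    conn.commit()
```

With this shape, replaying an identical scrape leaves references untouched, and a row's scrape_run_id always points at the run that last changed its content.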
Risks
- Idempotency bugs can create duplicate references.
- Partial metadata backfills can leave mixed legacy and v1 rows.
- Weak uniqueness rules can make reruns nondeterministic.
Open questions
- Should metadata persistence upsert by (source, record_key) only, or also consider version_hash transitions?
- How much legacy backfill should remain in scope for this issue versus follow-up work?
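
To make the first question concrete, the two candidate policies can be compared as a small decision function. This is only a framing aid under assumed semantics: plan_action and the track_versions flag are hypothetical names, and "key-only" is interpreted here as skipping any record whose key already exists.

```python
from typing import Optional


def plan_action(
    existing_hash: Optional[str], incoming_hash: str, track_versions: bool
) -> str:
    """Decide what a rerun does with one scraped record.

    existing_hash is None when no row exists for (source, record_key).
    """
    if existing_hash is None:
        return "insert"
    if not track_versions:
        # Key-only policy: the stored row wins, content changes are ignored.
        return "skip"
    # Version-aware policy: rewrite only when the content fingerprint moved.
    return "skip" if existing_hash == incoming_hash else "update"
```

Both policies are idempotent for identical reruns; they differ only in whether changed upstream metadata reaches references.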
Acceptance criteria
- A plan doc exists for #25 under docs/plans/.
- The doc identifies references persistence and idempotency as the main goal.
- The doc names the current migration and service touchpoints.
- The plan stays within backend-only scope.
Files likely to change
- docs/plans/gh-0025-persist-scraped-curriculum-metadata.md
- app/services/scraper_service.py
- db/migrations/20260301000028_scrape_metadata_persistence.sql
- db/bootstrap.sql
- tests/services/test_scraper_service.py
Related issue
#25 - [Backend][Data] Persist scraped curriculum metadata to Supabase (idempotent)
Status
Backfilled planning stub