Plan: GH-0024 Robust Curriculum Scraper with Metadata and Checkpoints
Context
This issue covers reliable curriculum discovery for the backend pipeline, including deterministic metadata output and restart-safe scraping behavior.
Problem
Scrapers need to emit enough structured metadata for persistence and ingestion, while also supporting retries and checkpoint resume without duplicating work.
Current state in repo
- app/services/scrapers/koutoubi.py implements source scraping with checkpoint support.
- app/services/scrapers/metadata.py builds and validates the metadata contract.
- app/services/scrapers/checkpoint.py persists checkpoint state.
- tests/services/test_koutoubi_scraper.py and tests/services/test_scraper_metadata.py cover key behavior.
Target state
- Scrapers emit a stable, validated metadata contract for every discovered artifact.
- Checkpoint and retry behavior is deterministic and safe across reruns.
- The scraper output is ready for persistence and later ingestion handoff.
Constraints
- Backend-only scope; no admin UI changes.
- The source-specific implementation must still fit the generic scraper contract.
- Metadata must be deterministic enough for dedupe and replay behavior.
- The plan must align with existing references table usage.
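To illustrate the determinism constraint, here is a minimal sketch of one way to derive a stable artifact identity from a source name and a normalized URL. The function names `normalize_url` and `artifact_id`, the normalization rules, and the choice of SHA-256 are assumptions made for illustration; they are not taken from app/services/scrapers/metadata.py.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment, strip a trailing slash.

    Hypothetical normalization rules; the real contract may differ.
    """
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def artifact_id(source: str, url: str) -> str:
    """Derive a stable ID from the source name and the normalized URL.

    The same (source, URL) pair always hashes to the same value, so a
    rerun can recognize an artifact it has already emitted.
    """
    key = f"{source}\n{normalize_url(url)}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```

With an identity like this, two spellings of the same URL collapse to one ID, which is the property dedupe and replay would rely on. Whether query strings should also be normalized is left open here.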
Proposed approach
- Define the required scraper metadata contract and validation rules.
- Keep checkpoint state per source so reruns can skip completed URLs and isolate failures.
- Normalize discovered documents into a persistence-ready payload before they reach the database layer.
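The per-source checkpoint idea above could look roughly like the following sketch. It assumes file-based JSON storage, one file per source, and an atomic write via `os.replace`; the class `SourceCheckpoint`, the helper `run`, and the file naming are hypothetical and do not describe the existing app/services/scrapers/checkpoint.py.

```python
import json
import os
import tempfile
from pathlib import Path


class SourceCheckpoint:
    """File-backed checkpoint for a single source (hypothetical sketch)."""

    def __init__(self, directory, source):
        self.path = Path(directory) / f"{source}.checkpoint.json"
        self.completed = set()
        self.failed = {}
        if self.path.exists():
            data = json.loads(self.path.read_text(encoding="utf-8"))
            self.completed = set(data.get("completed", []))
            self.failed = dict(data.get("failed", {}))

    def is_done(self, url):
        return url in self.completed

    def mark_done(self, url):
        self.completed.add(url)
        self.failed.pop(url, None)  # a later success clears the old failure
        self._save()

    def mark_failed(self, url, error):
        self.failed[url] = error
        self._save()

    def _save(self):
        # Write to a temp file in the same directory, then swap it in with
        # os.replace so a crash mid-write cannot leave a truncated checkpoint.
        self.path.parent.mkdir(parents=True, exist_ok=True)
        payload = {"completed": sorted(self.completed), "failed": self.failed}
        fd, tmp = tempfile.mkstemp(dir=self.path.parent, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, sort_keys=True)
        os.replace(tmp, self.path)


def run(urls, fetch, checkpoint):
    """Process each URL at most once; one failure does not stop the run."""
    for url in urls:
        if checkpoint.is_done(url):
            continue  # completed in an earlier run
        try:
            fetch(url)
        except Exception as exc:
            checkpoint.mark_failed(url, str(exc))
        else:
            checkpoint.mark_done(url)
```

On a rerun, only URLs that are not yet in `completed` are fetched again, so previously failed URLs get retried while finished ones are skipped. Saving after every URL is the simplest choice for a sketch; batching the writes would be a reasonable variation if checkpoint I/O becomes a bottleneck.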
Risks
- Source HTML changes can silently break extraction quality.
- A weak metadata contract can cause duplicate references or unstable identities.
- Checkpoint files can preserve bad state if failure handling is incomplete.
Open questions
- Should checkpoint storage remain file-based or move fully into Postgres later?
- Which metadata fields are mandatory for every source versus source-specific extensions?
Acceptance criteria
- A plan doc exists for #24 under docs/plans/.
- The doc names metadata contract and checkpoint behavior as core outputs.
- The plan is explicitly backend-only.
- The likely code touchpoints for scraper contract work are identified.
Files likely to change
- docs/plans/gh-0024-robust-curriculum-scraper.md
- app/services/scrapers/koutoubi.py
- app/services/scrapers/metadata.py
- app/services/scrapers/checkpoint.py
- tests/services/test_koutoubi_scraper.py
- tests/services/test_scraper_metadata.py
Related issue
#24 - [Backend][Pipeline] Robust curriculum scraper with full metadata + checkpoints
Status
Backfilled planning stub