Plan: GH-0024 Robust Curriculum Scraper with Metadata and Checkpoints

Context

This issue covers reliable curriculum discovery for the backend pipeline, including deterministic metadata output and restart-safe scraping behavior.

Problem

Scrapers need to emit enough structured metadata for persistence and ingestion, while also supporting retries and checkpoint resume without duplicating work.

Current state in repo

  • app/services/scrapers/koutoubi.py implements source scraping with checkpoint support.
  • app/services/scrapers/metadata.py builds and validates the metadata contract.
  • app/services/scrapers/checkpoint.py persists checkpoint state.
  • tests/services/test_koutoubi_scraper.py and tests/services/test_scraper_metadata.py cover key behavior.

Target state

  • Scrapers emit a stable, validated metadata contract for every discovered artifact.
  • Checkpoint and retry behavior is deterministic and safe across reruns.
  • The scraper output is ready for persistence and later ingestion handoff.

Constraints

  • Backend-only scope; no admin UI changes.
  • The source-specific implementation must still fit the generic scraper contract.
  • Metadata must be deterministic: the same artifact must yield the same identity and field values on every run, since dedupe and replay depend on it.
  • The plan must align with existing references table usage.
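One way to picture the determinism constraint is a stable identity key derived from canonicalized metadata. The sketch below is only an illustration: the function name artifact_identity and the choice of SHA-256 over sorted-key JSON are assumptions, not something taken from app/services/scrapers/metadata.py.

```python
import hashlib
import json


def artifact_identity(source: str, url: str, metadata: dict) -> str:
    """Return a stable hex digest for one discovered artifact.

    Hypothetical helper: the real contract may pick different identity inputs.
    """
    # Sorting keys and fixing separators removes dict-ordering and
    # whitespace differences, so equal content always serializes equally.
    canonical = json.dumps(
        {"source": source, "url": url.strip(), "metadata": metadata},
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

With a key like this, a rerun that rediscovers the same document produces the same identity, so the persistence layer can treat it as an update or a no-op rather than a new reference.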

Proposed approach

  1. Define the required scraper metadata contract and validation rules.
  2. Keep checkpoint state per source so reruns can skip completed URLs and isolate failures.
  3. Normalize discovered documents into a persistence-ready payload before they reach the database layer.
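The three steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in koutoubi.py, metadata.py, or checkpoint.py: the required field names, the Checkpoint class, and run_source are all hypothetical, and which fields are actually mandatory is still an open question in this plan.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path

# Hypothetical required fields; the real contract is still to be defined.
REQUIRED_FIELDS = ("source", "url", "title")


def validate_metadata(record: dict) -> dict:
    """Reject records missing required fields and trim string values."""
    missing = [k for k in REQUIRED_FIELDS if not str(record.get(k) or "").strip()]
    if missing:
        raise ValueError(f"missing required metadata fields: {missing}")
    return {k: (v.strip() if isinstance(v, str) else v) for k, v in record.items()}


@dataclass
class Checkpoint:
    """Per-source checkpoint: completed URLs plus the last error per failed URL."""

    path: Path
    completed: set = field(default_factory=set)
    failed: dict = field(default_factory=dict)

    @classmethod
    def load(cls, path) -> "Checkpoint":
        path = Path(path)
        if path.exists():
            data = json.loads(path.read_text(encoding="utf-8"))
            return cls(path, set(data.get("completed", [])), dict(data.get("failed", {})))
        return cls(path)

    def save(self) -> None:
        payload = {"completed": sorted(self.completed), "failed": self.failed}
        self.path.write_text(json.dumps(payload, indent=2), encoding="utf-8")


def run_source(urls, fetch, checkpoint: Checkpoint) -> list:
    """Scrape each URL once, skipping completed ones and isolating failures."""
    payloads = []
    for url in urls:
        if url in checkpoint.completed:
            continue  # finished in an earlier run
        try:
            payloads.append(validate_metadata(fetch(url)))
        except Exception as exc:  # one bad URL must not abort the whole source
            checkpoint.failed[url] = str(exc)
        else:
            checkpoint.completed.add(url)
            checkpoint.failed.pop(url, None)
        checkpoint.save()  # persist after every URL so a crash loses little
    return payloads
```

In this sketch a URL is marked completed only after its metadata validates, so a rerun retries failed or invalid URLs and skips everything else.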

Risks

  • Source HTML changes can silently break extraction quality.
  • A weak metadata contract can cause duplicate references or unstable identities.
  • Checkpoint files can preserve bad state if failure handling is incomplete.
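One narrow part of the checkpoint risk is a crash in the middle of a write, which can leave a truncated file behind. A common mitigation, sketched here as an assumption rather than a description of what checkpoint.py does today, is to write to a temporary file and swap it into place with os.replace. The function name write_checkpoint_atomically is hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def write_checkpoint_atomically(path, state: dict) -> None:
    """Write state as JSON so readers see either the old or the new file."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so os.replace stays on
    # one filesystem, where the rename is atomic.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle, sort_keys=True)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)
    except BaseException:
        # Remove the partial temp file and leave the previous checkpoint untouched.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

This only guards against partial writes. Logical bad state, such as a URL marked complete before its payload was persisted, still has to be handled in the scraper's failure path.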

Open questions

  • Should checkpoint storage remain file-based or move fully into Postgres later?
  • Which metadata fields are mandatory for every source versus source-specific extensions?

Acceptance criteria

  • A plan doc exists for #24 under docs/plans/.
  • The doc names metadata contract and checkpoint behavior as core outputs.
  • The plan is explicitly backend-only.
  • The likely code touchpoints for scraper contract work are identified.

Files likely to change

  • docs/plans/gh-0024-robust-curriculum-scraper.md
  • app/services/scrapers/koutoubi.py
  • app/services/scrapers/metadata.py
  • app/services/scrapers/checkpoint.py
  • tests/services/test_koutoubi_scraper.py
  • tests/services/test_scraper_metadata.py

Related issue

  • #24 - [Backend][Pipeline] Robust curriculum scraper with full metadata + checkpoints

Status

Backfilled planning stub