Scraper Metadata Contract (Issue #24)

This document defines the required metadata emitted by curriculum scrapers before persistence/ingestion handoff.

Contract Version

  • contract_version: v1

Required Fields

| Field | Type | Description |
|---|---|---|
| record_key | string | Deterministic key derived from source + curriculum_id |
| source_url | string | Canonical URL of the source artifact (PDF/doc) |
| curriculum_id | string | Stable curriculum identifier (URL stem, falling back to the title slug) |
| curriculum_name | string | Human-readable title from the scraped row |
| grade | string | Normalized grade/tier, inferred |
| subject | string | Normalized subject, inferred |
| language | string | Detected language label |
| version_hash | string | SHA-256 hash over key metadata fields |
| crawl_timestamp | string (ISO 8601) | Capture timestamp at scrape time |
| lineage | object | Source lineage chain |

lineage object

  • sitemap_url: root sitemap where page discovery started
  • page_url: curriculum list page where document was found
  • source_url: final document URL
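
For orientation, a record that satisfies the field list above might look like the following. This is only an illustrative sketch: every value is invented (the example.org URLs, the identifiers, the placeholder hashes), and it assumes lineage.source_url repeats the top-level source_url, which the contract implies but does not state outright.

```python
# Illustrative record only; all values are made up for this example.
example_record = {
    "record_key": "0" * 20,    # placeholder for the 20-character key
    "source_url": "https://example.org/curricula/math-grade-5.pdf",
    "curriculum_id": "math-grade-5",
    "curriculum_name": "Mathematics, Grade 5",
    "grade": "5",
    "subject": "mathematics",
    "language": "en",
    "version_hash": "0" * 64,  # placeholder for a SHA-256 digest
    "crawl_timestamp": "2024-01-15T09:30:00+00:00",  # ISO 8601
    "lineage": {
        "sitemap_url": "https://example.org/sitemap.xml",
        "page_url": "https://example.org/curricula/",
        # Assumption: same URL as the top-level source_url.
        "source_url": "https://example.org/curricula/math-grade-5.pdf",
    },
}
```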

Validation Rules

Validation is performed in app/services/scrapers/metadata.py:

  1. Required fields must exist
  2. lineage must be an object
  3. lineage must include sitemap_url, page_url, source_url
  4. Key string fields must be non-empty
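
The four rules can be sketched roughly as below. This is not the code in metadata.py: the function name validate_metadata, the error messages, and in particular the choice of which fields count as "key string fields" (NON_EMPTY_FIELDS) are assumptions made for illustration.

```python
from typing import Any

REQUIRED_FIELDS = (
    "record_key", "source_url", "curriculum_id", "curriculum_name",
    "grade", "subject", "language", "version_hash",
    "crawl_timestamp", "lineage",
)
LINEAGE_FIELDS = ("sitemap_url", "page_url", "source_url")
# Assumption: the contract does not list the "key string fields";
# these three are a guess used only for this sketch.
NON_EMPTY_FIELDS = ("record_key", "source_url", "curriculum_id")


def validate_metadata(record: dict[str, Any]) -> list[str]:
    """Return a list of rule violations; an empty list means valid."""
    errors: list[str] = []

    # Rule 1: required fields must exist.
    missing = [f for f in REQUIRED_FIELDS if f not in record]
    if missing:
        errors.append(f"missing required fields: {', '.join(missing)}")

    # Rules 2 and 3: lineage must be an object with its three URLs.
    if "lineage" in record:
        lineage = record["lineage"]
        if not isinstance(lineage, dict):
            errors.append("lineage must be an object")
        else:
            absent = [f for f in LINEAGE_FIELDS if f not in lineage]
            if absent:
                errors.append(f"lineage missing: {', '.join(absent)}")

    # Rule 4: key string fields must be non-empty.
    for name in NON_EMPTY_FIELDS:
        if name in record:
            value = record[name]
            if not isinstance(value, str) or not value.strip():
                errors.append(f"{name} must be a non-empty string")

    return errors
```

Collecting every violation instead of raising on the first one is a design choice of this sketch; it lets a single pass report all problems with a record.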

Determinism Rule

  • record_key = sha256("{source}:{curriculum_id}")[:20]
  • The same source and curriculum_id always produce the same record_key, so a record keeps its identity across runs for dedupe and checkpoint continuity.
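
A minimal sketch of the key derivation, assuming the input string is UTF-8 encoded and that the [:20] slice is taken from the hexadecimal digest (the rule above does not say which). The function name make_record_key is hypothetical.

```python
import hashlib


def make_record_key(source: str, curriculum_id: str) -> str:
    """Derive a 20-character key from source and curriculum_id."""
    # Assumptions: UTF-8 input encoding, hex digest truncated to 20 chars.
    raw = f"{source}:{curriculum_id}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:20]
```

Calling make_record_key twice with the same arguments returns the same string, which is the property dedupe and checkpointing rely on.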

Checkpoint + Resume Integration

Checkpoint state records completed/failed URLs and attempt counters per URL. On rerun:

  • completed URLs are skipped
  • failed URLs are retried using exponential backoff
  • processing continues for other curriculum pages (failure isolation)

Implementation: app/services/scrapers/checkpoint.py
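
The resume behaviour described in this section can be sketched as follows. This is an in-memory illustration under stated assumptions, not the code in checkpoint.py: the Checkpoint class, run_scrape, backoff_delay, and the base/cap delay values are all hypothetical, and persistence of the checkpoint state is left out.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class Checkpoint:
    completed: set[str] = field(default_factory=set)
    failed: set[str] = field(default_factory=set)
    attempts: dict[str, int] = field(default_factory=dict)


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff: base, 2*base, 4*base, ... capped at cap seconds."""
    return min(cap, base * 2 ** (attempt - 1))


def run_scrape(
    urls: Iterable[str],
    scrape: Callable[[str], None],
    checkpoint: Checkpoint,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    for url in urls:
        if url in checkpoint.completed:
            continue  # completed URLs are skipped on rerun

        prior = checkpoint.attempts.get(url, 0)
        if url in checkpoint.failed and prior > 0:
            # Previously failed: wait longer the more attempts were made.
            sleep(backoff_delay(prior))

        checkpoint.attempts[url] = prior + 1
        try:
            scrape(url)
        except Exception:
            # Failure isolation: record the failure, keep processing others.
            checkpoint.failed.add(url)
            continue

        checkpoint.failed.discard(url)
        checkpoint.completed.add(url)
```

Passing sleep as a parameter is a choice made for this sketch so the backoff can be observed in tests without actually waiting.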