Scraper Metadata Contract (Issue #24)
This document defines the required metadata emitted by curriculum scrapers before persistence/ingestion handoff.
Contract Version
contract_version: v1
Required Fields
| Field | Type | Description |
|---|---|---|
| record_key | string | Deterministic key derived from source + curriculum_id |
| source_url | string | Canonical URL of the source artifact (PDF/doc) |
| curriculum_id | string | Stable curriculum identifier (URL stem fallback to title slug) |
| curriculum_name | string | Human-readable title from scraped row |
| grade | string | Normalized grade/tier inference |
| subject | string | Normalized subject inference |
| language | string | Detected language label |
| version_hash | string | SHA-256 hash over key metadata fields |
| crawl_timestamp | string (ISO 8601) | Capture timestamp at scrape time |
| lineage | object | Source lineage chain |
lineage object
- sitemap_url: root sitemap where page discovery started
- page_url: curriculum list page where document was found
- source_url: final document URL
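To make the shape of a record concrete, the required fields and the lineage object can be written down as typed dictionaries. This is a minimal sketch: the type names `Lineage` and `ScraperMetadata` are hypothetical, and every value in the example record (URLs, hashes, labels) is invented for illustration rather than taken from a real crawl.

```python
from typing import TypedDict


class Lineage(TypedDict):
    sitemap_url: str  # root sitemap where page discovery started
    page_url: str     # curriculum list page where the document was found
    source_url: str   # final document URL


class ScraperMetadata(TypedDict):
    record_key: str
    source_url: str
    curriculum_id: str
    curriculum_name: str
    grade: str
    subject: str
    language: str
    version_hash: str
    crawl_timestamp: str  # ISO 8601 string
    lineage: Lineage


# Illustrative record only; all values are placeholders.
example_record: ScraperMetadata = {
    "record_key": "0123456789abcdef0123",
    "source_url": "https://example.org/docs/math-grade-5.pdf",
    "curriculum_id": "math-grade-5",
    "curriculum_name": "Mathematics Grade 5",
    "grade": "5",
    "subject": "mathematics",
    "language": "en",
    "version_hash": "0" * 64,
    "crawl_timestamp": "2024-01-01T00:00:00+00:00",
    "lineage": {
        "sitemap_url": "https://example.org/sitemap.xml",
        "page_url": "https://example.org/curricula",
        "source_url": "https://example.org/docs/math-grade-5.pdf",
    },
}
```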
Validation Rules
Validation is performed in app/services/scrapers/metadata.py:
- Required fields must exist
- lineage must be an object
- lineage must include sitemap_url, page_url, source_url
- Key string fields must be non-empty
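The rules above could be checked roughly as follows. This is a sketch, not the contents of the validation module: the function name `validate_metadata` and the constants are hypothetical, and because this document does not say which fields count as "key string fields", the sketch assumes every required field except lineage.

```python
from typing import Any, Dict, List

REQUIRED_FIELDS = (
    "record_key", "source_url", "curriculum_id", "curriculum_name",
    "grade", "subject", "language", "version_hash", "crawl_timestamp",
    "lineage",
)
LINEAGE_KEYS = ("sitemap_url", "page_url", "source_url")
# Assumption: all required fields other than lineage must be non-empty strings.
NON_EMPTY_STRING_FIELDS = tuple(f for f in REQUIRED_FIELDS if f != "lineage")


def validate_metadata(record: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable errors; an empty list means valid."""
    errors: List[str] = []

    # Rule 1: required fields must exist.
    for name in REQUIRED_FIELDS:
        if name not in record:
            errors.append(f"missing required field: {name}")

    # Rules 2 and 3: lineage must be an object with the three URL keys.
    if "lineage" in record:
        lineage = record["lineage"]
        if not isinstance(lineage, dict):
            errors.append("lineage must be an object")
        else:
            for key in LINEAGE_KEYS:
                if key not in lineage:
                    errors.append(f"lineage missing key: {key}")

    # Rule 4: key string fields must be non-empty.
    for name in NON_EMPTY_STRING_FIELDS:
        if name in record:
            value = record[name]
            if not isinstance(value, str) or not value.strip():
                errors.append(f"{name} must be a non-empty string")

    return errors
```

Returning a list of errors rather than raising on the first failure is a design choice of this sketch; it lets a caller log every problem with a record in one pass.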
Determinism Rule
- record_key = sha256("{source}:{curriculum_id}")[:20]
- This guarantees deterministic identity for dedupe/checkpoint continuity.
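The key formula can be sketched in Python as shown here. The helper name `make_record_key` is hypothetical, and two details are assumptions because the formula does not state them: the input string is UTF-8 encoded, and the 20-character slice is taken from the hexadecimal digest.

```python
import hashlib


def make_record_key(source: str, curriculum_id: str) -> str:
    """Derive a deterministic 20-character key from source and curriculum_id."""
    raw = f"{source}:{curriculum_id}"
    # Assumption: UTF-8 input and a hex digest, truncated to 20 characters.
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:20]
```

Because the key depends only on its two inputs, scraping the same curriculum from the same source on a later run yields the same key, which is what makes dedupe and checkpoint lookups possible.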
Checkpoint + Resume Integration
Checkpoint state records completed/failed URLs and attempt counters per URL. On rerun:
- completed URLs are skipped
- failed URLs are retried using exponential backoff
- processing continues for other curriculum pages (failure isolation)
Implementation: app/services/scrapers/checkpoint.py
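The resume behavior described in this section can be illustrated with a small in-memory sketch. It is not the contents of the checkpoint module: `CheckpointState`, `run_scrape`, the backoff base of 1 second, and the 60 second cap are all hypothetical choices, and a real implementation would also persist the state between runs.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Set


@dataclass
class CheckpointState:
    completed: Set[str] = field(default_factory=set)
    failed: Set[str] = field(default_factory=set)
    attempts: Dict[str, int] = field(default_factory=dict)

    def record_success(self, url: str) -> None:
        self.attempts[url] = self.attempts.get(url, 0) + 1
        self.failed.discard(url)
        self.completed.add(url)

    def record_failure(self, url: str) -> None:
        self.attempts[url] = self.attempts.get(url, 0) + 1
        self.failed.add(url)

    def backoff_seconds(self, url: str, base: float = 1.0, cap: float = 60.0) -> float:
        """Exponential backoff: 0 for a first attempt, then base, 2*base, 4*base, ..."""
        attempts = self.attempts.get(url, 0)
        if attempts == 0:
            return 0.0
        return min(cap, base * 2 ** (attempts - 1))


def run_scrape(
    urls: Iterable[str],
    scrape: Callable[[str], None],
    state: CheckpointState,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    for url in urls:
        if url in state.completed:
            continue  # completed URLs are skipped on rerun
        sleep(state.backoff_seconds(url))  # wait longer after each prior failure
        try:
            scrape(url)
        except Exception:
            # Failure isolation: record the failure and move on to the next URL.
            state.record_failure(url)
            continue
        state.record_success(url)
```

Passing `sleep` as a parameter keeps the loop testable without real delays. Calling `run_scrape` a second time with the same state skips finished URLs and retries only the failed ones.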