Scraper Metadata Contract (Issue #24)

This document defines the required metadata emitted by curriculum scrapers before persistence/ingestion handoff.

Contract Version

Field	Type	Description
`record_key`	string	Deterministic key derived from `source + curriculum_id`
`source_url`	string	Canonical URL of the source artifact (PDF/doc)
`curriculum_id`	string	Stable curriculum identifier (URL stem fallback to title slug)
`curriculum_name`	string	Human-readable title from scraped row
`grade`	string	Normalized grade/tier inference
`subject`	string	Normalized subject inference
`language`	string	Detected language label
`version_hash`	string	SHA-256 hash over key metadata fields
`crawl_timestamp`	string (ISO 8601)	Capture timestamp at scrape time
`lineage`	object	Source lineage chain

Validation is performed in app/services/scrapers/metadata.py:

Checkpoint state records completed/failed URLs and attempt counters per URL. On rerun:

Implementation: app/services/scrapers/checkpoint.py