ADR-0001: Chapter-Aware Curriculum Ingestion
Status: Proposed
Date: 2026-04-21
Deciders: Soueilem (@lebjawi)
Context
The current BacMR ingestion pipeline treats each curriculum PDF as a single document. Chunks are created using fixed token-size splitting (512 tokens with overlap), which means:
- A query about "Chapter 3" returns scattered chunks, including chunks that straddle chapter boundaries
- There is no concept of chapter- or lesson-level granularity
- Students cannot retrieve "all of Chapter 3" as a coherent unit
- Sub-chapter structure (sections, lessons, units) is invisible to retrieval
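To make the boundary problem concrete, here is a minimal sketch of fixed-size chunking over a toy token stream. The 512-token window comes from the pipeline description above. The 64-token overlap, the function name `fixed_size_chunks`, and the 700-token chapters are assumptions chosen only for illustration.

```python
def fixed_size_chunks(tokens, size=512, overlap=64):
    """Split a token list into fixed-size windows that overlap by `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]


# Toy document: 700 tokens of chapter 2 followed by 700 tokens of chapter 3.
# Each token is tagged with the chapter it came from so we can inspect chunks.
tokens = [("ch2", i) for i in range(700)] + [("ch3", i) for i in range(700)]

chunks = fixed_size_chunks(tokens)

# Indices of chunks that contain tokens from more than one chapter.
mixed = [
    index
    for index, chunk in enumerate(chunks)
    if len({chapter for chapter, _ in chunk}) > 1
]
print(mixed)  # [1]
```

Chunk 1 mixes the end of one chapter with the start of the next, and nothing in the chunk itself records which chapter it belongs to.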
With 10+ curricula across 3 languages (French, English, Arabic), the system needs chapter-aware ingestion that:
- Splits curricula into chapter/lesson-level units before chunking
- Preserves chapter metadata through the entire pipeline
- Handles the unique structure of each language's curriculum format
- Provides fallback behavior when table-of-contents (TOC) extraction fails
Decision
We will implement a pre-processing ChapterSplitter that runs before the existing IngestionService. The existing ingestion pipeline remains unchanged; only its input changes (one chapter PDF at a time instead of one full curriculum PDF).
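The shape of this pre-processing step could look roughly like the sketch below. It is not the actual ChapterSplitter or IngestionService API: `ChapterUnit`, `split_into_chapters`, `ingest_curriculum`, the `ingest_chapter` callback, and the sample curriculum id are all hypothetical names, and the sketch assumes chapter start pages are already known.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class ChapterUnit:
    """Metadata that travels with one chapter through ingestion (hypothetical)."""
    curriculum_id: str
    language: str
    chapter_number: int
    title: str
    start_page: int  # 1-based, inclusive
    end_page: int    # 1-based, inclusive


def split_into_chapters(
    curriculum_id: str,
    language: str,
    total_pages: int,
    chapter_starts: List[Tuple[int, str]],
) -> List[ChapterUnit]:
    """Turn sorted (start_page, title) pairs into page ranges.

    Each chapter ends on the page before the next one starts; the last
    chapter runs to the end of the document. Pages before the first
    chapter (front matter) are not assigned to any unit in this sketch.
    """
    units = []
    for index, (start_page, title) in enumerate(chapter_starts):
        if index + 1 < len(chapter_starts):
            end_page = chapter_starts[index + 1][0] - 1
        else:
            end_page = total_pages
        units.append(
            ChapterUnit(curriculum_id, language, index + 1, title, start_page, end_page)
        )
    return units


def ingest_curriculum(
    units: Iterable[ChapterUnit],
    ingest_chapter: Callable[[ChapterUnit], None],
) -> None:
    """Feed chapters one at a time to the unchanged ingestion step."""
    for unit in units:
        ingest_chapter(unit)


units = split_into_chapters(
    "math-7e", "fr", 60, [(5, "Nombres"), (20, "Fonctions"), (41, "Géométrie")]
)
ingested = []
ingest_curriculum(units, ingested.append)  # stand-in for the real ingestion call
```

Passing the ingestion step in as a callback keeps the splitter unaware of how ingestion works, which matches the intent that only the input to the pipeline changes.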
Chapter Splitting Strategy
- Split unit: Chapter or lesson (whichever is the natural unit in each curriculum)
- Split identification: Pattern-based detection per language
- Fallback layers: TOC extraction → Header scan → Heuristic split → Full document
- Language support: French, English, Arabic (RTL)
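The fallback layers and per-language patterns listed above can be sketched as an ordered chain of strategies, where the first one that finds any boundaries wins. Everything here is an assumption made for illustration: the regular expressions, the function names, and the `(chapter_number, start_page)` boundary shape. The TOC and heuristic layers are left as placeholders that always defer to the next layer.

```python
import re
from typing import Callable, List, Tuple

Boundaries = List[Tuple[int, int]]  # (chapter_number, start_page), pages are 1-based

# Assumed heading patterns; real curricula may use "Leçon", "Unit", "الوحدة", etc.
CHAPTER_PATTERNS = {
    "fr": re.compile(r"^\s*Chapitre\s+(\d+)", re.IGNORECASE | re.MULTILINE),
    "en": re.compile(r"^\s*Chapter\s+(\d+)", re.IGNORECASE | re.MULTILINE),
    "ar": re.compile(r"^\s*الفصل\s+(\d+)", re.MULTILINE),
}


def from_toc(pages: List[str], lang: str) -> Boundaries:
    """Placeholder for TOC extraction; returns nothing in this sketch."""
    return []


def from_headers(pages: List[str], lang: str) -> Boundaries:
    """Scan each page for a chapter heading in the given language."""
    pattern = CHAPTER_PATTERNS[lang]
    found = []
    for page_number, text in enumerate(pages, start=1):
        match = pattern.search(text)
        if match:
            # int() also accepts Arabic-Indic digits such as "٣".
            found.append((int(match.group(1)), page_number))
    return found


def from_heuristics(pages: List[str], lang: str) -> Boundaries:
    """Placeholder for heuristic splitting; returns nothing in this sketch."""
    return []


def full_document(pages: List[str], lang: str) -> Boundaries:
    """Last resort: treat the whole document as a single unit."""
    return [(1, 1)]


STRATEGIES: List[Tuple[str, Callable[[List[str], str], Boundaries]]] = [
    ("toc", from_toc),
    ("headers", from_headers),
    ("heuristic", from_heuristics),
    ("full_document", full_document),
]


def detect_chapters(pages: List[str], lang: str) -> Tuple[str, Boundaries]:
    """Return the name of the first strategy that succeeds and its boundaries."""
    for name, strategy in STRATEGIES:
        boundaries = strategy(pages, lang)
        if boundaries:
            return name, boundaries
    raise RuntimeError("unreachable: full_document always returns a boundary")
```

Returning the strategy name alongside the boundaries makes it possible to record which layer produced a split, which may help when reviewing curricula that fell through to the later fallbacks.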
Asset Extraction Strategy
- Extract visual content (images, tables, diagrams) alongside text
- Classify each image as content or decorative
- Generate semantic descriptions for content-relevant images
- Preserve original images for display; store their text descriptions for search
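One possible shape for the extracted assets is sketched below. The `Asset` record, the size-based `is_content` rule, and the injected `describe` callback are assumptions for illustration. This ADR does not specify how classification is done, and the size threshold is only a stand-in for whatever classifier is eventually chosen. The vision model call is passed in as a plain function so that no particular API is implied.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Asset:
    """One visual element extracted from a chapter (hypothetical record)."""
    chapter_number: int
    page: int
    kind: str          # "image", "table", or "diagram"
    image_path: str    # original file, kept for display
    width: int
    height: int
    description: Optional[str] = None  # searchable text; set only for content assets


def is_content(asset: Asset, min_side: int = 100) -> bool:
    """Placeholder rule: tiny images are treated as decorative (icons, bullets).

    Tables and diagrams are always treated as content in this sketch.
    """
    if asset.kind != "image":
        return True
    return min(asset.width, asset.height) >= min_side


def describe_assets(
    assets: List[Asset],
    describe: Callable[[Asset], str],
) -> List[Asset]:
    """Attach a description to content assets; decorative ones are left as-is."""
    for asset in assets:
        if is_content(asset):
            asset.description = describe(asset)
    return assets


assets = [
    Asset(3, 42, "image", "ch3/p42_icon.png", 32, 32),
    Asset(3, 43, "diagram", "ch3/p43_triangle.png", 640, 480),
]
# A fake describer stands in for the vision model call.
describe_assets(assets, lambda a: f"{a.kind} on page {a.page}")
```

Keeping `image_path` and `description` on the same record reflects the intent of showing the original image while searching over its text.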
Consequences
Positive
- Chapter-level retrieval granularity
- Visual content preserved and searchable
- Handles multi-language curricula uniformly
- Existing ingestion pipeline unchanged
Negative
- Additional pre-processing step adds latency to ingestion
- Rollback is more complex when chapter boundaries are detected incorrectly
- Asset extraction requires vision model API calls
Risks
- Arabic RTL text extraction may be inconsistent
- Some curricula may have non-standard chapter structures
- The TOC may be missing or unreliable in elementary-level books