ADR-0001: Chapter-Aware Curriculum Ingestion

Status: Proposed
Date: 2026-04-21
Deciders: Soueilem (@lebjawi)

Context

The current BacMR ingestion pipeline treats each curriculum PDF as a single document. Chunks are created by fixed-size token splitting (512 tokens with overlap), with no regard for document structure, which means:

A query about "Chapter 3" returns scattered chunks, some of which cut across chapter boundaries
There is no concept of chapter- or lesson-level granularity
Students cannot retrieve "all of Chapter 3" as a coherent unit
Sub-chapter structure (sections, lessons, units) is invisible to retrieval
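
The boundary problem can be illustrated with a toy example. This is a minimal sketch, not the BacMR implementation: the function name `fixed_size_chunks` and the overlap of 64 tokens are assumptions (the pipeline description above only says "with overlap"), and the token list is synthetic.

```python
def fixed_size_chunks(tokens, size=512, overlap=64):
    """Split a token list into fixed-size windows that overlap by `overlap` tokens."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

# Synthetic document: chapter 2 has 700 tokens, chapter 3 has 600 tokens.
tokens = [("ch2", i) for i in range(700)] + [("ch3", i) for i in range(600)]
chunks = fixed_size_chunks(tokens)

# Chunks that contain tokens from more than one chapter.
mixed = [c for c in chunks if len({chapter for chapter, _ in c}) > 1]
print(len(chunks), len(mixed))
```

Any window that happens to span the chapter boundary mixes content from both chapters, and no chunk records which chapter it came from.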

With 10+ curricula across 3 languages (French, English, Arabic), the system needs chapter-aware ingestion that:

Splits curricula into chapter/lesson-level units before chunking
Preserves chapter metadata through the entire pipeline
Handles the unique structure of each language's curriculum format
Provides fallback behavior when TOC extraction fails

Decision

We will implement a pre-processing step, ChapterSplitter, that runs before the existing IngestionService. The existing ingestion pipeline remains unchanged; only its input changes (one chapter PDF at a time instead of one full curriculum PDF).
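
One way this wiring could look is sketched below. Everything here is hypothetical: `ChapterUnit`, `ingest_curriculum`, and the `ingest_one` callable (a stand-in for whatever entry point IngestionService actually exposes) are illustrative names, and the metadata keys are assumptions. The point is only that the ingestion step is called once per chapter and every resulting chunk is tagged with chapter metadata.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass(frozen=True)
class ChapterUnit:
    """One chapter or lesson produced by the splitter (hypothetical shape)."""
    curriculum_id: str
    chapter_index: int
    title: str
    language: str
    start_page: int  # inclusive, 0-based
    end_page: int    # exclusive

def ingest_curriculum(
    chapters: Iterable[ChapterUnit],
    ingest_one: Callable[[ChapterUnit], List[dict]],
) -> List[dict]:
    """Call the unchanged ingestion step per chapter and tag each chunk."""
    tagged = []
    for chapter in chapters:
        for chunk in ingest_one(chapter):
            tagged.append({
                **chunk,
                "curriculum_id": chapter.curriculum_id,
                "chapter_index": chapter.chapter_index,
                "chapter_title": chapter.title,
                "language": chapter.language,
            })
    return tagged

# Usage with a fake ingestion step that returns two chunks per chapter.
demo_chapters = [ChapterUnit("math-7", 1, "Les nombres", "fr", 4, 20)]
demo_chunks = ingest_curriculum(demo_chapters, lambda ch: [{"text": "a"}, {"text": "b"}])
```

Because the metadata is attached outside the ingestion step, the existing service does not need to know that its input is a chapter rather than a full curriculum.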

Chapter Splitting Strategy

Split unit: Chapter or lesson (whichever is the natural unit in each curriculum)
Split identification: Pattern-based detection per language
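Fallback layers: TOC extraction → Header scan → Heuristic split → Full document (each layer is tried only if the previous one fails; the last layer ingests the curriculum unsplit)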
Language support: French, English, Arabic (RTL)
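
The fallback layers and per-language pattern detection could be combined roughly as follows. This is a sketch under stated assumptions, not the planned implementation: the heading patterns are illustrative guesses at common chapter headings, the TOC layer is a placeholder that always falls through, and the heuristic layer (a split every 20 pages) is an arbitrary stand-in because the strategy above does not define it. Input is assumed to be already-extracted page text.

```python
import re
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical heading patterns per language; real curricula may need others.
HEADER_PATTERNS = {
    "fr": re.compile(r"^\s*(?:Chapitre|Leçon)\s+\d+", re.IGNORECASE),
    "en": re.compile(r"^\s*(?:Chapter|Lesson|Unit)\s+\d+", re.IGNORECASE),
    "ar": re.compile(r"^\s*(?:الفصل|الدرس|الوحدة)\s+\S+"),
}

Boundaries = List[int]  # 0-based page indices where a chapter starts
Strategy = Callable[[Sequence[str], str], Optional[Boundaries]]

def from_toc(pages: Sequence[str], lang: str) -> Optional[Boundaries]:
    # Placeholder: TOC parsing is not sketched here, so this layer always fails over.
    return None

def from_headers(pages: Sequence[str], lang: str) -> Optional[Boundaries]:
    pattern = HEADER_PATTERNS.get(lang)
    if pattern is None:
        return None
    starts = [i for i, text in enumerate(pages)
              if any(pattern.match(line) for line in text.splitlines())]
    return starts or None

def from_heuristic(pages: Sequence[str], lang: str,
                   pages_per_unit: int = 20) -> Optional[Boundaries]:
    # Arbitrary stand-in heuristic: cut every `pages_per_unit` pages.
    if len(pages) <= pages_per_unit:
        return None
    return list(range(0, len(pages), pages_per_unit))

LAYERS: List[Tuple[str, Strategy]] = [
    ("toc", from_toc),
    ("header_scan", from_headers),
    ("heuristic", from_heuristic),
]

def find_boundaries(pages: Sequence[str], lang: str) -> Tuple[str, Boundaries]:
    """Return the name of the first layer that succeeds and its boundaries."""
    for name, strategy in LAYERS:
        result = strategy(pages, lang)
        if result:
            return name, result
    return "full_document", [0]  # last resort: treat the PDF as one unit

pages_fr = ["Avant-propos", "Chapitre 1 : Les nombres\n...", "suite", "Chapitre 2 : Géométrie"]
print(find_boundaries(pages_fr, "fr"))
```

Returning the layer name alongside the boundaries makes it possible to record which fallback produced a given split, which may help when auditing wrong chapter boundaries.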

Asset Extraction Strategy

Extract visual content (images, tables, diagrams) alongside text
Classify images as content or decorative
Generate semantic descriptions for content-relevant images
Preserve original images for display; store text for searchability
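
The content-versus-decorative step could start from a cheap heuristic before any vision model is called. The sketch below is an assumption, not a decided design: `ImageAsset`, `classify_images`, and both thresholds are hypothetical, and the rule (very small images, or images repeated on many pages, are treated as decorative) is one plausible heuristic among several.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

@dataclass(frozen=True)
class ImageAsset:
    """An extracted image occurrence (hypothetical shape)."""
    page: int
    width: int
    height: int
    content_hash: str  # identical images share the same hash

def classify_images(images: List[ImageAsset], total_pages: int,
                    min_side: int = 80,
                    max_page_ratio: float = 0.3) -> Dict[str, str]:
    """Label each distinct image as 'content' or 'decorative'."""
    pages_per_hash: Dict[str, Set[int]] = {}
    for img in images:
        pages_per_hash.setdefault(img.content_hash, set()).add(img.page)

    labels: Dict[str, str] = {}
    for img in images:
        # Icons and bullets tend to be tiny.
        too_small = min(img.width, img.height) < min_side
        # Logos and page ornaments tend to repeat on a large share of pages.
        repeated = len(pages_per_hash[img.content_hash]) / total_pages > max_page_ratio
        labels[img.content_hash] = "decorative" if (too_small or repeated) else "content"
    return labels

demo_images = (
    [ImageAsset(p, 200, 200, "logo") for p in range(5)]
    + [ImageAsset(3, 600, 400, "diagram"), ImageAsset(2, 16, 16, "bullet")]
)
demo_labels = classify_images(demo_images, total_pages=10)
```

Only images labeled as content would then be sent on for semantic description, which limits the number of vision model calls.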

Consequences

Positive

Chapter-level retrieval granularity
Visual content preserved and searchable
Handles multi-language curricula uniformly
Existing ingestion pipeline unchanged

Negative

Additional pre-processing step adds latency to ingestion
Rollback is more complex if chapter boundaries are detected incorrectly
Asset extraction requires vision model API calls

Risks

Arabic RTL text extraction may be inconsistent
Some curricula may have non-standard chapter structures
TOC may be missing or unreliable in elementary-level textbooks