ADR-0001: Chapter-Aware Curriculum Ingestion

Status: Proposed
Date: 2026-04-21
Deciders: Soueilem (@lebjawi)


Context

The current BacMR ingestion pipeline treats each curriculum PDF as a single document. Chunks are created by fixed-size token splitting (512 tokens with overlap), with no awareness of document structure. As a result:

  • A query about "Chapter 3" returns scattered chunks, some of which straddle chapter boundaries
  • There is no concept of chapter-level or lesson-level granularity
  • Students cannot retrieve "all of Chapter 3" as a coherent unit
  • Sub-chapter structure (sections, lessons, units) is invisible to retrieval
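
To make the boundary problem concrete, here is a minimal sketch of fixed-size splitting over a toy document whose tokens are tagged with their chapter. The 64-token overlap and the chapter lengths are assumptions chosen for illustration; the pipeline's actual overlap is not stated here.

```python
def fixed_size_chunks(tokens, size=512, overlap=64):
    """Split a token list into fixed-size windows that overlap by `overlap` tokens."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]


# Toy document: each token is replaced by the name of the chapter it belongs to.
doc = ["ch1"] * 700 + ["ch2"] * 900 + ["ch3"] * 600

chunks = fixed_size_chunks(doc)

# Indices of chunks that contain tokens from more than one chapter.
mixed = [i for i, chunk in enumerate(chunks) if len(set(chunk)) > 1]
print(len(chunks), mixed)  # 5 [1, 3]
```

Two of the five chunks mix content from neighboring chapters, and nothing in a chunk records which chapter it came from.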

With 10+ curricula across 3 languages (French, English, Arabic), the system needs chapter-aware ingestion that:

  1. Splits curricula into chapter/lesson-level units before chunking
  2. Preserves chapter metadata through the entire pipeline
  3. Handles the unique structure of each language's curriculum format
  4. Provides fallback behavior when table-of-contents (TOC) extraction fails

Decision

We will implement a pre-processing ChapterSplitter that runs before the existing IngestionService. The existing ingestion pipeline remains unchanged; only its input changes: it receives one chapter PDF at a time instead of one full curriculum PDF.
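
One way to picture this arrangement is the sketch below. It is illustrative only: `ChapterUnit`, `ingest_by_chapter`, and the `ingest_one` callable are hypothetical names, and `ingest_one` merely stands in for the existing IngestionService, whose real interface is not described in this ADR. The point is that the splitter's output wraps the unchanged ingestion step and stamps every resulting chunk with chapter metadata.

```python
from dataclasses import asdict, dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ChapterUnit:
    """One chapter or lesson produced by the splitter (hypothetical shape)."""
    curriculum_id: str
    chapter_number: int
    title: str
    language: str    # "fr", "en" or "ar"
    page_start: int  # first page of the chapter, 1-based, inclusive
    page_end: int    # last page of the chapter, inclusive


def ingest_by_chapter(
    units: List[ChapterUnit],
    ingest_one: Callable[[ChapterUnit], List[Dict]],
) -> List[Dict]:
    """Run the unchanged ingestion step per chapter and attach chapter metadata."""
    stamped = []
    for unit in units:
        for chunk in ingest_one(unit):
            stamped.append({**chunk, "chapter": asdict(unit)})
    return stamped


def fake_ingest(unit: ChapterUnit) -> List[Dict]:
    """Stand-in for the existing pipeline: returns two dummy chunks per chapter."""
    return [{"text": f"{unit.title} / chunk {i}"} for i in range(2)]


units = [
    ChapterUnit("math-7-fr", 1, "Les nombres", "fr", 5, 24),
    ChapterUnit("math-7-fr", 2, "La géométrie", "fr", 25, 48),
]
chunks = ingest_by_chapter(units, fake_ingest)

# Retrieving "all of Chapter 2" becomes a metadata filter.
chapter_2 = [c for c in chunks if c["chapter"]["chapter_number"] == 2]
```

Because the metadata travels with each chunk, retrieving a whole chapter reduces to a filter rather than a similarity search.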

Chapter Splitting Strategy

  • Split unit: Chapter or lesson (whichever is the natural unit in each curriculum)
  • Split identification: Pattern-based detection per language
  • Fallback layers: TOC extraction → Header scan → Heuristic split → Full document
  • Language support: French, English, Arabic (RTL)
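
The fallback layers and per-language patterns could be sketched as follows. Everything here is an assumption made for illustration: the regular expressions are simplified guesses at common heading forms, not the patterns the project will use; the function names are hypothetical; the TOC layer is a stub that pretends no TOC was found; and the heuristic split layer is omitted because this ADR does not define it.

```python
import re
from typing import Callable, List, Tuple

Boundary = Tuple[int, str]  # (page index, heading text)
Strategy = Callable[[List[str], str], List[Boundary]]

# Illustrative heading patterns per language (assumed, not exhaustive).
CHAPTER_PATTERNS = {
    "fr": re.compile(r"^\s*(?:Chapitre|Leçon)\s+(\d+)", re.IGNORECASE),
    "en": re.compile(r"^\s*(?:Chapter|Lesson|Unit)\s+(\d+)", re.IGNORECASE),
    # Arabic headings often use ordinal words, so accept any non-space token.
    "ar": re.compile(r"^\s*(?:الفصل|الدرس|الوحدة)\s+(\S+)"),
}


def toc_extraction(pages: List[str], lang: str) -> List[Boundary]:
    """Stub for the TOC layer: behaves as if the PDF had no usable TOC."""
    return []


def header_scan(pages: List[str], lang: str) -> List[Boundary]:
    """Return the first line on each page that looks like a chapter heading."""
    pattern = CHAPTER_PATTERNS[lang]
    found = []
    for index, text in enumerate(pages):
        for line in text.splitlines():
            if pattern.match(line):
                found.append((index, line.strip()))
                break
    return found


def split_with_fallback(
    pages: List[str], lang: str, strategies: List[Tuple[str, Strategy]]
) -> Tuple[str, List[Boundary]]:
    """Try each layer in order; fall back to the full document if all fail."""
    for name, strategy in strategies:
        boundaries = strategy(pages, lang)
        if boundaries:
            return name, boundaries
    return "full_document", [(0, "Full document")]


pages = [
    "Préface",
    "Chapitre 1 : Les nombres\n...",
    "suite du chapitre",
    "Chapitre 2 : La géométrie\n...",
]
layers = [("toc", toc_extraction), ("header_scan", header_scan)]
layer, boundaries = split_with_fallback(pages, "fr", layers)
print(layer, boundaries)
```

Returning the name of the layer that succeeded makes it possible to log how often each fallback is used per curriculum.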

Asset Extraction Strategy

  • Extract visual content (images, tables, diagrams) alongside text
  • Classify images as content-bearing or decorative
  • Generate semantic descriptions for content-relevant images
  • Preserve original images for display; store the generated descriptions as text so the images are searchable
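
As a rough sketch of the classification and description steps, the code below filters out likely decorative images and attaches a description to the rest. The heuristic (very small images, or images repeated on many pages such as logos, count as decorative), the thresholds, and all names are assumptions for illustration; the `describe` callable is a placeholder for whatever vision model call is eventually chosen.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ImageAsset:
    image_id: str
    width: int       # pixels
    height: int      # pixels
    page_count: int  # number of pages on which the same image appears
    description: Optional[str] = None


def is_decorative(asset: ImageAsset, min_side: int = 64, max_repeat: int = 3) -> bool:
    """Assumed heuristic: tiny or frequently repeated images are decorative."""
    return min(asset.width, asset.height) < min_side or asset.page_count > max_repeat


def describe_content_images(
    assets: List[ImageAsset], describe: Callable[[ImageAsset], str]
) -> List[ImageAsset]:
    """Keep content images only and store a text description on each."""
    kept = []
    for asset in assets:
        if is_decorative(asset):
            continue
        asset.description = describe(asset)
        kept.append(asset)
    return kept


assets = [
    ImageAsset("logo", 120, 40, page_count=80),
    ImageAsset("bullet", 16, 16, page_count=1),
    ImageAsset("fig-3-1", 800, 600, page_count=1),
]
kept = describe_content_images(
    assets, lambda a: f"placeholder description for {a.image_id}"
)
print([a.image_id for a in kept])  # ['fig-3-1']
```

Filtering before description keeps vision model calls limited to images that can actually help retrieval.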

Consequences

Positive

  • Chapter-level retrieval granularity
  • Visual content preserved and searchable
  • Handles multi-language curricula uniformly
  • Existing ingestion pipeline unchanged

Negative

  • Additional pre-processing step adds latency to ingestion
  • Rollback becomes more complex when chapter boundaries are detected incorrectly
  • Asset extraction requires vision model API calls

Risks

  • Arabic RTL text extraction may be inconsistent
  • Some curricula may have non-standard chapter structures
  • The TOC may be missing or unreliable in elementary-level textbooks

References