
Implementation Plan: Chapter Splitting + Asset Extraction

Goal: Incrementally implement chapter-aware ingestion with visual content extraction for BacMR curricula.


Overview

The implementation is split into 4 phases, each building on the previous:

| Phase | Focus | Duration | Risk |
| --- | --- | --- | --- |
| Phase 1 | Chapter splitting only | ~1 week | Low |
| Phase 2 | Asset extraction (images + tables) | ~1 week | Medium |
| Phase 3 | Retrieval integration + visual query boosting | ~3 days | Low |
| Phase 4 | Review + refinement | Ongoing | Low |

Phase 1: Chapter Splitting

Goal

Split each curriculum PDF into chapter/lesson units before ingestion, with full fallback layers.

Deliverables

  • [ ] app/services/chapter_splitter.py: ChapterSplitter class
  • [ ] chapter_splits database table
  • [ ] Language pattern registry (fr, en, ar)
  • [ ] 4-layer fallback system (TOC → Header → Heuristic → Full doc)
  • [ ] ChapterSplitter.split() — main entry point
  • [ ] ChapterSplitter.preview() — dry-run for human review
  • [ ] ChapterSplitter.ingest() — split + create DB records + trigger ingestion
  • [ ] Page verification step
  • [ ] Noise phrase filtering
  • [ ] Unit tests for all 3 language patterns
  • [ ] Integration test with French Math book
  • [ ] Integration test with English book
  • [ ] Integration test with Arabic History book

Tasks

1.1 Database Migration

CREATE TABLE chapter_splits (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    reference_id UUID NOT NULL REFERENCES "references"(id),  -- quoted: REFERENCES is a reserved word in PostgreSQL
    chapter_title TEXT NOT NULL,
    chapter_number TEXT,
    chapter_type TEXT DEFAULT 'chapter',
    unit_title TEXT,
    unit_number TEXT,
    language TEXT NOT NULL,
    page_start INTEGER NOT NULL,
    page_end INTEGER NOT NULL,
    parsing_method TEXT NOT NULL,
    confidence FLOAT NOT NULL DEFAULT 1.0,
    needs_review BOOLEAN DEFAULT FALSE,
    ingestion_job_id UUID REFERENCES ingestion_jobs(id),
    ingestion_status TEXT DEFAULT 'pending',
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- "references" must be quoted because REFERENCES is a reserved word in PostgreSQL
ALTER TABLE "references" ADD COLUMN split_status TEXT DEFAULT 'not_applicable';
ALTER TABLE "references" ADD COLUMN total_chapters INTEGER;

1.2 ChapterSplitter Core

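from __future__ import annotations

from uuid import UUID


class ChapterSplitter:
    def split(
        self,
        pdf_bytes: bytes,
        language_hint: str | None = None,
        expected_chapter_count: int | None = None,
    ) -> ChapterSplitResult:
        """Main entry point: detect chapter boundaries via the fallback layers."""
        raise NotImplementedError

    def preview(
        self,
        pdf_bytes: bytes,
        language_hint: str | None = None,
    ) -> ChapterSplitResult:
        """Dry run for human review."""
        raise NotImplementedError

    def ingest(
        self,
        pdf_bytes: bytes,
        reference_id: UUID,
        language_hint: str | None = None,
    ) -> list[dict]:
        """Split, create DB records, and trigger ingestion."""
        raise NotImplementedError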
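
The plan does not define `ChapterSplitResult`. One possible shape is sketched below, assuming it mirrors the `chapter_splits` columns. The `ChapterSplit` class, its field defaults, and the example `parsing_method` labels are illustrative assumptions, not part of the plan.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ChapterSplit:
    """One detected chapter; fields mirror the chapter_splits columns."""
    chapter_title: str
    page_start: int
    page_end: int
    language: str
    parsing_method: str  # assumed labels, e.g. "toc", "header", "heuristic", "full_doc"
    chapter_number: Optional[str] = None
    confidence: float = 1.0
    needs_review: bool = False

    def __post_init__(self) -> None:
        # Reject inverted ranges early instead of at insert time.
        if self.page_end < self.page_start:
            raise ValueError("page_end must be >= page_start")


@dataclass
class ChapterSplitResult:
    chapters: list[ChapterSplit] = field(default_factory=list)

    @property
    def needs_review(self) -> bool:
        """True if any chapter was flagged for manual review."""
        return any(c.needs_review for c in self.chapters)
```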

1.3 Language Pattern Registry

PATTERNS = {
    "fr": TOCPattern(...),   # CHAPITRE, Table des matières
    "en": TOCPattern(...),   # UNIT, Lesson, Table of contents  
    "ar": TOCPattern(...),   # الدرس, الوحدة, الفهرس
}
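
As an illustration of what a registry entry might contain, the sketch below gives `TOCPattern` two compiled regexes per language, built from the keywords in the comments above. The field names (`toc_heading`, `chapter_header`), the regexes themselves, and the `detect_language` helper are assumptions; real textbooks will likely need broader patterns.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TOCPattern:
    toc_heading: re.Pattern      # matches the title of the TOC page
    chapter_header: re.Pattern   # matches a chapter/unit/lesson header line


PATTERNS = {
    "fr": TOCPattern(
        toc_heading=re.compile(r"table\s+des\s+mati[èe]res", re.IGNORECASE),
        chapter_header=re.compile(r"^\s*chapitre\s+(\d+)", re.IGNORECASE | re.MULTILINE),
    ),
    "en": TOCPattern(
        toc_heading=re.compile(r"table\s+of\s+contents", re.IGNORECASE),
        chapter_header=re.compile(r"^\s*(?:unit|lesson)\s+(\d+)", re.IGNORECASE | re.MULTILINE),
    ),
    "ar": TOCPattern(
        toc_heading=re.compile(r"الفهرس"),
        chapter_header=re.compile(r"^\s*(?:الدرس|الوحدة)\s+(\S+)", re.MULTILINE),
    ),
}


def detect_language(text: str) -> Optional[str]:
    """Return the first language whose patterns match `text`, else None."""
    for lang, pattern in PATTERNS.items():
        if pattern.toc_heading.search(text) or pattern.chapter_header.search(text):
            return lang
    return None
```

A helper like `detect_language` would only be needed when `language_hint` is not supplied.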

1.4 Layer Implementations

Layer 1: parse_toc_from_pages()   — scan first 20 pages for TOC
Layer 2: scan_headers_all_pages()  — regex scan for chapter headers
Layer 3: heuristic_split()         — equal page ranges
Layer 4: full_document_fallback()  — single chapter, flag needs_review
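
The four layers can be read as a chain in which each layer either returns page ranges or yields to the next one. The sketch below shows that control flow together with one way to compute the equal page ranges of Layer 3. The names `split_with_fallback` and `PageRange`, the 1-based inclusive page convention, and the returned method labels are assumptions made for illustration.

```python
from typing import Callable, Optional

PageRange = tuple[int, int]  # (page_start, page_end), 1-based inclusive
Layer = Callable[[], Optional[list[PageRange]]]


def heuristic_split(total_pages: int, expected_chapter_count: int) -> list[PageRange]:
    """Layer 3: divide the document into near-equal page ranges."""
    if total_pages < 1:
        raise ValueError("total_pages must be >= 1")
    count = max(1, min(expected_chapter_count, total_pages))
    base, extra = divmod(total_pages, count)
    ranges, start = [], 1
    for i in range(count):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first ranges
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def split_with_fallback(
    layers: list[tuple[str, Layer]], total_pages: int
) -> tuple[str, list[PageRange], bool]:
    """Try each layer in order; return (method, ranges, needs_review)."""
    for name, layer in layers:
        ranges = layer()
        if ranges:
            return name, ranges, False
    # Layer 4: single chapter covering the whole document, flagged for review.
    return "full_doc", [(1, total_pages)], True
```

For example, `heuristic_split(10, 3)` gives `[(1, 4), (5, 7), (8, 10)]`: the remainder page goes to the first range.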

1.5 Testing Strategy

  • Use the 4 sample PDFs we already have:
  • fondamentals/IMR-1AF-M.pdf (Arabic, primary — no TOC)
  • secondaire1s/ANG-1AS-M.pdf (English, 17 units)
  • secondaire2s/MA-4AS-M.pdf (French Math, 18 chapters)
  • secondaire2s/HIS-4AS-M.pdf (Arabic History, 13 lessons — best test case)

Phase 2: Asset Extraction

Goal

Extract images, tables, and diagrams from each chapter PDF; classify content vs decorative; generate semantic descriptions for content images.

Deliverables

  • [ ] app/services/asset_extractor.py: AssetExtractor class
  • [ ] app/services/image_describer.py: ImageDescriber using GPT-4o vision
  • [ ] app/services/table_extractor.py: TableExtractor
  • [ ] curriculum_assets database table
  • [ ] chunks table extensions (content_type, linked_asset_ids)
  • [ ] ContentClassifier with heuristic rules
  • [ ] Integration with existing chunking pipeline
  • [ ] Unit tests for extraction
  • [ ] Unit tests for content classification

Tasks

2.1 AssetExtractor

class AssetExtractor:
    def extract(self, pdf_bytes: bytes) -> "AssetExtractionResult":
        # Uses PyMuPDF:
        # - page.get_images() for image extraction
        # - page.find_tables() for table detection
        # - page.get_text("dict") for text blocks with positions
        # - page.get_drawings() for vector graphics
        raise NotImplementedError

2.2 ContentClassifier

class ContentClassifier:
    def is_content_image(
        self,
        img: "ExtractedImage",
        page_context: str,
    ) -> tuple[bool, float]:
        # Returns (is_content, confidence).
        # Heuristics:
        # - Image area (>50k px² = content)
        # - Has OCR text (>10 chars = likely labeled diagram)
        # - Caption nearby
        # - Decorative patterns (IPN logo = decorative)
        raise NotImplementedError
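
The heuristics listed above could be combined into a simple point score, as in the sketch below. The 50k px² and 10-character thresholds come from the plan. The point weights, the `has_caption` flag, the cut-off of 4 points, and the assumption that the IPN logo shows up as short OCR text containing "IPN" are illustrative guesses that would need tuning against real pages.

```python
from dataclasses import dataclass


@dataclass
class ExtractedImage:  # hypothetical minimal shape for this sketch
    width: int
    height: int
    ocr_text: str = ""
    has_caption: bool = False


DECORATIVE_MARKERS = ("ipn",)  # assumed: OCR of the IPN logo contains "IPN"


def is_content_image(img: ExtractedImage) -> tuple[bool, float]:
    """Return (is_content, confidence) from simple additive heuristics."""
    ocr = img.ocr_text.strip()
    # Short OCR text that matches a known logo marker: treat as decorative.
    if len(ocr) <= 10 and any(m in ocr.casefold() for m in DECORATIVE_MARKERS):
        return False, 0.9
    points = 0  # out of 10; integer points avoid float rounding surprises
    if img.width * img.height > 50_000:
        points += 4
    if len(ocr) > 10:
        points += 3
    if img.has_caption:
        points += 3
    is_content = points >= 4
    confidence = (points if is_content else 10 - points) / 10
    return is_content, confidence
```

Low confidence values from a scorer like this are natural candidates for the `needs_review` flag on `curriculum_assets`.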

2.3 ImageDescriber

class ImageDescriber:
    def describe(
        self,
        img: "ExtractedImage",
        chapter_context: str,
        language: str,
    ) -> "ImageDescription":
        # Uses GPT-4o vision
        # Returns: ocr_text, semantic_description, alt_text, search_keywords
        # Rate limited: batch descriptions, don't call per-image
        raise NotImplementedError
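
Batching and caching, both mentioned in this plan as cost controls, can be sketched independently of the vision API. In the sketch below, `describe_batch` is a hypothetical callable standing in for the actual GPT-4o request. The batch size and the choice of a SHA-256 cache key over the image bytes are also assumptions.

```python
import hashlib
from typing import Callable, Iterator, Optional, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[list[T]]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def describe_all(
    images: list[bytes],
    describe_batch: Callable[[list[bytes]], list[str]],  # hypothetical API wrapper
    batch_size: int = 8,
    cache: Optional[dict[str, str]] = None,
) -> list[str]:
    """Describe images in batches, calling the API once per unique image."""
    cache = {} if cache is None else cache
    keys = [hashlib.sha256(data).hexdigest() for data in images]
    # Deduplicate by content hash while preserving first-seen order.
    pending: dict[str, bytes] = {}
    for key, data in zip(keys, images):
        if key not in cache and key not in pending:
            pending[key] = data
    for batch in batched(list(pending.items()), batch_size):
        descriptions = describe_batch([data for _, data in batch])
        for (key, _), description in zip(batch, descriptions):
            cache[key] = description
    return [cache[key] for key in keys]
```

Passing a persistent `cache` across runs would avoid re-describing a logo or figure that recurs in several chapters.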

2.4 TableExtractor

class TableExtractor:
    def extract_table(self, table) -> "ExtractedTable":
        # Converts PyMuPDF table to markdown
        # Preserves structure for HTML rendering in frontend
        raise NotImplementedError
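
The markdown conversion itself needs no PDF library once the cells are available as rows of strings. The sketch below assumes the rows have already been pulled out of the PyMuPDF table object as lists of cell text, with `None` for empty cells, and that the first row is the header. The helper name `rows_to_markdown` is hypothetical.

```python
from typing import Optional, Sequence


def rows_to_markdown(rows: Sequence[Sequence[Optional[str]]]) -> str:
    """Render rows as a markdown pipe table; the first row is the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def clean(cell: Optional[str]) -> str:
        text = "" if cell is None else str(cell)
        # Collapse line breaks inside cells and escape pipes so the table stays valid.
        return " ".join(text.split()).replace("|", "\\|")

    def line(row: Sequence[Optional[str]]) -> str:
        cells = [clean(c) for c in row] + [""] * (width - len(row))  # pad ragged rows
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    out = [line(header), "| " + " | ".join(["---"] * width) + " |"]
    out.extend(line(row) for row in body)
    return "\n".join(out)


print(rows_to_markdown([["x", "f(x)"], ["1", None], ["2", "4\n(even)"]]))
```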

2.5 Database Extensions

CREATE TABLE curriculum_assets (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    chapter_split_id UUID REFERENCES chapter_splits(id),
    document_id UUID REFERENCES documents(id),
    asset_type TEXT NOT NULL,          -- "image" | "table" | "drawing"
    content_type TEXT NOT NULL,         -- "decorative" | "content"
    image_bytes BYTEA,
    image_mime_type TEXT,
    image_width INTEGER,
    image_height INTEGER,
    ocr_text TEXT,
    semantic_description TEXT,
    alt_text TEXT,
    search_text TEXT,
    page_number INTEGER NOT NULL,
    bounding_box JSONB,
    linked_chunk_ids UUID[],
    confidence FLOAT DEFAULT 1.0,
    needs_review BOOLEAN DEFAULT FALSE,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

ALTER TABLE chunks ADD COLUMN content_type TEXT DEFAULT 'text';
ALTER TABLE chunks ADD COLUMN is_visual_reference BOOLEAN DEFAULT FALSE;
ALTER TABLE chunks ADD COLUMN linked_asset_ids UUID[];
ALTER TABLE chunks ADD COLUMN page_coordinates JSONB;

2.6 Integration with Chunking

extractor = AssetExtractor()
classifier = ContentClassifier()
describer = ImageDescriber()
table_extractor = TableExtractor()

for chapter_pdf in chapter_pdfs:
    assets = extractor.extract(chapter_pdf)

    # Process images: only content images are described and stored
    for img in assets.images:
        is_content, conf = classifier.is_content_image(img, page_context)
        if is_content:
            desc = describer.describe(img, chapter_context, language)
            # store img, desc, and conf as a curriculum_assets row

    # Process tables → markdown chunks
    for tbl in assets.tables:
        markdown = table_extractor.extract_table(tbl)
        # treat markdown as a text chunk with content_type='table'

    # Process text blocks → normal chunks
    cleaned_blocks = clean_text(assets.text_blocks)
    for block in cleaned_blocks:
        chunk = ChunkingService.chunk_text(block.text, ...)
        chunk.content_type = 'text'
        chunk.linked_asset_ids = find_nearby_assets(block, assets)
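
The plan leaves `find_nearby_assets` undefined. One plausible reading is sketched below: link a text block to assets on the same page whose bounding boxes sit within a small vertical distance of it. The `Box` type, the `max_gap` threshold of 40 units, and the signature taking a dict of asset boxes are assumptions, so it differs from the two-argument call above.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Bounding box on a page, in page coordinates (y grows downward)."""
    page: int
    x0: float
    y0: float
    x1: float
    y1: float


def vertical_gap(a: Box, b: Box) -> float:
    """Vertical distance between two boxes; 0 if they overlap vertically."""
    return max(0.0, max(a.y0, b.y0) - min(a.y1, b.y1))


def find_nearby_assets(
    block: Box, assets: dict[str, Box], max_gap: float = 40.0
) -> list[str]:
    """Return ids of same-page assets within `max_gap`, nearest first."""
    candidates = [
        (vertical_gap(block, box), asset_id)
        for asset_id, box in assets.items()
        if box.page == block.page
    ]
    return [asset_id for gap, asset_id in sorted(candidates) if gap <= max_gap]
```

The same gap computation could fill `linked_chunk_ids` on the asset side, so both directions of the link stay consistent.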

Phase 3: Retrieval Integration

Goal

Update retrieval pipeline to return linked assets with results; add visual query boosting.

Deliverables

  • [ ] Update RetrievalResult to include linked_assets
  • [ ] Visual query detection and boosting
  • [ ] Update RetrievalPipeline to join with curriculum_assets
  • [ ] Frontend rendering contract (what fields the frontend needs)
  • [ ] Unit tests

Tasks

3.1 Retrieval Pipeline Changes

# In retrieval_pipeline.py
def _fetch_chunk_assets(self, chunk_id: str) -> list[dict]:
    """Fetch linked curriculum_assets for a chunk."""
    result = self.supabase.table("curriculum_assets").select("*").contains(
        "linked_chunk_ids", [chunk_id]
    ).execute()
    return result.data or []

# In retrieve():
for chunk in reranked_chunks:
    chunk["text"] = self._fetch_chunk_text(chunk["id"])
    chunk["linked_assets"] = self._fetch_chunk_assets(chunk["id"])

    # Boost visual queries
    if is_visual_query(query) and chunk["linked_assets"]:
        chunk["score"] *= 1.2
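
Multiplying the score inside the loop does not by itself change the order of `reranked_chunks`, so a re-sort is probably needed after boosting. A minimal sketch, assuming chunks are plain dicts with `score` and `linked_assets` keys as above (the helper name `boost_and_resort` is hypothetical):

```python
def boost_and_resort(
    chunks: list[dict], visual: bool, factor: float = 1.2
) -> list[dict]:
    """Boost chunks that have linked assets, then sort by score descending."""
    boosted = []
    for chunk in chunks:
        score = chunk["score"]
        if visual and chunk.get("linked_assets"):
            score *= factor
        boosted.append({**chunk, "score": score})  # copy; do not mutate the input
    return sorted(boosted, key=lambda c: c["score"], reverse=True)


chunks = [
    {"id": "a", "score": 0.80, "linked_assets": []},
    {"id": "b", "score": 0.70, "linked_assets": [{"asset_id": "x"}]},
]
print([c["id"] for c in boost_and_resort(chunks, visual=True)])   # prints ['b', 'a']
print([c["id"] for c in boost_and_resort(chunks, visual=False)])  # prints ['a', 'b']
```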

3.2 Visual Query Detection

VISUAL_KEYWORDS = {
    "en": ["picture", "photo", "graph", "diagram", "figure", "image"],
    "fr": ["image", "photo", "graphique", "diagramme", "figure"],
    "ar": ["صورة", "رسم", "شكل"],
}

def is_visual_query(query: str) -> bool:
    return any(kw in query.lower() for kw in sum(VISUAL_KEYWORDS.values(), []))
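
Plain substring matching also fires on words that merely contain a keyword, for example "graph" inside "paragraph". A possible refinement, sketched below, uses word boundaries for the English and French keywords and keeps substring matching for Arabic, where the article is attached to the noun as a prefix. The optional plural `s` and this split by script are assumptions; Arabic substring matching can still produce false positives.

```python
import re

VISUAL_KEYWORDS = {
    "en": ["picture", "photo", "graph", "diagram", "figure", "image"],
    "fr": ["image", "photo", "graphique", "diagramme", "figure"],
    "ar": ["صورة", "رسم", "شكل"],
}

# Whole-word match (with an optional plural "s") for Latin-script keywords.
_LATIN = sorted({kw for lang in ("en", "fr") for kw in VISUAL_KEYWORDS[lang]})
_LATIN_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, _LATIN)) + r")s?\b", re.IGNORECASE
)


def is_visual_query(query: str) -> bool:
    """Return True if the query appears to ask about visual content."""
    if _LATIN_RE.search(query):
        return True
    # Arabic: substring match so that prefixed forms such as "الصورة" still hit.
    return any(kw in query for kw in VISUAL_KEYWORDS["ar"])
```

With this version, "Summarize this paragraph" is no longer treated as a visual query, while "Show me the diagram of a cell" still is.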

3.3 Frontend Contract

The API response for a retrieval result should include:

{
  "chunk_id": "...",
  "text": "...",
  "content_type": "text",
  "score": 0.85,
  "chapter_title": "The Pythagorean Theorem",
  "lesson_title": "Chapter 11",
  "linked_assets": [
    {
      "asset_id": "...",
      "asset_type": "image",
      "image_url": "/api/assets/{asset_id}/image",
      "alt_text": "Right triangle ABC with sides 3cm, 4cm, 5cm",
      "semantic_description": "A right triangle labeled ABC...",
      "ocr_text": "BC = 5cm, AB = 3cm, AC = 4cm"
    }
  ]
}
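
A small mapper from a `curriculum_assets` row to the `linked_assets` entry above could look like the sketch below. It keeps `image_bytes` out of the JSON and exposes the image through `image_url`. The function name `asset_to_payload` is hypothetical, and the row is assumed to be a plain dict keyed by column name.

```python
def asset_to_payload(row: dict) -> dict:
    """Map a curriculum_assets row to the frontend contract for one asset."""
    asset_id = row["id"]
    return {
        "asset_id": asset_id,
        "asset_type": row["asset_type"],
        # The binary image is served by a separate endpoint, not inlined.
        "image_url": f"/api/assets/{asset_id}/image",
        "alt_text": row.get("alt_text"),
        "semantic_description": row.get("semantic_description"),
        "ocr_text": row.get("ocr_text"),
    }


row = {
    "id": "a1",
    "asset_type": "image",
    "image_bytes": b"\x89PNG...",
    "alt_text": "Right triangle ABC with sides 3cm, 4cm, 5cm",
}
payload = asset_to_payload(row)
print(sorted(payload))
```

Selecting only the needed columns in the query, rather than `*`, would also avoid loading `image_bytes` for every retrieved chunk.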


Phase 4: Review and Refinement

Tasks

  • [ ] Performance testing: ingestion latency with chapter splitting vs full doc
  • [ ] Quality review: manually verify chapter boundaries on 5 curricula
  • [ ] Asset extraction quality: spot-check descriptions on 20 images
  • [ ] Retrieval quality: test chapter-level vs chunk-level queries
  • [ ] Arabic RTL handling: verify text extraction and searchability
  • [ ] Confidence threshold tuning: what score cutoff triggers manual review?
  • [ ] Rate limiting for ImageDescriber: batch processing to control API costs
  • [ ] Documentation: update API docs and internal runbooks

File Structure

app/
  services/
    chapter_splitter.py       # NEW
    asset_extractor.py        # NEW
    image_describer.py        # NEW
    table_extractor.py        # NEW
    ingestion.py              # EXISTING - unchanged
    chunking.py               # EXISTING - unchanged
    retrieval_pipeline.py      # EXISTING - updated in Phase 3

docs/
  adr/
    0001-chapter-aware-ingestion.md   # Architecture decision record
  ingestion/
    chapter-splitting.md       # Chapter splitting spec
    asset-extraction.md        # Asset extraction spec
    implementation-plan.md     # This file

External Dependencies

| Dependency | Purpose | Phase |
| --- | --- | --- |
| PyMuPDF | PDF text/image/table extraction | 1+2 |
| OpenAI GPT-4o | Vision image description | 2 |
| Existing Supabase | Data storage | All |
| Existing IngestionService | Chunking + embedding | 1+2 |

Risks and Mitigations

| Risk | Likelihood | Mitigation |
| --- | --- | --- |
| Arabic RTL page numbers in TOC don't match PDF pages | Medium | Page verification step (±1, ±2 offset) |
| Elementary Arabic books have no chapter structure | High (accepted) | Layer 3 heuristic + full doc fallback; not a priority |
| GPT-4o vision cost for large curricula | Medium | Batch processing; cache descriptions; only describe content images |
| TOC at end of Arabic books (page 95/96) | Low | Extend Layer 1 to scan all pages, not just the first 20 |
| Some curricula have scanned pages (OCR quality) | Low | Detect via text extraction quality; flag for review |

Out of Scope (Future)

  • [ ] Multi-column layout detection
  • [ ] Handwritten content recognition
  • [ ] Audio content (pronunciation guides)
  • [ ] Interactive content (clickable diagrams)
  • [ ] Cross-curriculum retrieval (query "Pythagoras" across math books)
  • [ ] Version diffing when curriculum is updated