Implementation Plan: Chapter Splitting + Asset Extraction
Goal: Incrementally implement chapter-aware ingestion with visual content extraction for BacMR curricula.
Overview
The implementation is split into 4 phases, each building on the previous:
| Phase | Focus | Duration | Risk |
|---|---|---|---|
| Phase 1 | Chapter splitting only | ~1 week | Low |
| Phase 2 | Asset extraction (images + tables) | ~1 week | Medium |
| Phase 3 | Retrieval integration + visual query boosting | ~3 days | Low |
| Phase 4 | Review + refinement | Ongoing | Low |
Phase 1: Chapter Splitting
Goal
Split each curriculum PDF into chapter/lesson units before ingestion, with full fallback layers.
Deliverables
- [ ] app/services/chapter_splitter.py: ChapterSplitter class
- [ ] chapter_splits database table
- [ ] Language pattern registry (fr, en, ar)
- [ ] 4-layer fallback system (TOC → Header → Heuristic → Full doc)
- [ ] ChapterSplitter.split(): main entry point
- [ ] ChapterSplitter.preview(): dry-run for human review
- [ ] ChapterSplitter.ingest(): split + create DB records + trigger ingestion
- [ ] Page verification step
- [ ] Noise phrase filtering
- [ ] Unit tests for all 3 language patterns
- [ ] Integration test with French Math book
- [ ] Integration test with English book
- [ ] Integration test with Arabic History book
Tasks
1.1 Database Migration
CREATE TABLE chapter_splits (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
reference_id UUID NOT NULL REFERENCES "references"(id), -- "references" is a reserved word in PostgreSQL, so the table name must be quoted
chapter_title TEXT NOT NULL,
chapter_number TEXT,
chapter_type TEXT DEFAULT 'chapter',
unit_title TEXT,
unit_number TEXT,
language TEXT NOT NULL,
page_start INTEGER NOT NULL,
page_end INTEGER NOT NULL,
parsing_method TEXT NOT NULL,
confidence FLOAT NOT NULL DEFAULT 1.0,
needs_review BOOLEAN DEFAULT FALSE,
ingestion_job_id UUID REFERENCES ingestion_jobs(id),
ingestion_status TEXT DEFAULT 'pending',
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
ALTER TABLE "references" ADD COLUMN split_status TEXT DEFAULT 'not_applicable';
ALTER TABLE "references" ADD COLUMN total_chapters INTEGER;
1.2 ChapterSplitter Core
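from uuid import UUID


class ChapterSplitter:
    def split(
        self,
        pdf_bytes: bytes,
        language_hint: str | None = None,
        expected_chapter_count: int | None = None,
    ) -> "ChapterSplitResult":
        """Main entry point: split the PDF into chapter/lesson units."""
        ...

    def preview(
        self,
        pdf_bytes: bytes,
        language_hint: str | None = None,
    ) -> "ChapterSplitResult":
        """Dry run for human review."""
        ...

    def ingest(
        self,
        pdf_bytes: bytes,
        reference_id: UUID,
        language_hint: str | None = None,
    ) -> list[dict]:
        """Split, create DB records, and trigger ingestion."""
        ...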
1.3 Language Pattern Registry
PATTERNS = {
"fr": TOCPattern(...), # CHAPITRE, Table des matières
"en": TOCPattern(...), # UNIT, Lesson, Table of contents
"ar": TOCPattern(...), # الدرس, الوحدة, الفهرس
}
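As a rough illustration of what a registry entry might contain, the sketch below assumes TOCPattern is a small dataclass holding compiled regexes. The field names (chapter_header, toc_title), the helper match_chapter_number, and the exact expressions are assumptions; only the keywords come from the comments above. Arabic headings that spell the number as a word (for example "الدرس الأول") would need extra handling that is not shown here.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TOCPattern:
    """Hypothetical shape of a registry entry; field names are assumptions."""
    chapter_header: re.Pattern  # matches a chapter/lesson heading line
    toc_title: re.Pattern       # matches the heading of the TOC page


PATTERNS = {
    "fr": TOCPattern(
        chapter_header=re.compile(r"^\s*CHAPITRE\s+(\d+)", re.IGNORECASE),
        toc_title=re.compile(r"table\s+des\s+mati[èe]res", re.IGNORECASE),
    ),
    "en": TOCPattern(
        chapter_header=re.compile(r"^\s*(?:UNIT|LESSON)\s+(\d+)", re.IGNORECASE),
        toc_title=re.compile(r"table\s+of\s+contents", re.IGNORECASE),
    ),
    "ar": TOCPattern(
        # Accepts Western (0-9) and Arabic-Indic (٠-٩) digits.
        chapter_header=re.compile(r"^\s*(?:الدرس|الوحدة)\s+([0-9٠-٩]+)"),
        toc_title=re.compile(r"الفهرس"),
    ),
}


def match_chapter_number(line: str, language: str) -> str | None:
    """Return the chapter number captured from a heading line, if any."""
    m = PATTERNS[language].chapter_header.match(line)
    return m.group(1) if m else None
```

Keeping the patterns in data rather than in code means a new language can be added without touching the splitter logic.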
1.4 Layer Implementations
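Layer 1: parse_toc_from_pages() scans for a TOC, starting with the first 20 pages and then the remaining pages, since some Arabic books place the TOC at the end
Layer 2: scan_headers_all_pages() runs a regex scan for chapter headers on every page
Layer 3: heuristic_split() divides the document into equal page ranges
Layer 4: full_document_fallback() treats the whole document as a single chapter and flags needs_review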
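The layer ordering above could be wired together roughly as follows. This is a minimal sketch, not the planned implementation: the Chapter dataclass, the run_layers helper, and the idea that each layer returns None or an empty list when it finds nothing are all assumptions. Layers 1 and 2 are stubbed out because they depend on PDF parsing.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Chapter:
    title: str
    page_start: int  # 1-based, inclusive
    page_end: int    # 1-based, inclusive


# A layer returns chapters, or None / [] when it cannot find a structure.
Layer = Callable[[], Optional[list]]


def heuristic_split(total_pages: int, parts: int) -> list[Chapter]:
    """Layer 3: divide the page range into `parts` near-equal ranges."""
    parts = max(1, min(parts, total_pages))
    base, extra = divmod(total_pages, parts)
    chapters, start = [], 1
    for i in range(parts):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first ranges
        chapters.append(Chapter(f"Part {i + 1}", start, start + size - 1))
        start += size
    return chapters


def full_document_fallback(total_pages: int) -> list[Chapter]:
    """Layer 4: the whole document as a single chapter."""
    return [Chapter("Full document", 1, total_pages)]


def run_layers(layers: list[tuple[str, Layer]]) -> tuple[str, list[Chapter]]:
    """Try each layer in order and return the first non-empty result."""
    for name, layer in layers:
        chapters = layer()
        if chapters:
            return name, chapters
    raise ValueError("no layer produced a result")


method, chapters = run_layers([
    ("toc", lambda: None),       # stand-in: no TOC found
    ("header", lambda: []),      # stand-in: no headers found
    ("heuristic", lambda: heuristic_split(10, 3)),
    ("full_document", lambda: full_document_fallback(10)),
])
```

The returned layer name maps naturally onto the parsing_method column, so a reviewer can see which fallback produced each split.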
1.5 Testing Strategy
- Use the 4 sample PDFs we already have:
  - fondamentals/IMR-1AF-M.pdf (Arabic, primary; no TOC)
  - secondaire1s/ANG-1AS-M.pdf (English, 17 units)
  - secondaire2s/MA-4AS-M.pdf (French Math, 18 chapters)
  - secondaire2s/HIS-4AS-M.pdf (Arabic History, 13 lessons; best test case)
Phase 2: Asset Extraction
Goal
Extract images, tables, and diagrams from each chapter PDF; classify content vs decorative; generate semantic descriptions for content images.
Deliverables
- [ ] app/services/asset_extractor.py: AssetExtractor class
- [ ] app/services/image_describer.py: ImageDescriber using GPT-4o vision
- [ ] app/services/table_extractor.py: TableExtractor
- [ ] curriculum_assets database table
- [ ] chunks table extensions (content_type, linked_asset_ids)
- [ ] ContentClassifier with heuristic rules
- [ ] Integration with existing chunking pipeline
- [ ] Unit tests for extraction
- [ ] Unit tests for content classification
Tasks
2.1 AssetExtractor
class AssetExtractor:
    def extract(self, pdf_bytes: bytes) -> "AssetExtractionResult":
        # Uses PyMuPDF:
        # - page.get_images() for image extraction
        # - page.find_tables() for table detection
        # - page.get_text("dict") for text blocks with positions
        # - page.get_drawings() for vector graphics
        ...
2.2 ContentClassifier
class ContentClassifier:
    def is_content_image(
        self,
        img: "ExtractedImage",
        page_context: str,
    ) -> tuple[bool, float]:
        # Returns (is_content, confidence).
        # Heuristics:
        # - Image area (>50k px² = content)
        # - Has OCR text (>10 chars = likely labeled diagram)
        # - Caption nearby
        # - Decorative patterns (IPN logo = decorative)
        ...
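One way the heuristics listed above might combine is a simple vote, sketched below. The ExtractedImage fields, the voting scheme, and the fixed 0.9 confidence for decorative matches are assumptions made for illustration; only the two thresholds (50k px², 10 characters) come from the plan. The page_context argument is omitted to keep the example self-contained.

```python
from dataclasses import dataclass


@dataclass
class ExtractedImage:
    """Hypothetical fields; the plan does not define this type."""
    width: int
    height: int
    ocr_text: str = ""
    has_caption_nearby: bool = False
    matches_decorative_pattern: bool = False  # e.g. a known logo


MIN_CONTENT_AREA_PX = 50_000  # threshold from the plan
MIN_OCR_CHARS = 10            # threshold from the plan


def is_content_image(img: ExtractedImage) -> tuple[bool, float]:
    """Return (is_content, confidence) from three yes/no signals."""
    if img.matches_decorative_pattern:
        return False, 0.9  # assumed confidence for a known decorative pattern

    votes = [
        img.width * img.height > MIN_CONTENT_AREA_PX,
        len(img.ocr_text.strip()) > MIN_OCR_CHARS,
        img.has_caption_nearby,
    ]
    hits = sum(votes)
    is_content = hits >= 1
    # Confidence is the share of signals that agree with the decision.
    agreeing = hits if is_content else len(votes) - hits
    return is_content, agreeing / len(votes)
```

A low confidence here (a single signal firing) is a natural candidate for setting needs_review on the stored asset.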
2.3 ImageDescriber
class ImageDescriber:
    def describe(
        self,
        img: "ExtractedImage",
        chapter_context: str,
        language: str,
    ) -> "ImageDescription":
        # Uses GPT-4o vision
        # Returns: ocr_text, semantic_description, alt_text, search_keywords
        # Rate limited: batch descriptions, don't call per-image
        ...
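The batching note above, together with the idea of caching descriptions, could look something like this. It is a sketch under assumptions: DescriptionCache and describe_batch are hypothetical names, the cache is in-memory only, and the actual vision call is replaced by whatever callable is passed in.

```python
import hashlib
from typing import Callable


class DescriptionCache:
    """In-memory cache keyed by image content hash (hypothetical helper)."""

    def __init__(self) -> None:
        self._store: dict = {}

    @staticmethod
    def key(image_bytes: bytes) -> str:
        return hashlib.sha256(image_bytes).hexdigest()

    def describe_all(
        self,
        images: list,
        describe_batch: Callable[[list], list],
        batch_size: int = 8,
    ) -> list:
        """Describe images in batches, skipping any already in the cache."""
        # Collect unseen images; the dict also removes duplicates within this call.
        pending = {}
        for img in images:
            k = self.key(img)
            if k not in self._store:
                pending[k] = img

        keys = list(pending)
        for i in range(0, len(keys), batch_size):
            batch_keys = keys[i:i + batch_size]
            descriptions = describe_batch([pending[k] for k in batch_keys])
            self._store.update(zip(batch_keys, descriptions))

        # Results come back in the same order as the input images.
        return [self._store[self.key(img)] for img in images]
```

Hashing the bytes means a logo or diagram repeated across chapters is described once, which matters when each description is a paid API call.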
2.4 TableExtractor
class TableExtractor:
    def extract_table(self, table) -> "ExtractedTable":
        # Converts PyMuPDF table to markdown
        # Preserves structure for HTML rendering in frontend
        ...
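The markdown conversion itself can be small. The sketch below assumes the table has already been reduced to rows of cell strings, with None for empty cells and the first row acting as the header; rows_to_markdown is a hypothetical helper, not part of the planned TableExtractor interface.

```python
from __future__ import annotations


def rows_to_markdown(rows: list[list[str | None]]) -> str:
    """Render rows of cells as a pipe table; the first row is the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def cell(value: str | None) -> str:
        text = "" if value is None else str(value)
        # Escape pipes and flatten line breaks so a cell cannot break the table.
        return text.replace("|", "\\|").replace("\n", " ").strip()

    def line(row: list[str | None]) -> str:
        padded = list(row) + [None] * (width - len(row))  # pad ragged rows
        return "| " + " | ".join(cell(c) for c in padded) + " |"

    header, *body = rows
    separator = "|" + "|".join(["---"] * width) + "|"
    return "\n".join([line(header), separator, *(line(r) for r in body)])
```

Escaping the pipe character matters for math textbooks, where cell contents such as absolute values can contain it.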
2.5 Database Extensions
CREATE TABLE curriculum_assets (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
chapter_split_id UUID REFERENCES chapter_splits(id),
document_id UUID REFERENCES documents(id),
asset_type TEXT NOT NULL, -- "image" | "table" | "drawing"
content_type TEXT NOT NULL, -- "decorative" | "content"
image_bytes BYTEA,
image_mime_type TEXT,
image_width INTEGER,
image_height INTEGER,
ocr_text TEXT,
semantic_description TEXT,
alt_text TEXT,
search_text TEXT,
page_number INTEGER NOT NULL,
bounding_box JSONB,
linked_chunk_ids UUID[],
confidence FLOAT DEFAULT 1.0,
needs_review BOOLEAN DEFAULT FALSE,
created_at TIMESTAMPTZ DEFAULT NOW()
);
ALTER TABLE chunks ADD COLUMN content_type TEXT DEFAULT 'text';
ALTER TABLE chunks ADD COLUMN is_visual_reference BOOLEAN DEFAULT FALSE;
ALTER TABLE chunks ADD COLUMN linked_asset_ids UUID[];
ALTER TABLE chunks ADD COLUMN page_coordinates JSONB;
2.6 Integration with Chunking
# For each chapter PDF:
assets = AssetExtractor.extract(chapter_pdf)

# Process images
for img in assets.images:
    is_content, conf = ContentClassifier.is_content_image(img, page_context)
    if is_content:
        desc = ImageDescriber.describe(img, chapter_context, language)
    # store in curriculum_assets

# Process tables → markdown chunks
for tbl in assets.tables:
    markdown = TableExtractor.extract_table(tbl)
    # treat as text chunk with content_type='table'

# Process text blocks → normal chunks
cleaned_blocks = clean_text(assets.text_blocks)
for block in cleaned_blocks:
    chunk = ChunkingService.chunk_text(block.text, ...)
    chunk.content_type = 'text'
    chunk.linked_asset_ids = find_nearby_assets(block, assets)
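The find_nearby_assets step is not specified in the plan. One plausible interpretation, sketched here, links a text block to assets on the same page whose bounding boxes are within a small vertical distance. The Box type, the dict-of-boxes argument, and the 40-point default gap are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical bounding box in page coordinates."""
    page: int
    x0: float
    y0: float
    x1: float
    y1: float


def vertical_gap(a: Box, b: Box) -> float:
    """Vertical distance between two boxes; 0 when they overlap vertically."""
    return max(0.0, max(a.y0, b.y0) - min(a.y1, b.y1))


def find_nearby_assets(block: Box, assets: dict, max_gap: float = 40.0) -> list:
    """Return ids of assets on the same page within `max_gap` of the block."""
    return [
        asset_id
        for asset_id, box in assets.items()
        if box.page == block.page and vertical_gap(block, box) <= max_gap
    ]
```

A purely vertical measure is a simplification that would likely misfire on multi-column pages, which the plan lists as out of scope.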
Phase 3: Retrieval Integration
Goal
Update retrieval pipeline to return linked assets with results; add visual query boosting.
Deliverables
- [ ] Update RetrievalResult to include linked_assets
- [ ] Visual query detection and boosting
- [ ] Update RetrievalPipeline to join with curriculum_assets
- [ ] Frontend rendering contract (what fields the frontend needs)
- [ ] Unit tests
Tasks
3.1 Retrieval Pipeline Changes
# In retrieval_pipeline.py
def _fetch_chunk_assets(self, chunk_id: str) -> list[dict]:
    """Fetch linked curriculum_assets for a chunk."""
    result = self.supabase.table("curriculum_assets").select("*").contains(
        "linked_chunk_ids", [chunk_id]
    ).execute()
    return result.data or []

# In retrieve():
visual = is_visual_query(query)  # evaluate once, not once per chunk
for chunk in reranked_chunks:
    chunk["text"] = self._fetch_chunk_text(chunk["id"])
    chunk["linked_assets"] = self._fetch_chunk_assets(chunk["id"])
    # Boost visual queries
    if visual and chunk["linked_assets"]:
        chunk["score"] *= 1.2
# Re-sort so the boost can change the final ranking
reranked_chunks.sort(key=lambda c: c["score"], reverse=True)
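The loop above issues one asset query per chunk. If that turns out to be slow, one option is to fetch the candidate asset rows in a single query and group them in memory. The sketch below shows only the grouping step and assumes the rows have already been fetched as dicts with a linked_chunk_ids list; group_assets_by_chunk is a hypothetical helper.

```python
from collections import defaultdict


def group_assets_by_chunk(asset_rows: list, chunk_ids: list) -> dict:
    """Map each requested chunk id to the asset rows that link to it."""
    wanted = set(chunk_ids)
    grouped = defaultdict(list)
    for row in asset_rows:
        # linked_chunk_ids may be NULL in the database, hence the `or []`.
        for chunk_id in row.get("linked_chunk_ids") or []:
            if chunk_id in wanted:
                grouped[chunk_id].append(row)
    # Every requested chunk gets an entry, even when it has no assets.
    return {chunk_id: grouped.get(chunk_id, []) for chunk_id in chunk_ids}
```

Because an asset can link to several chunks, the same row may appear under more than one chunk id.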
3.2 Visual Query Detection
VISUAL_KEYWORDS = {
    "en": ["picture", "photo", "graph", "diagram", "figure", "image"],
    "fr": ["image", "photo", "graphique", "diagramme", "figure"],
    "ar": ["صورة", "رسم", "شكل"],
}

def is_visual_query(query: str) -> bool:
    return any(kw in query.lower() for kw in sum(VISUAL_KEYWORDS.values(), []))
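Plain substring matching will also fire on words that merely contain a keyword, such as "paragraph" (contains "graph") or "configure" (contains "figure"). A possible refinement, offered as a sketch rather than the planned behavior, matches Latin-script keywords on word boundaries while keeping substring matching for Arabic, where prefixes such as the article "ال" attach directly to the word.

```python
import re

VISUAL_KEYWORDS = {
    "en": ["picture", "photo", "graph", "diagram", "figure", "image"],
    "fr": ["image", "photo", "graphique", "diagramme", "figure"],
    "ar": ["صورة", "رسم", "شكل"],
}

# Latin-script keywords: whole words only, with an optional plural "s".
_LATIN = sorted({kw for lang in ("en", "fr") for kw in VISUAL_KEYWORDS[lang]})
_LATIN_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(kw) for kw in _LATIN) + r")s?\b",
    re.IGNORECASE,
)

# Arabic keywords: substring match, so "الصورة" still matches "صورة".
_ARABIC = VISUAL_KEYWORDS["ar"]


def is_visual_query(query: str) -> bool:
    """Return True when the query appears to ask for visual content."""
    return bool(_LATIN_RE.search(query)) or any(kw in query for kw in _ARABIC)
```

The Arabic branch can still produce false positives, since "شكل" also occurs inside unrelated words such as "مشكلة"; handling that properly would probably need a tokenizer or stemmer.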
3.3 Frontend Contract
The API response for a retrieval result should include:
{
  "chunk_id": "...",
  "text": "...",
  "content_type": "text",
  "score": 0.85,
  "chapter_title": "The Pythagorean Theorem",
  "lesson_title": "Chapter 11",
  "linked_assets": [
    {
      "asset_id": "...",
      "asset_type": "image",
      "image_url": "/api/assets/{asset_id}/image",
      "alt_text": "Right triangle ABC with sides 3cm, 4cm, 5cm",
      "semantic_description": "A right triangle labeled ABC...",
      "ocr_text": "BC = 5cm, AB = 3cm, AC = 4cm"
    }
  ]
}
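To produce the linked_assets entries above from curriculum_assets rows, a small mapping function may be enough. The sketch below assumes rows arrive as dicts keyed by column name; to_linked_asset is a hypothetical helper. It copies only the fields in the contract, so raw image_bytes are never serialized into the retrieval response and the frontend loads the image through image_url instead.

```python
def to_linked_asset(row: dict) -> dict:
    """Map a curriculum_assets row to the frontend contract shape."""
    asset_id = row["id"]
    return {
        "asset_id": asset_id,
        "asset_type": row["asset_type"],
        "image_url": f"/api/assets/{asset_id}/image",
        "alt_text": row.get("alt_text"),
        "semantic_description": row.get("semantic_description"),
        "ocr_text": row.get("ocr_text"),
    }
```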
Phase 4: Review and Refinement
Tasks
- [ ] Performance testing: ingestion latency with chapter splitting vs full doc
- [ ] Quality review: manually verify chapter boundaries on 5 curricula
- [ ] Asset extraction quality: spot-check descriptions on 20 images
- [ ] Retrieval quality: test chapter-level vs chunk-level queries
- [ ] Arabic RTL handling: verify text extraction and searchability
- [ ] Confidence threshold tuning: what score cutoff triggers manual review?
- [ ] Rate limiting for ImageDescriber: batch processing to control API costs
- [ ] Documentation: update API docs and internal runbooks
File Structure
app/
  services/
    chapter_splitter.py     # NEW
    asset_extractor.py      # NEW
    image_describer.py      # NEW
    table_extractor.py      # NEW
    ingestion.py            # EXISTING - unchanged
    chunking.py             # EXISTING - unchanged
    retrieval_pipeline.py   # EXISTING - updated in Phase 3
docs/
  adr/
    0001-chapter-aware-ingestion.md   # Architecture decision record
  ingestion/
    chapter-splitting.md      # Chapter splitting spec
    asset-extraction.md       # Asset extraction spec
    implementation-plan.md    # This file
External Dependencies
| Dependency | Purpose | Phase |
|---|---|---|
| PyMuPDF | PDF text/image/table extraction | 1+2 |
| OpenAI GPT-4o | Vision image description | 2 |
| Existing Supabase | Data storage | All |
| Existing IngestionService | Chunking + embedding | 1+2 |
Risks and Mitigations
| Risk | Likelihood | Mitigation |
|---|---|---|
| Arabic RTL page numbers in TOC don't match PDF pages | Medium | Page verification step (±1, ±2 offset) |
| Elementary Arabic books have no chapter structure | High (accepted) | Layer 3 heuristic + full doc fallback; not priority |
| GPT-4o vision cost for large curricula | Medium | Batch processing; cache descriptions; only describe content images |
| TOC at end of Arabic books (page 95/96) | Low | Layer 1 scans all pages, not just first 20 |
| Some curricula have scanned pages (OCR quality) | Low | Detect via text extraction quality; flag for review |
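The page verification mitigation in the table above (±1, ±2 offset) could be sketched as follows. The assumption here is that verification means checking whether the chapter title from the TOC actually appears on the claimed page or on a nearby one; verify_page and its arguments are hypothetical names, and page text is passed in as plain strings to keep the example independent of any PDF library.

```python
from __future__ import annotations


def _normalize(text: str) -> str:
    """Collapse whitespace and ignore case for a tolerant comparison."""
    return " ".join(text.split()).casefold()


def verify_page(
    toc_page: int,
    title: str,
    page_texts: list[str],
    offsets: tuple[int, ...] = (0, 1, -1, 2, -2),
) -> int | None:
    """Return the 1-based page where `title` is found, or None.

    `page_texts[0]` holds the text of PDF page 1. Offsets are tried in
    order, so the page claimed by the TOC wins when it matches.
    """
    needle = _normalize(title)
    for offset in offsets:
        page = toc_page + offset
        if 1 <= page <= len(page_texts):
            if needle in _normalize(page_texts[page - 1]):
                return page
    return None
```

A None result is a reasonable trigger for lowering confidence or setting needs_review on the affected chapter.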
Out of Scope (Future)
- [ ] Multi-column layout detection
- [ ] Handwritten content recognition
- [ ] Audio content (pronunciation guides)
- [ ] Interactive content (clickable diagrams)
- [ ] Cross-curriculum retrieval (query "Pythagoras" across math books)
- [ ] Version diffing when curriculum is updated