Docs
Architecture and planning documents for the BacMR backend.
Architecture Decision Records (ADRs)
- ADR-0001: Chapter-Aware Ingestion — Split curricula into chapter/lesson units before ingestion; extract visual assets
Ingestion Pipeline
- Chapter Splitting Architecture — How we split curriculum PDFs into chapter/lesson units
- Asset Extraction Architecture — How we extract images, tables, and diagrams
- Implementation Plan — Phased build plan with tasks and timeline
Quick Summary
This feature adds two new pre-processing steps before the existing IngestionService:
PDF → ChapterSplitter → Chapter PDFs → AssetExtractor → Text + Images + Tables → IngestionService → Chunks + Vectors
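The flow above can be sketched as a plain composition of three stages. This is a minimal illustration, not the project's implementation: `Chapter`, `ExtractedChapter`, `run_pipeline`, and the `toy_*` stand-ins are hypothetical names, and a list of page strings stands in for a real PDF.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class Chapter:
    """Hypothetical stand-in for one chapter PDF (page text only)."""
    title: str
    pages: List[str]


@dataclass
class ExtractedChapter:
    """Hypothetical output of the asset extraction step."""
    title: str
    text: str
    images: List[str] = field(default_factory=list)
    tables: List[str] = field(default_factory=list)


def run_pipeline(
    document: List[str],
    split: Callable[[List[str]], Iterable[Chapter]],
    extract: Callable[[Chapter], ExtractedChapter],
    ingest: Callable[[ExtractedChapter], int],
) -> int:
    """Run split -> extract -> ingest in order; return the total chunk count."""
    total_chunks = 0
    for chapter in split(document):
        total_chunks += ingest(extract(chapter))
    return total_chunks


# Toy stages, assumed for illustration only.
def toy_split(pages: List[str]) -> List[Chapter]:
    # Pretend every two pages form one chapter.
    return [
        Chapter(f"Chapter {i // 2 + 1}", pages[i:i + 2])
        for i in range(0, len(pages), 2)
    ]


def toy_extract(chapter: Chapter) -> ExtractedChapter:
    # Join the page text; no images or tables in this toy version.
    return ExtractedChapter(chapter.title, " ".join(chapter.pages))


def toy_ingest(extracted: ExtractedChapter) -> int:
    # Pretend each word becomes one chunk.
    return len(extracted.text.split())


total = run_pipeline(["a b", "c", "d e f"], toy_split, toy_extract, toy_ingest)
```

Passing the stages in as callables keeps each step independently testable and leaves the existing ingestion step untouched, which matches the "pre-processing before IngestionService" framing.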
What's new
- ChapterSplitter (app/services/chapter_splitter.py)
  - Language-specific TOC parsing (fr/en/ar)
  - 4-layer fallback if TOC unavailable
  - Page verification to confirm chapter boundaries
  - Preview mode for human review before ingestion
- AssetExtractor (app/services/asset_extractor.py)
  - Extract images, tables, vector drawings from PDFs
  - Classify content vs decorative images
  - Generate semantic descriptions via GPT-4o vision
  - Convert tables to structured markdown
- RetrievalPipeline updates (Phase 3)
  - Return linked image assets with text results
  - Visual query boosting when query suggests image intent
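As a rough illustration of visual query boosting, the sketch below raises the score of results that carry a linked image when the query contains a visual keyword. Everything here is an assumption made for the example: the `Result` type, the `VISUAL_HINTS` keyword set, the `VISUAL_BOOST` factor, and the keyword-matching heuristic itself. The planned RetrievalPipeline may detect image intent differently.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical keyword list and boost factor, chosen only for this example.
VISUAL_HINTS = {"diagram", "figure", "image", "picture", "graph", "illustration"}
VISUAL_BOOST = 1.25


@dataclass
class Result:
    """Hypothetical retrieval result: a text chunk with an optional linked image."""
    chunk_id: str
    score: float
    has_linked_image: bool


def suggests_image_intent(query: str) -> bool:
    """Return True if any whole word in the query is a visual hint."""
    words = set(re.findall(r"\w+", query.lower()))
    return bool(words & VISUAL_HINTS)


def rerank(query: str, results: List[Result]) -> List[Result]:
    """Boost image-linked results for visual queries, then sort by score."""
    boost = VISUAL_BOOST if suggests_image_intent(query) else 1.0
    rescored = [
        Result(
            r.chunk_id,
            r.score * (boost if r.has_linked_image else 1.0),
            r.has_linked_image,
        )
        for r in results
    ]
    return sorted(rescored, key=lambda r: r.score, reverse=True)


results = [Result("text-only", 0.80, False), Result("with-image", 0.70, True)]
visual_order = [r.chunk_id for r in rerank("show me the diagram of a cell", results)]
plain_order = [r.chunk_id for r in rerank("define osmosis", results)]
```

Matching on whole words rather than substrings avoids false positives such as "graph" inside "paragraph". A multiplicative boost leaves non-visual queries completely unaffected, since the factor is 1.0 for them.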
Status
Current branch: feature/chapter-splitting-and-asset-extraction
PR: Open for review
Phase: Planning complete, awaiting approval to implement