Docs

Architecture and planning documents for the BacMR backend.

Architecture Decision Records (ADRs)

Ingestion Pipeline

Quick Summary

This feature adds two new pre-processing steps before the existing IngestionService:

PDF → ChapterSplitter → Chapter PDFs → AssetExtractor → Text + Images + Tables → IngestionService → Chunks + Vectors
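
The flow above can be sketched as three stages chained together. This is a minimal illustration, not the planned implementation: the class names come from this document, but the method names (`split`, `extract`, `ingest`), the data shapes, and the stand-in logic inside each stage are assumptions, and vector generation is left out.

```python
from dataclasses import dataclass, field


@dataclass
class ChapterPdf:
    """One chapter cut out of the source PDF (hypothetical shape)."""
    title: str
    pages: list[str]  # page text stands in for real PDF pages


@dataclass
class Assets:
    """Extraction output for one chapter (hypothetical shape)."""
    text: str
    images: list[str] = field(default_factory=list)
    tables: list[str] = field(default_factory=list)


class ChapterSplitter:
    def split(self, pages: list[str]) -> list[ChapterPdf]:
        # Stand-in rule: a page starting with "Chapter" opens a new chapter.
        chapters: list[ChapterPdf] = []
        for page in pages:
            if page.startswith("Chapter") or not chapters:
                title = (page.splitlines() or ["Untitled"])[0]
                chapters.append(ChapterPdf(title=title, pages=[]))
            chapters[-1].pages.append(page)
        return chapters


class AssetExtractor:
    def extract(self, chapter: ChapterPdf) -> Assets:
        # Stand-in: real extraction would also pull images and tables.
        return Assets(text="\n".join(chapter.pages))


class IngestionService:
    def ingest(self, assets: Assets, chunk_size: int = 40) -> list[str]:
        # Stand-in: fixed-width character chunks; embedding is omitted.
        text = assets.text
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def run_pipeline(pages: list[str]) -> dict[str, list[str]]:
    """Split into chapters, extract assets per chapter, then chunk each one."""
    splitter, extractor, ingestion = ChapterSplitter(), AssetExtractor(), IngestionService()
    return {
        chapter.title: ingestion.ingest(extractor.extract(chapter))
        for chapter in splitter.split(pages)
    }
```

Keeping each stage behind its own small interface means the two new steps can be tested in isolation and the existing IngestionService only sees per-chapter input.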

What's new

  1. ChapterSplitter (app/services/chapter_splitter.py)
     - Language-specific TOC parsing (fr/en/ar)
     - 4-layer fallback if TOC unavailable
     - Page verification to confirm chapter boundaries
     - Preview mode for human review before ingestion

  2. AssetExtractor (app/services/asset_extractor.py)
     - Extract images, tables, vector drawings from PDFs
     - Classify content vs decorative images
     - Generate semantic descriptions via GPT-4o vision
     - Convert tables to structured markdown

  3. RetrievalPipeline updates (Phase 3)
     - Return linked image assets with text results
     - Visual query boosting when query suggests image intent
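
One way the visual query boosting idea could work is sketched below. Everything here is an assumption made for illustration: the cue-word list, the `Result` shape, the `rerank` function, and the 1.2 multiplier are hypothetical and do not describe the planned RetrievalPipeline design.

```python
from dataclasses import dataclass, field

# Hypothetical cue words; a real list would cover fr/en/ar and be tuned on data.
VISUAL_CUES = {"figure", "diagram", "image", "graph", "schéma", "illustration"}


@dataclass
class Result:
    chunk_id: str
    score: float
    image_ids: list[str] = field(default_factory=list)  # linked image assets


def has_visual_intent(query: str) -> bool:
    """Return True when any word of the query matches a visual cue word."""
    words = {w.strip(".,?!:;").lower() for w in query.split()}
    return not words.isdisjoint(VISUAL_CUES)


def rerank(query: str, results: list[Result], boost: float = 1.2) -> list[Result]:
    """Scale up image-linked results when the query looks visual, then sort."""
    if has_visual_intent(query):
        results = [
            Result(r.chunk_id, r.score * boost if r.image_ids else r.score, r.image_ids)
            for r in results
        ]
    return sorted(results, key=lambda r: r.score, reverse=True)
```

With this sketch, a text chunk that carries a linked image can overtake a slightly higher-scoring plain chunk only when the query mentions a cue word; for other queries the original ranking is preserved.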

Status

Current branch: feature/chapter-splitting-and-asset-extraction
PR: Open for review
Phase: Planning complete, awaiting approval to implement