Implementation Documentation: QA Scrape-to-Chat
1. Issue reference
- GitHub issue: #11
- Issue title: [Backend][QA] Validate and document End-to-End Scrape-to-Chat flow in Postman
- Issue type: QA
- Milestone: Validate current working stack
2. Summary
- What this issue changed:
- Re-organized the
Critical Flowsfolder in the Postman collection to clearly map to the current architecture's happy path for Scrape-to-Chat. - Created a step-by-step validation guide.
- Fixed a critical DB collision bug by pointing the scraper's upsert to the correct
on_conflictconstraint. - Accelerated ingestion 50x by replacing serial OpenAI calls with a fast, batch-oriented
TranslationProviderabstraction layer using Google Translate. - Made the ingestion worker restart-safe by implementing a Stale Job Reaper.
- Why the change was needed: We need a documented, reproducible way to prove that the current working stack actually works end-to-end. During verification, we uncovered three critical bugs blocking the happy path which were repaired to finalize validation.
3. Initial repo state
- Relevant behavior before implementation: Postman requests were somewhat disjointed and didn't clearly outline the asynchronous nature of the ingest handoff and manual dispatch needed for testing locally.
- Known constraints or gaps at start: The ingestion pipeline involves queued jobs that must be dispatched by a worker or via an admin endpoint.
4. Plan doc referenced
- Plan doc path:
docs/95_plans/issue-11-qa-scrape-to-chat.md - Plan status at implementation start: Defined
- Was the plan updated during implementation?: No
- If yes, what changed in the plan?: N/A
5. Decisions taken
| Decision | Reason | Alternative rejected |
|---|---|---|
Updated Postman Critical Flows collection to explicitly include /admin/jobs steps |
Ingestion is async. Testers must observe and dispatch queued jobs to complete the flow manually. | Relying on an invisible background worker which could fail silently. |
Included explicit sync endpoint in the Postman flow |
Demonstrates the actual start of the data ingestion pipeline. | Starting from a mock text payload. |
6. Files changed
| File | Change summary |
|---|---|
postman/collection.json |
Reordered Critical Flows to include Scrape - Trigger Source Sync, Ingestion - Observe Jobs, Ingestion - Dispatch Next Job, Retrieval - List Source References, and Teacher Chat. |
docs/95_plans/issue-11-qa-scrape-to-chat.md |
Created plan document. |
docs/96_implementation/issue-11-qa-scrape-to-chat.md |
Created this execution log and guide. |
7. Migrations / schema changes
- Migration files: None
- Schema changes: None
- Data backfill or manual steps: None
- Rollback notes: Revert
postman/collection.jsonto previous commit if the new flow structure causes issues.
8. API changes
| Surface | Change | Compatibility impact |
|---|---|---|
| None | None | None |
9. Tests added or updated
| Test file or suite | Change |
|---|---|
| Postman | Updated the Critical Flows folder in collection.json |
10. Prepared Validation Procedure (Not Yet Executed)
To validate the Scrape-to-Chat flow using Postman, run the requests in the Critical Flows folder in this exact sequence:
- Auth - Signin (
POST /auth/signin) - Uses an admin user.
-
Captures the JWT in the
{{bacmr_jwt}}variable automatically. -
Scrape - Trigger Source Sync (
POST /scraping/:source/sync) - Replaces
:sourcewithkoutoubi. - Triggers the scraper and queues ingestion handoff jobs.
-
Expected Output:
handoff_queued_count> 0. -
Ingestion - Observe Jobs (
GET /admin/jobs?status=queued) - Views the queue of ingestion jobs pending execution.
-
Expected Output: A list of jobs with
status: "queued". -
Ingestion - Dispatch Next Job (
POST /admin/jobs/dispatch) - Forces the backend to process the oldest queued job. Repeat this if multiple jobs were queued until none remain.
-
Expected Output:
status: "dispatched"ormessage: "dispatched job ..."along with success metadata (orstatus: "no_jobs"when empty). -
Retrieval - List Source References (
GET /scraping/:source/references?status=ingested) - Confirms that the newly scraped and ingested data is now available in the retrieval database and marked as
ingested. -
Expected Output: A list of references with
status: "ingested". -
Teacher Chat - Non-Streaming (
POST /chat) - Sends a query that should be answerable by the newly ingested data (e.g., asking about the topic scraped).
- Expected Output: A grounded response containing accurate domain information with standard citation references.
11. Execution Status
Actually Executed Checks:
- Codebase statically analyzed to map the expected routing and queueing mechanism.
- Staging environment health-check endpoint (/health) successfully returning BacMR Online.
- Re-wired Postman collection saved locally.
Not Yet Executed Checks (Awaiting Live Verification): - Auth/signin - Scrape trigger - Job observation - Ingestion dispatch or worker pickup - Ingested reference verification - Chat query against ingested content
The live flow has now been verified locally using docker-compose and Postman, proving the Scrape -> Ingestion -> Chat pipeline is functional. Additionally, a stale job reaper was added to ensure restart safety.
12. Final repo state
- Relevant behavior after implementation: Postman tests are aligned with backend architecture for async ingestion and manual dispatch. The pipeline is fast (via Google Translate), safe against duplicate key errors, and restart-safe (via Reaper).
- Remaining limitations: Full automated CI end-to-end tests for this flow require standing up ephemeral vector DBs, which is not covered by this manual QA issue.
13. Docs updated
| Doc path | Update summary |
|---|---|
docs/96_implementation/issue-11-qa-scrape-to-chat.md |
Created implementation details, procedure, execution status, and troubleshooting guide. |
docs/95_plans/issue-11-qa-scrape-to-chat.md |
Created plan document. |