Promptfoo RAG Evaluation Plan (Backend)
Goal
Introduce production-facing regression tests for the BacMR backend RAG workflow using Promptfoo.
What is covered
- Retrieval relevance by subject/grade/namespace
- Multilingual request/response behavior (Arabic/French)
- Safety and prompt-injection resistance
- Clarification and fallback behavior in edge cases
Files
promptfoo/promptfooconfig.yaml(deterministic suite)promptfoo/promptfooconfig.model.yaml(model-graded suite)promptfoo/tests/*.yaml(test cases)promptfoo/transforms/response.js(response extraction).github/workflows/promptfoo-rag-eval.yml(CI workflow)
Operating model
- Run deterministic suite on every promptfoo-related PR.
- Run the workflow manually via
workflow_dispatchwhen needed. - Run post-merge smoke checks on
mainfor promptfoo/workflow path changes. - Run model-graded suite when
OPENAI_API_KEYis available. - Expand datasets as curriculum coverage grows.
CI observability outputs
The workflow now emits explicit START <step> / END <step> status=<pass|fail> markers for each major stage (including checkout/upload action boundaries) and publishes artifacts for review:
promptfoo-rag-eval-artifacts-<run_id>artifact bundle- Deterministic suite outputs (
json,html,md) underpromptfoo-artifacts/deterministic/ - Model-graded suite outputs (
json,html,md) underpromptfoo-artifacts/model/(when executed) - Compact machine-readable run summary at
promptfoo-artifacts/run-summary.txt - Artifact index (
path|bytes) atpromptfoo-artifacts/artifact-index.txt - Human-readable gate table in
GITHUB_STEP_SUMMARY
Preflight validation now hard-fails early when required runtime config is missing and writes missing keys into run-summary.txt for rapid triage.
Deterministic suite remains the merge gate. Model-graded suite is informational/non-blocking.
Next improvements
- Add golden dataset tied to real curriculum chunks.
- Add retrieval-score drift alerting across releases.
- Add per-grade pass-rate dashboards.