Promptfoo RAG Evaluation Plan (Backend)

Goal

Introduce production-facing regression tests for the BacMR backend RAG workflow using Promptfoo.

What is covered

Retrieval relevance by subject/grade/namespace
Multilingual request/response behavior (Arabic/French)
Safety and prompt-injection resistance
Clarification and fallback behavior in edge cases

Files

promptfoo/promptfooconfig.yaml (deterministic suite)
promptfoo/promptfooconfig.model.yaml (model-graded suite)
promptfoo/tests/*.yaml (test cases)
promptfoo/transforms/response.js (response extraction)
.github/workflows/promptfoo-rag-eval.yml (CI workflow)
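
As an illustration of what the response extraction step might look like, here is a minimal sketch of a transform function. The payload shape (an `answer` string plus a `sources` array whose items carry a `namespace`) is an assumption made for this example, not the confirmed BacMR response format, and the name `extractResponse` is hypothetical. Whether the suite should receive a string or an object depends on the assertions in use.

```javascript
// Sketch of a response-extraction transform.
// ASSUMPTION: the backend returns { answer: string, sources: [{ namespace }] }.
// Adjust the field names to the real payload.
function extractResponse(json, text) {
  // Fall back to the raw body when the payload is not a JSON object.
  if (json === null || typeof json !== "object") {
    return { answer: typeof text === "string" ? text : "", namespaces: [] };
  }
  const answer = typeof json.answer === "string" ? json.answer : "";
  const sources = Array.isArray(json.sources) ? json.sources : [];
  // Keep only well-formed namespaces so retrieval assertions see clean data.
  const namespaces = sources
    .filter((s) => s && typeof s.namespace === "string")
    .map((s) => s.namespace);
  return { answer, namespaces };
}

// Export for file-based loading when running under CommonJS.
if (typeof module !== "undefined") {
  module.exports = extractResponse;
}
```

Returning the namespaces alongside the answer lets a single test case check both the generated text and where the retrieved chunks came from.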

Operating model

Run deterministic suite on every promptfoo-related PR.
Run the workflow manually via workflow_dispatch when needed.
Run post-merge smoke checks on main when promptfoo or workflow paths change.
Run model-graded suite when OPENAI_API_KEY is available.
Expand datasets as curriculum coverage grows.
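
The suite-selection rule above can be sketched as a small function. This is an illustration under stated assumptions, not the workflow's actual logic: the function name `planSuites` and the `blocking` field are hypothetical, while the config paths and the `OPENAI_API_KEY` condition come from this plan.

```javascript
// Sketch: decide which suites to run from the environment.
// ASSUMPTION: planSuites and the "blocking" flag are illustrative names only.
function planSuites(env) {
  // The deterministic suite always runs.
  const suites = [
    { name: "deterministic", config: "promptfoo/promptfooconfig.yaml", blocking: true },
  ];
  // The model-graded suite runs only when an API key is present and non-empty.
  const key = env.OPENAI_API_KEY;
  if (typeof key === "string" && key.trim() !== "") {
    suites.push({
      name: "model",
      config: "promptfoo/promptfooconfig.model.yaml",
      blocking: false,
    });
  }
  return suites;
}
```

In a Node script this would typically be called as `planSuites(process.env)`.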

CI observability outputs

The workflow emits explicit START <step> / END <step> status=<pass|fail> markers around each major stage (including the checkout and upload action boundaries) and publishes the following artifacts for review:

promptfoo-rag-eval-artifacts-<run_id> artifact bundle
Deterministic suite outputs (json, html, md) under promptfoo-artifacts/deterministic/
Model-graded suite outputs (json, html, md) under promptfoo-artifacts/model/ (when executed)
Compact machine-readable run summary at promptfoo-artifacts/run-summary.txt
Artifact index (path|bytes) at promptfoo-artifacts/artifact-index.txt
Human-readable gate table in GITHUB_STEP_SUMMARY
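
To make the marker and index formats concrete, here is a minimal sketch of both. The real workflow may implement these in shell steps. The helper names `runStage` and `formatArtifactIndex` are hypothetical, and only the `START`/`END ... status=` marker text and the `path|bytes` layout are taken from this plan.

```javascript
// Sketch: wrap a stage with START/END markers.
// ASSUMPTION: runStage is an illustrative helper, not part of the workflow.
function runStage(step, fn, log = console.log) {
  log(`START ${step}`);
  try {
    const result = fn();
    log(`END ${step} status=pass`);
    return result;
  } catch (err) {
    // Emit the END marker before propagating so logs always stay paired.
    log(`END ${step} status=fail`);
    throw err;
  }
}

// Sketch: render the artifact index as one "path|bytes" line per file.
function formatArtifactIndex(entries) {
  return entries.map(({ path, bytes }) => `${path}|${bytes}`).join("\n");
}
```

Emitting the END marker on the failure path as well means a log search for an unmatched START points directly at a stage that was interrupted rather than one that failed cleanly.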

Preflight validation hard-fails before any suite runs when required runtime config is missing, and writes the missing keys to run-summary.txt for rapid triage.
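
A preflight check of this kind could look roughly like the sketch below. The required key names are supplied by the caller because this plan does not list them, and the `key=value` summary lines are an assumed format for run-summary.txt, not the confirmed one. The function names are hypothetical.

```javascript
// Sketch: report required config keys that are unset or blank.
function findMissingKeys(requiredKeys, env) {
  return requiredKeys.filter((key) => {
    const value = env[key];
    return typeof value !== "string" || value.trim() === "";
  });
}

// Sketch: build the preflight verdict plus summary lines.
// ASSUMPTION: the "preflight=" / "missing_keys=" line format is illustrative.
function preflightSummary(requiredKeys, env) {
  const missing = findMissingKeys(requiredKeys, env);
  const ok = missing.length === 0;
  const summary = [
    `preflight=${ok ? "pass" : "fail"}`,
    `missing_keys=${missing.join(",")}`,
  ].join("\n");
  return { ok, missing, summary };
}
```

A caller would append `summary` to run-summary.txt and exit with a non-zero status when `ok` is false, so the failure surfaces before any evaluation time is spent.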

The deterministic suite remains the merge gate. The model-graded suite is informational and non-blocking.
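
The gating rule can be expressed as a short sketch: a failure blocks the merge only if it comes from a blocking suite. The function name `evaluateGate`, the input shape, and the table columns are assumptions for illustration. The table uses Markdown, which GITHUB_STEP_SUMMARY renders.

```javascript
// Sketch: compute the merge verdict and a gate table from suite results.
// ASSUMPTION: results look like { name, passed, blocking }; columns are illustrative.
function evaluateGate(results) {
  // Only blocking suites can prevent a merge.
  const mergeAllowed = results.every((r) => r.passed || !r.blocking);
  const rows = results.map(
    (r) => `| ${r.name} | ${r.passed ? "pass" : "fail"} | ${r.blocking ? "yes" : "no"} |`
  );
  const table = ["| Suite | Result | Blocking |", "| --- | --- | --- |", ...rows].join("\n");
  return { mergeAllowed, table };
}
```

With this rule, a failing model-graded run still appears in the table for reviewers but leaves the merge verdict unchanged.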

Next improvements

Add a golden dataset tied to real curriculum chunks.
Add retrieval-score drift alerting across releases.
Add per-grade pass-rate dashboards.