Skip to content

Promptfoo RAG Evaluation Plan (Backend)

Goal

Introduce production-facing regression tests for the BacMR backend RAG workflow using Promptfoo.

What is covered

  • Retrieval relevance by subject/grade/namespace
  • Multilingual request/response behavior (Arabic/French)
  • Safety and prompt-injection resistance
  • Clarification and fallback behavior in edge cases

Files

  • promptfoo/promptfooconfig.yaml (deterministic suite)
  • promptfoo/promptfooconfig.model.yaml (model-graded suite)
  • promptfoo/tests/*.yaml (test cases)
  • promptfoo/transforms/response.js (response extraction)
  • .github/workflows/promptfoo-rag-eval.yml (CI workflow)

Operating model

  1. Run deterministic suite on every promptfoo-related PR.
  2. Run the workflow manually via workflow_dispatch when needed.
  3. Run post-merge smoke checks on main for promptfoo/workflow path changes.
  4. Run model-graded suite when OPENAI_API_KEY is available.
  5. Expand datasets as curriculum coverage grows.

CI observability outputs

The workflow now emits explicit START <step> / END <step> status=<pass|fail> markers for each major stage (including checkout/upload action boundaries) and publishes artifacts for review:

  • promptfoo-rag-eval-artifacts-<run_id> artifact bundle
  • Deterministic suite outputs (json, html, md) under promptfoo-artifacts/deterministic/
  • Model-graded suite outputs (json, html, md) under promptfoo-artifacts/model/ (when executed)
  • Compact machine-readable run summary at promptfoo-artifacts/run-summary.txt
  • Artifact index (path|bytes) at promptfoo-artifacts/artifact-index.txt
  • Human-readable gate table in GITHUB_STEP_SUMMARY

Preflight validation now hard-fails early when required runtime config is missing and writes missing keys into run-summary.txt for rapid triage.

Deterministic suite remains the merge gate. Model-graded suite is informational/non-blocking.

Next improvements

  • Add golden dataset tied to real curriculum chunks.
  • Add retrieval-score drift alerting across releases.
  • Add per-grade pass-rate dashboards.