Notes from production
Eval · CI/CD · RAG

Why “faithfulness ≥ 0.90” should gate your deploys

A demo that sounds right and a system that is right are different things. The gap between them is an eval you actually enforce. Faithfulness — does the answer stay grounded in the retrieved context — is the single most useful number to gate a release on.

Turn the score into a gate

Score a representative evaluation set on every change, compare the result against an agreed threshold (we like ≥ 0.90 to start), and fail the pipeline when the score falls below it. The number stops being a vanity metric the moment a red build blocks a merge.
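As a rough sketch of what that gate can look like, assuming you already have one faithfulness score between 0 and 1 per example from whatever scorer you use: the names `gate`, `main`, and `THRESHOLD` are illustrative, not from any particular library, and averaging the scores is one aggregation choice among several.

```python
from statistics import mean

# Starting threshold from the text; treat it as a team-agreed setting.
THRESHOLD = 0.90


def gate(scores, threshold=THRESHOLD):
    """Return (passed, mean_score) for per-example faithfulness scores in [0, 1]."""
    if not scores:
        # An empty eval set should never count as a pass.
        raise ValueError("no scores supplied")
    avg = mean(scores)
    return avg >= threshold, avg


def main(scores):
    """Print a one-line verdict and return a process exit code (0 = pass, 1 = fail)."""
    passed, avg = gate(scores)
    verdict = "PASS" if passed else "FAIL"
    print(f"faithfulness={avg:.3f} threshold={THRESHOLD:.2f} {verdict}")
    return 0 if passed else 1
```

In a CI step you would end the script with something like `sys.exit(main(scores))`, so a failing score produces a non-zero exit code and the build goes red.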

When a release fails the gate, you have three honest moves: improve retrieval, constrain generation, or abstain. Shipping anyway is not on the list.
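Of the three, abstaining is the easiest to sketch in a few lines. The version below assumes you can score a single draft answer against its retrieved context at request time. `score_fn` is a stand-in for your own scorer, and `ABSTAIN_MESSAGE` and the per-answer cutoff are hypothetical values to tune, not recommendations.

```python
from typing import Callable

# Hypothetical fallback text; word it to suit your product.
ABSTAIN_MESSAGE = "I can't answer that from the available sources."


def answer_or_abstain(
    draft: str,
    context: str,
    score_fn: Callable[[str, str], float],
    min_score: float = 0.90,  # assumed per-answer cutoff, separate from the release gate
) -> str:
    """Return the draft if it scores as grounded in the context, else abstain."""
    if score_fn(draft, context) >= min_score:
        return draft
    return ABSTAIN_MESSAGE
```

The scorer is passed in as an argument so the same wrapper works with whichever faithfulness metric you already run offline.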

Wire it into CI next to your tests, report the trend every sprint, and the conversation shifts from “does it feel good” to “did it pass.”