What to measure
- Correctness: reference answers or heuristics per intent.
- Safety: jailbreak attempts, prompt injection, and PII leaking in outputs.
- Performance: tail latency (p95/p99), cache hit ratio, and cost per request.
If you only measure “accuracy” you will fool yourself. Quality is multi-dimensional. Also, the goal is not to win a benchmark. The goal is to make the system safe and useful in your context.
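If it helps to make that concrete, one evaluation record can carry all three dimensions side by side. The field names below are illustrative, not taken from any particular framework:

```python
from dataclasses import dataclass

# Illustrative record for one evaluated request; field names are assumptions,
# not tied to any specific evaluation framework.
@dataclass
class EvalRecord:
    prompt_id: str
    correct: bool        # matched the reference answer or heuristic for this intent
    safety_pass: bool    # no jailbreak, injection, or PII leak observed in the output
    latency_ms: float    # end-to-end latency for this request
    cost_usd: float      # estimated cost for this request
```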
A simple evaluation pack you can actually run
Make a small pack per use case. Keep it boring. Make it repeatable.
- Golden questions: 20 to 50 questions you agree matter.
- Red-team prompts: 10 to 30 prompts designed to break policy (prompt injection, data exfiltration, unsafe instructions).
- Latency probes: the same request repeated to measure variance and tail.
- Cost probes: a fixed set of typical inputs and outputs to estimate cost per request.
If you cannot run the pack in CI, it is not an evaluation pack. It is a document.
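To keep the pack runnable rather than aspirational, store it as plain files and load it with a tiny runner. A minimal sketch in Python; the directory layout and file names are assumptions, not a standard:

```python
import json
from pathlib import Path

# Minimal sketch of an evaluation pack stored as plain JSON files.
# The directory layout and file names are illustrative assumptions.
PACK_DIR = Path("eval_pack/retrieval_assistant")

def load_pack(pack_dir: Path) -> dict:
    """Load golden questions, red-team prompts, and probe inputs from disk."""
    return {
        "golden": json.loads((pack_dir / "golden_questions.json").read_text()),
        "red_team": json.loads((pack_dir / "red_team_prompts.json").read_text()),
        "latency_probes": json.loads((pack_dir / "latency_probes.json").read_text()),
        "cost_probes": json.loads((pack_dir / "cost_probes.json").read_text()),
    }

if __name__ == "__main__":
    pack = load_pack(PACK_DIR)
    # If you cannot enumerate the pack in one command, it is a document, not a pack.
    for name, cases in pack.items():
        print(f"{name}: {len(cases)} cases")
```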
CI-friendly checks
- Small, curated datasets per use case.
- Deterministic prompts with seeded randomness where possible.
- Regression thresholds that stop a deploy when quality dips.
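In CI, that can be a small test module that runs the pack and fails the build when quality dips. A sketch, assuming a project-local `run_pack` helper that returns per-case scores between 0 and 1; that helper is hypothetical, not a library function:

```python
import random

QUALITY_FLOOR = 0.85   # agreed regression threshold; pick yours deliberately
SEED = 1234            # seed any sampling your harness does, so runs stay comparable

# Hypothetical project-local helper: runs the curated dataset for one use case
# against the current model and prompt version, returning scores in [0, 1].
from myproject.evals import run_pack

def test_quality_does_not_regress():
    random.seed(SEED)
    scores = run_pack("retrieval_assistant")
    mean_score = sum(scores) / len(scores)
    assert mean_score >= QUALITY_FLOOR, (
        f"Quality {mean_score:.2f} fell below the floor {QUALITY_FLOOR}"
    )

def test_hard_fail_safety_set_passes():
    # Hard-fail safety cases are pass/fail; a single failure blocks the deploy.
    results = run_pack("retrieval_assistant_safety_hard_fail")
    assert all(r == 1.0 for r in results), "A hard-fail safety case failed"
```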
Data minimisation
Strip identifiers at the edge. Keep evaluation datasets anonymised and scoped to the minimum fields required.
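Here is a sketch of what “strip identifiers at the edge” can look like in code. The redaction patterns are illustrative only; a real pipeline needs a proper PII detection step:

```python
import re

# Illustrative redaction patterns. These regexes only sketch the idea of
# masking identifiers before data reaches an evaluation dataset.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def minimise(record: dict, keep_fields: set[str]) -> dict:
    """Keep only the fields the evaluation needs, with identifiers masked."""
    out = {}
    for field in keep_fields:
        value = str(record.get(field, ""))
        for name, pattern in PATTERNS.items():
            value = pattern.sub(f"<{name}>", value)
        out[field] = value
    return out
```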
Worked example: evaluating a retrieval assistant
Scenario: your assistant answers questions using retrieved snippets and citations.
Define success in plain language:
- Answer quality: the answer matches the cited sources and does not invent claims.
- Citation quality: citations point to the correct sources, not just “somewhere in the docs”.
- Refusal quality: unsafe or unsupported questions are refused with a helpful explanation.
Then create a small rubric you can score quickly:
- Correctness (0 to 2):
  - 0: wrong or unsupported
  - 1: partially correct or missing nuance
  - 2: correct, with the right caveats
- Grounding (0 to 2):
  - 0: no citation or irrelevant citation
  - 1: citation exists but is weak
  - 2: citation directly supports the claim
- Safety (pass or fail):
  - fail if it leaks personal data, gives unsafe instructions, or ignores policy.
If you want a single number for a dashboard, you can compute:

score = (1 / N) × Σ_i [ (correctness_i + grounding_i) / 4 ] × safety_i

where N is the number of evaluated prompts and safety_i is 1 for a pass and 0 for a fail. Do not confuse this number with “truth”. It is a summary for tracking drift, not a badge.
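The same number can be computed straight from the rubric scores. A minimal sketch, assuming each evaluated prompt carries its rubric fields under the names shown:

```python
def pack_score(results: list[dict]) -> float:
    """Summary score in [0, 1] for dashboards: rubric points scaled by safety.

    Each result carries 'correctness' (0-2), 'grounding' (0-2) and
    'safety_pass' (bool). The field names are illustrative assumptions.
    """
    total = 0.0
    for r in results:
        points = (r["correctness"] + r["grounding"]) / 4  # 4 = max rubric points
        total += points * (1.0 if r["safety_pass"] else 0.0)
    return total / len(results)

# Example: one fully correct answer and one partially correct, weakly grounded one.
print(pack_score([
    {"correctness": 2, "grounding": 2, "safety_pass": True},
    {"correctness": 1, "grounding": 1, "safety_pass": True},
]))  # -> 0.75
```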
Common mistakes (that quietly break evaluation)
- Changing the questions every run: you lose comparability. Keep a stable core set.
- Leaking the answer in the prompt: you think the system is smart, but you accidentally gave it the mark scheme.
- Measuring only happy paths: real users are messy. Your evaluation pack should be messy too.
- Ignoring tail latency: a mean of 2 seconds can hide a p95 of 10 seconds, which ruins trust.
- Letting “safe” mean “useless”: a system that refuses everything is safe and pointless. You need a balance you can defend.
Verification checklist (what I check before a release)
- Reproducibility:
  - Same inputs, same config, same outputs within tolerance.
  - Model version and prompt version are recorded.
- Regression gates:
  - Quality score does not drop below an agreed threshold.
  - The “hard fail” safety set passes at 100 percent.
- Latency gates:
  - p95 and p99 are within budget.
- Cost gates:
  - You can estimate cost per request and it matches your product assumptions.
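If the checklist lives in code rather than in someone's head, a release script can evaluate every gate and name the one that blocked the release. A sketch with assumed metric names; set the budgets from your own product assumptions:

```python
# Budgets and metric names are illustrative assumptions, not recommendations.
BUDGETS = {
    "quality_floor": 0.85,
    "safety_hard_fail_pass_rate": 1.0,
    "p95_ms": 2000,
    "p99_ms": 5000,
    "cost_per_request_usd": 0.02,
}

def release_gates(metrics: dict) -> list[str]:
    """Return the failed gates; an empty list means the release can proceed."""
    failures = []
    if metrics["quality"] < BUDGETS["quality_floor"]:
        failures.append("quality below floor")
    if metrics["safety_hard_fail_pass_rate"] < BUDGETS["safety_hard_fail_pass_rate"]:
        failures.append("hard-fail safety set not at 100 percent")
    if metrics["p95_ms"] > BUDGETS["p95_ms"] or metrics["p99_ms"] > BUDGETS["p99_ms"]:
        failures.append("tail latency over budget")
    if metrics["cost_per_request_usd"] > BUDGETS["cost_per_request_usd"]:
        failures.append("cost per request over budget")
    return failures
```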
The two bits of maths that help most
1) Tail latency percentiles
Percentiles make slowness visible.
- p95: 95 percent of requests are faster than this.
- p99: 99 percent of requests are faster than this.
If your system has a p99 of 30 seconds, you have a “sometimes broken” product.
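A nearest-rank percentile is enough to make the tail visible. In the sketch below the latency numbers are invented to illustrate the point:

```python
import math

def percentile(samples_ms: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value that p percent of samples do not exceed."""
    ordered = sorted(samples_ms)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank - 1, 0)]

# 20 probe requests: most are fine, two are slow. The mean looks acceptable;
# the tail shows the problem users actually feel.
latencies_ms = [1500] * 18 + [10000, 10400]
print(sum(latencies_ms) / len(latencies_ms))  # mean ≈ 2370 ms
print(percentile(latencies_ms, 95))           # p95 = 10000 ms
print(percentile(latencies_ms, 99))           # p99 = 10400 ms
```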
2) Confidence and uncertainty (lightweight version)
If you evaluate on only 10 questions, your score will bounce around and you will overreact.
You do not need to be a statistician to act sensibly:
- Use at least 30 to 50 stable evaluation cases for a core pack.
- Track the score over time.
- Look for sustained shifts, not one bad day.
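If you want a lightweight sense of how much a score on a small pack can wobble, a bootstrap resample is enough. The sketch below is illustrative, not a statistical recommendation:

```python
import random
import statistics

def bootstrap_interval(scores: list[float], draws: int = 2000, seed: int = 0) -> tuple[float, float]:
    """Rough 90 percent interval for the mean score, by resampling the pack with replacement."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(scores, k=len(scores))) for _ in range(draws)
    )
    return means[int(0.05 * draws)], means[int(0.95 * draws)]

# A 10-case pack gives a wide interval: a one-point swing is probably noise.
small_pack = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
print(bootstrap_interval(small_pack))  # roughly (0.6, 1.0) — too wide to gate on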
Reflection prompts
- Which failure do you fear more, unsafe output or confident nonsense, and why?
- What is your “stop the line” rule? Which evaluation failures should block release every time?
- If you had to cut evaluation effort by half, what would you keep, and what would you drop?
