You can unit-test a function: same input, same output, green or red. You cannot unit-test an LLM the same way — the same prompt can give different answers, "correct" is fuzzy, and a tweak that fixes one case silently breaks ten others. So teams ship on vibes: someone tries a few prompts, it "seems better," it goes out. That doesn't scale and it hides regressions. Evals are the discipline that replaces vibes with measurement — a repeatable way to score how well an LLM app or agent does its job, so you can iterate with evidence instead of hope. This piece is about building that: datasets, grading methods, judging with models, evaluating agents, and wiring it all into CI and production.
- You can't unit-test a probabilistic system — evals are statistical: run many cases, score them, track the rate. One example proves nothing.
- The eval set is the asset. A curated "golden set" of realistic cases (input → expected/criteria), grown from production traces, is worth more than any single metric.
- Four ways to grade: heuristic/exact, code-based assertions, LLM-as-judge, and human review — use the cheapest one that's trustworthy for each criterion.
- LLM-as-judge scales human judgment for fuzzy criteria, but it has biases (position, verbosity, self-preference) and must be calibrated against human labels.
- Agents need trajectory evals, not just final-answer checks — did it pick the right tools, in a sane order, without looping or burning budget?
- Two loops: offline evals gate releases (CI regression), online evals (A/B + guardrails) catch what the offline set misses.
- Evals can be gamed. Overfit to your set and you improve the number, not the product — keep a held-out set and refresh from real traffic.
An eval harness is a test suite for non-determinism: a dataset of cases, a way to run your system on each, one or more graders (heuristic, code, LLM-judge, human) that score the output, and aggregation that turns scores into a metric you compare against a baseline. Build the dataset from real traffic, grade fuzzy things with an LLM-judge you've calibrated to humans, evaluate agents on their trajectory not just the answer, gate releases on an offline set, and keep online evals running because production always surprises you.
Why Evals, Not Vibes
Three properties of LLM apps break normal testing. They're non-deterministic (the same input varies, so a single pass tells you little). "Correct" is open-ended (there's rarely one right string — a summary can be good in many ways). And changes have non-local effects (improving a prompt for one case quietly degrades others). The consequence is that you must evaluate statistically: run a population of cases and track the rate of success, the way you'd track a model's accuracy — not eyeball one output. Evals are how a team answers "is this version actually better?" with a number instead of an argument.
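To make "track the rate" concrete, here is a minimal sketch of reporting a pass rate together with an uncertainty interval, so a small eval set is not over-read. It assumes each case reduces to pass/fail and uses a Wilson score interval. The `results` list is made-up illustration data, not output from any real system.

```python
import math

def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95% confidence)."""
    if total == 0:
        raise ValueError("need at least one case")
    p = passes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Hypothetical eval run: True means the case passed.
results = [True] * 42 + [False] * 8
rate = sum(results) / len(results)
low, high = wilson_interval(sum(results), len(results))
print(f"pass rate {rate:.2f}, interval [{low:.2f}, {high:.2f}]")
```

With only 50 cases the interval is wide, which is one reason a difference of a point or two between versions may not mean much on its own.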
What to Evaluate: the Eval Pyramid
Don't only test the whole thing end to end. Like a test pyramid, mix granularities: many cheap component evals (does the retriever return the right docs? does the router pick the right tool? does the output parse as valid JSON?), fewer end-to-end evals (given a user request, is the final answer good?), and a thin layer of expensive human review. Component evals localize failures — when end-to-end drops, they tell you which stage broke — which is exactly what you need in a multi-step pipeline like RAG or an agent.
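Two of the component checks mentioned above can be sketched in a few lines. This assumes the retriever returns a ranked list of document IDs and that each case lists the IDs a human marked as relevant. The function names and IDs are illustrative, not part of any particular framework.

```python
import json

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant doc IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    hits = relevant & set(retrieved[:k])
    return len(hits) / len(relevant)

def parses_as_json(text: str) -> bool:
    """Component check for a structured-output stage: does the text parse at all?"""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# Hypothetical case: the retriever found one of the two relevant docs.
retriever_score = recall_at_k(["d3", "d7", "d1"], {"d1", "d9"}, k=3)
format_ok = parses_as_json('{"answer": "42"}')
```

Scores like these are cheap enough to run on every case, and when the end-to-end number drops they point at the stage that moved.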
The Eval Dataset
Everything rests on the dataset, and a good one is the real moat. A case is an input plus a way to judge the output — sometimes an exact expected answer, more often a set of criteria or a reference to compare against. Where do cases come from?
- Production traces — the best source. Mine real user inputs (especially failures and thumbs-down) and curate them into cases. Your eval set should look like your traffic.
- Hand-written edge cases — the tricky inputs you know are hard: ambiguous queries, adversarial prompts, empty/garbage input, the long tail.
- Synthetic generation — an LLM can draft candidate cases to bootstrap coverage, but a human must review them or you're grading against made-up truth.
Two rules keep the set honest: set aside a held-out slice you never tune on (so you measure generalization, not memorization), and keep refreshing the set from new traffic (a static set goes stale as usage shifts). Aim for coverage of the input distribution and the failure modes, not a huge count: a few hundred well-chosen cases beat thousands of near-duplicates.
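One way to keep a held-out slice stable as the set grows is to assign each case by hashing its ID, so a case never migrates between the tuning and held-out slices when new cases are added. This is a sketch under that assumption. The `case-N` IDs and the 20% fraction are placeholders.

```python
import hashlib

def is_held_out(case_id: str, fraction: float = 0.2) -> bool:
    """Deterministically map a case ID to [0, 1) and hold out the lowest `fraction`."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < fraction

# Hypothetical case IDs; in practice these would come from curated traces.
cases = [f"case-{i}" for i in range(1000)]
held_out = [c for c in cases if is_held_out(c)]
tuning = [c for c in cases if not is_held_out(c)]
print(len(tuning), len(held_out))
```

Because the assignment depends only on the ID, rerunning the split next month gives the same answer for every existing case.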
How to Grade an Output
For each case you need a grader that turns an output into a score. There are four kinds, listed from cheapest and most deterministic to most expensive:
| Grader | When to use it |
|---|---|
| Heuristic / exact | There's a known answer or pattern: exact match, regex, contains-keyword, valid JSON, classification accuracy. Cheap, deterministic — use whenever possible. |
| Code-based assertion | Correctness is checkable by running code: does the generated SQL execute and return the right rows? does the code pass tests? Strong and objective. |
| LLM-as-judge | The criterion is fuzzy and language-y: is this summary faithful? is this answer helpful and on-tone? Scales human-like judgment cheaply. |
| Human review | The gold standard for nuance and for calibrating the others; too slow/costly for every run, so sample it. |
The art is matching grader to criterion: don't pay an LLM-judge to check something a regex can verify, and don't pretend a keyword match captures "is this explanation clear." Many real evals combine graders — a code check for format plus an LLM-judge for quality.
    scores = []
    for case in dataset:
        out = system.run(case.input)  # the app/agent under test
        s = {
            "format": is_valid_json(out),                     # heuristic
            "correct": run_tests(out, case.tests),            # code-based
            "helpful": judge.score(case.input, out, rubric),  # LLM-as-judge
        }
        scores.append(s)
    report = aggregate(scores)  # pass-rate per criterion
    assert report.correct >= baseline.correct  # CI gate: no regression
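The `aggregate` step in the loop above is left undefined. One plausible version, assuming every criterion has already been reduced to pass/fail (a numeric judge score would be thresholded first), might look like this. It returns a plain dict rather than an object with attributes, which is a simplification for the sketch.

```python
from collections import defaultdict

def aggregate(scores: list[dict[str, bool]]) -> dict[str, float]:
    """Pass rate per criterion across all cases."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for case_scores in scores:
        for criterion, passed in case_scores.items():
            totals[criterion] += 1
            passes[criterion] += bool(passed)
    return {criterion: passes[criterion] / totals[criterion] for criterion in totals}

# Hypothetical per-case scores in the same shape the loop builds.
example_scores = [
    {"format": True, "correct": True, "helpful": True},
    {"format": True, "correct": False, "helpful": True},
    {"format": True, "correct": True, "helpful": False},
    {"format": False, "correct": True, "helpful": True},
]
report = aggregate(example_scores)
```

Keeping the rates separate per criterion, rather than averaging them into one number, makes it visible which kind of failure moved.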
LLM-as-Judge, Done Carefully
For the fuzzy criteria — faithfulness, helpfulness, tone, "did it follow instructions" — a strong model can grade outputs at a fraction of the cost and latency of humans. This is LLM-as-judge, and it's what makes evaluating open-ended generation tractable. But a naive judge is unreliable, so treat it like a model you're deploying: design it, then validate it.
Make the judge reliable
- Give it a rubric. Don't ask "is this good?"; give explicit criteria and a scale ("score 1–5 on faithfulness: 5 = every claim supported by the source…"). Ask for a rationale before the score, which tends to improve consistency.
- Prefer pairwise when you can. "Which answer is better, A or B?" is usually more reliable than absolute 1–5 scores, and it's exactly what you want when comparing two versions.
- Know the biases. Judges can favor an answer because of where it appears, often the first slot (position bias; randomize or swap the order), longer answers (verbosity bias), and outputs from their own model family (self-preference). Mitigate, don't ignore.
- Calibrate against humans. Have humans label a sample, then check the judge agrees (e.g. measure agreement/correlation). A judge you haven't validated is just another unverified model in your pipeline.
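A common mitigation for position bias in pairwise judging is to ask twice with the order swapped and only count a win when both orderings agree. The sketch below assumes a hypothetical judge callable that returns `"first"` or `"second"`. The two stand-in judges are toys for illustration, where a real one would call a model.

```python
from typing import Callable

# Hypothetical judge signature: (question, first_answer, second_answer) -> "first" | "second"
Judge = Callable[[str, str, str], str]

def debiased_preference(judge: Judge, question: str, a: str, b: str) -> str:
    """Judge both orderings; return "a", "b", or "tie" if the verdict flips with position."""
    a_wins_when_first = judge(question, a, b) == "first"
    a_wins_when_second = judge(question, b, a) == "second"
    if a_wins_when_first and a_wins_when_second:
        return "a"
    if not a_wins_when_first and not a_wins_when_second:
        return "b"
    return "tie"

def always_first(question: str, x: str, y: str) -> str:
    return "first"  # toy judge with pure position bias

def prefers_longer(question: str, x: str, y: str) -> str:
    return "first" if len(x) >= len(y) else "second"  # toy judge with verbosity bias

position_result = debiased_preference(always_first, "q", "short", "a much longer answer")
verbosity_result = debiased_preference(prefers_longer, "q", "short", "a much longer answer")
```

The swap neutralizes the position-biased toy judge (it yields a tie), but the verbosity-biased one still confidently picks the longer answer, so order swapping addresses only one of the biases listed above.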
The judge is part of your system, so it needs its own eval. "We use an LLM to grade" is not a quality claim until you can say "and it agrees with human labels X% of the time." Calibration is what separates a trustworthy judge from circular self-assessment.
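As a sketch of what "agrees with human labels X% of the time" could look like in practice, the function below computes raw agreement plus Cohen's kappa, which discounts the agreement two raters would reach by chance. The label lists are invented for illustration.

```python
from collections import Counter

def agreement_and_kappa(human: list[str], judge: list[str]) -> tuple[float, float]:
    """Raw agreement rate and Cohen's kappa between human and judge labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / n**2
    if expected == 1.0:
        return observed, 1.0  # both raters used one identical label throughout
    return observed, (observed - expected) / (1 - expected)

# Hypothetical labels on the same eight outputs.
human_labels = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass"]
judge_labels = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "pass"]
observed, kappa = agreement_and_kappa(human_labels, judge_labels)
print(f"agreement {observed:.2f}, kappa {kappa:.2f}")
```

Raw agreement can look high simply because most outputs pass, which is why a chance-corrected figure is worth reporting alongside it.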
Evaluating Agents, Not Just Answers
Agents (see how agents work) raise the bar: a multi-step agent can reach a good final answer through a terrible path — or fail in the middle in ways a final-answer check misses. So evaluate the trajectory, not only the outcome:
- Task success — did it ultimately accomplish the goal? (the end-to-end metric)
- Tool-call correctness — did it call the right tools with the right arguments, and recover when one failed?
- Trajectory quality — a sane sequence of steps, no pointless loops, no repeating a failing action.
- Efficiency — steps, tokens, latency, and cost to get there; an agent that succeeds in 40 steps and $2 is worse than one that does it in 6 steps and 20¢.
Practically, you log the full trajectory (every prompt, tool call, and observation) and grade it — heuristics for tool-call validity and step/cost budgets, an LLM-judge for "was this a reasonable plan." This is why harness engineering and evals are joined at the hip: you tune the harness against trajectory evals.
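The heuristic half of trajectory grading can be sketched as follows, assuming each logged step records the tool name, its serialized arguments, whether the call succeeded, and its cost. The `Step` shape, tool names, and budgets are hypothetical, and a real trace format would carry more fields.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    tool: str
    args: str         # serialized arguments, e.g. a JSON string
    ok: bool          # did the tool call succeed?
    cost_usd: float

def grade_trajectory(steps: list[Step], allowed_tools: set[str],
                     max_steps: int, max_cost_usd: float) -> dict[str, bool]:
    """Cheap checks on a logged trajectory: tool validity, retry loops, budgets."""
    # A failed call immediately repeated with the same tool and arguments.
    repeats_failure = any(
        not prev.ok and (prev.tool, prev.args) == (cur.tool, cur.args)
        for prev, cur in zip(steps, steps[1:])
    )
    return {
        "valid_tools": all(s.tool in allowed_tools for s in steps),
        "no_repeated_failure": not repeats_failure,
        "within_step_budget": len(steps) <= max_steps,
        "within_cost_budget": sum(s.cost_usd for s in steps) <= max_cost_usd,
    }

# Hypothetical trajectory: the agent retries a failing fetch unchanged.
steps = [
    Step("search", '{"q": "refund policy"}', True, 0.01),
    Step("fetch", '{"url": "doc-17"}', False, 0.02),
    Step("fetch", '{"url": "doc-17"}', False, 0.02),
    Step("answer", "{}", True, 0.05),
]
trajectory_report = grade_trajectory(steps, {"search", "fetch", "answer"},
                                     max_steps=10, max_cost_usd=0.50)
```

Here the agent stays within budget and uses only allowed tools, yet fails `no_repeated_failure`, the kind of problem a final-answer check would not surface.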
Offline Gates and Online Evals
Evals run in two loops, and you need both.
Offline — gate the release
Run the eval suite in CI on every meaningful change (prompt edit, model swap, retrieval tweak). The suite reports pass-rates per criterion and compares against the current baseline; a drop blocks the merge, the same way a failing unit test does. This is what lets you change prompts without fear — the set catches the silent regressions.
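A regression gate of this kind might be sketched as a comparison of per-criterion pass rates against a stored baseline. The small `tolerance` is an assumption added here so that run-to-run noise does not block every merge. The rates shown are invented.

```python
def find_regressions(current: dict[str, float], baseline: dict[str, float],
                     tolerance: float = 0.02) -> list[str]:
    """List criteria whose pass rate fell more than `tolerance` below baseline."""
    regressions = []
    for criterion, base_rate in baseline.items():
        rate = current.get(criterion, 0.0)  # a missing criterion counts as a regression
        if rate < base_rate - tolerance:
            regressions.append(f"{criterion}: {base_rate:.2f} -> {rate:.2f}")
    return regressions

# Hypothetical pass rates for the baseline and the candidate change.
baseline_rates = {"format": 0.99, "correct": 0.87, "helpful": 0.80}
current_rates = {"format": 0.99, "correct": 0.81, "helpful": 0.79}
regressions = find_regressions(current_rates, baseline_rates)
for line in regressions:
    print("REGRESSION", line)
# In CI, a non-empty list would fail the job, for example by exiting non-zero.
```

The one-point dip in `helpful` passes under the tolerance while the six-point drop in `correct` is flagged. How wide to set the tolerance depends on how many cases back each rate.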
Online — production never matches your set
Offline evals only cover what you thought to include, and real traffic always drifts, so evaluate in production too: A/B test a new version against the old on live traffic with real outcome metrics; run cheap guardrail evals on a sample of live outputs (an online LLM-judge flagging unfaithful or unsafe responses); and capture user feedback (thumbs, edits, regenerations) as a continuous, if noisy, quality signal. The failures you find online become tomorrow's offline cases — closing the loop in the diagram above.
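For the A/B comparison, a rough way to judge whether a difference in success rates is more than noise is a two-proportion z-test. This sketch assumes a binary outcome per session (for example, task resolved or not) and independent traffic in each arm. The counts are invented.

```python
import math

def two_proportion_z(success_a: int, n_a: int,
                     success_b: int, n_b: int) -> tuple[float, float]:
    """z statistic and two-sided p-value for the success-rate difference (B minus A)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # every outcome identical, so no evidence of a difference
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p_value

# Hypothetical experiment: old version A vs new version B, 1000 sessions each.
z, p_value = two_proportion_z(410, 1000, 450, 1000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

A four-point lift on a thousand sessions per arm lands near, but not clearly past, the conventional 0.05 threshold, which illustrates why online comparisons often need more traffic than intuition suggests.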
Pitfalls and Tradeoffs
- Overfitting to the eval set. Tune long enough against a fixed set and you improve the score, not the product. Hold out a slice, refresh from traffic, and watch online metrics too.
- An unvalidated judge. Trusting LLM-as-judge without calibrating to humans bakes the judge's biases into every decision. Validate, then re-validate when you change the judge model.
- Contaminated or stale data. Cases that leaked into a model's training data inflate scores; cases that no longer resemble traffic mislead you. Curate and rotate.
- Optimizing one number. A single average hides trade-offs — quality can rise while cost or latency explodes, or one slice regresses while the mean improves. Track a small dashboard (per-criterion, per-slice), not one figure.
- Cost of evals. LLM-judge and big suites cost money and time; tier them — fast cheap checks on every commit, the full expensive suite nightly or pre-release.
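The "one slice regresses while the mean improves" pitfall is easy to reproduce with made-up numbers. The sketch below assumes each case is tagged with a slice name. The slices and counts are hypothetical.

```python
from collections import defaultdict

def pass_rate_by_slice(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Pass rate per slice from (slice_name, passed) pairs."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for slice_name, passed in results:
        totals[slice_name] += 1
        passes[slice_name] += passed
    return {name: passes[name] / totals[name] for name in totals}

def overall(results: list[tuple[str, bool]]) -> float:
    return sum(passed for _, passed in results) / len(results)

# Hypothetical results before and after a prompt change.
before = ([("billing", True)] * 80 + [("billing", False)] * 20
          + [("refunds", True)] * 9 + [("refunds", False)] * 1)
after = ([("billing", True)] * 90 + [("billing", False)] * 10
         + [("refunds", True)] * 6 + [("refunds", False)] * 4)

print(f"overall: {overall(before):.2f} -> {overall(after):.2f}")
print("before:", pass_rate_by_slice(before))
print("after: ", pass_rate_by_slice(after))
```

The overall rate rises because the large `billing` slice improved, while the small `refunds` slice drops from 0.9 to 0.6. Only the per-slice view shows it.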
Evals turn "seems better" into "is better, by this much." Build a realistic dataset from production, grade each criterion with the cheapest trustworthy method (heuristic → code → calibrated LLM-judge → human), evaluate agents on their trajectory and cost rather than just the final answer, gate releases on an offline suite, and keep online evals running because the eval set never fully matches reality. The team that can measure quality iterates faster than the team arguing about it.
Why can't you unit-test an LLM app? It's non-deterministic and "correct" is open-ended, so you evaluate statistically — a rate over many cases — not one input/output.
What's the most important part of an eval? The dataset — a curated golden set grown from production traffic, with a held-out slice you never tune on.
How do you grade fuzzy outputs? LLM-as-judge with an explicit rubric (prefer pairwise), validated against human labels and de-biased for position/verbosity.
How do you eval an agent? Beyond final-answer success, grade the trajectory: tool-call correctness, sane step order, no loops, and step/token/cost budgets.
Offline vs online evals? Offline suites gate releases in CI (catch regressions); online A/B + guardrail evals + user feedback catch what the offline set misses and feed new cases back.
Biggest risk? Overfitting to the eval set or trusting an uncalibrated judge — you optimize the metric instead of the product.