What Are Evals?

Not testing. Not benchmarks. The discipline of knowing whether your AI system actually works.


Quick Answer

Evals are structured tests that measure an AI system's outputs against clear success criteria. They can be automated, human-reviewed, or LLM-judged, and they serve both as release gates and as ongoing production monitors.

TL;DR

  • Evals define what success looks like for a real task.
  • They use representative data and rubrics to score outputs.
  • They are used both pre-release and in production monitoring.

FAQ

Are evals the same as benchmarks?

No. Benchmarks are generic and static; evals are task-specific, risk-aware, and updated with live data as the system changes.

How much data do I need for an eval?

Start small: 50 to 200 labeled examples per critical slice can reveal regressions. Expand as you learn the failure modes.

What is the fastest way to start?

Pick one high-risk workflow, define a rubric, label a small dataset, and run a simple baseline before each release.
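That starting point can be sketched in a few lines of code. Everything below is illustrative, not a real eval library: `run_eval`, the rubric checks, the dataset shape, and the stubbed system all stand in for your own components.

```python
# Minimal sketch of a pre-release baseline eval, assuming a small
# hand-labeled dataset and a rubric of pass/fail checks. All names
# here are invented for illustration.

def contains_required_facts(output: str, required: list[str]) -> bool:
    """Rubric check: every required fact appears in the output."""
    return all(fact.lower() in output.lower() for fact in required)

def avoids_forbidden_claims(output: str, forbidden: list[str]) -> bool:
    """Rubric check: no disallowed claim appears in the output."""
    return not any(claim.lower() in output.lower() for claim in forbidden)

def run_eval(system, dataset: list[dict]) -> float:
    """Score each labeled example against its rubric; return the pass rate."""
    passed = 0
    for example in dataset:
        output = system(example["input"])
        ok = (contains_required_facts(output, example["required"])
              and avoids_forbidden_claims(output, example["forbidden"]))
        passed += ok
    return passed / len(dataset)

# Usage: a stub stands in for the real model call.
dataset = [
    {"input": "What is your return policy?",
     "required": ["30 days"],
     "forbidden": ["no returns"]},
]
stub = lambda question: "You can return items within 30 days of purchase."
print(run_eval(stub, dataset))  # 1.0
```

Running this before each release gives you the baseline the FAQ describes; any drop in the pass rate is a regression to investigate.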

Evals Are Not What You Think

If you come from traditional software engineering, you might hear "evaluation" and think unit tests. If you come from machine learning research, you might think benchmarks. Both intuitions will mislead you.

Unit tests verify deterministic behavior: given input X, the function must return Y. They work because software is predictable. add(2, 3) always returns 5. But ask an LLM to summarize a legal contract, and you'll get a different answer every time—even with the same prompt, the same model, the same temperature. The output is probabilistic, not deterministic.

Benchmarks (MMLU, HumanEval, GPQA) measure general model capability. They tell you whether GPT-4 is smarter than GPT-3.5 on average, across thousands of academic questions. But they tell you nothing about whether your system, with your prompts, over your data, produces outputs your users can trust.

Evals are neither. They sit in the gap between generic benchmarks and deterministic tests—the gap where production AI systems actually live.

A Working Definition

Evals are systematic measurements of whether an AI system meets specific, domain-relevant quality bars—run continuously, against real-world conditions, to catch degradation before users do.

The Probabilistic Problem

Traditional software has a comforting property: it either works or it doesn't. A REST API either returns the right JSON or it throws an error. You can write a test, and it passes or fails.

AI systems don't have this property. They exist on a spectrum:

  • Sometimes right, sometimes wrong. The same question might get a correct answer 9 out of 10 times—and a confidently wrong answer on the 10th.
  • Right in different ways. Two valid summaries of the same document can look completely different. Which one is "correct"?
  • Wrong in invisible ways. A hallucinated citation looks exactly like a real one. A subtly incorrect legal interpretation reads just as fluently as the right one.
  • Right today, wrong tomorrow. A model update, a prompt change, a shift in user behavior—any of these can silently degrade quality without triggering a single error.

This is why traditional QA fails for AI. You can't write an assertion for "the summary is good." You need a different approach entirely.

Why Traditional QA Doesn't Work

Consider a customer support chatbot. In traditional software, you'd test:

traditional_test.py
# Traditional QA: deterministic assertion
# (get_response is a placeholder for the chatbot under test)
def test_return_policy():
    response = get_response("What is your return policy?")
    assert response == "You can return items within 30 days."

This test will fail immediately. The AI might say "Our return window is 30 days from purchase" or "Returns are accepted within a month of buying." Both are correct. The assertion is wrong—not the AI.

Evals solve this by measuring properties of the output rather than matching exact strings:

  • Faithfulness: Does the answer come only from the provided context? (No hallucinations)
  • Completeness: Does it cover the key information the user needs?
  • Safety: Does it avoid giving advice the company shouldn't give?
  • Relevance: Does it actually answer what was asked?

These properties can be measured at scale—by other models, by embedding similarity, by deterministic rules—and tracked over time to catch regression.
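To make that concrete, here is a toy sketch of two such checks built from deterministic rules, with a bag-of-words cosine standing in for embedding similarity. The function names, the tokenizer, and the thresholds are all invented for illustration; a real pipeline would use embeddings or an LLM judge for these properties.

```python
# Toy property checks, assuming deterministic rules only. A real
# system would use embeddings or an LLM judge; these names and
# thresholds are illustrative.
import re
from collections import Counter
from math import sqrt

def tokens(text: str) -> list[str]:
    """Lowercase word/number tokens, punctuation stripped."""
    return re.findall(r"[a-z0-9]+", text.lower())

def faithfulness(answer: str, context: str) -> float:
    """Fraction of answer tokens that also occur in the context:
    a crude proxy for 'the answer draws only on the context'."""
    a, c = set(tokens(answer)), set(tokens(context))
    return len(a & c) / len(a) if a else 0.0

def similarity(answer: str, reference: str) -> float:
    """Bag-of-words cosine between the answer and a gold reference.
    Stands in for embedding similarity in this sketch."""
    a, r = Counter(tokens(answer)), Counter(tokens(reference))
    dot = sum(a[t] * r[t] for t in a)
    norm = (sqrt(sum(v * v for v in a.values()))
            * sqrt(sum(v * v for v in r.values())))
    return dot / norm if norm else 0.0

context = "Items may be returned within 30 days of purchase with a receipt."
reference = "You can return items within 30 days."
answer = "Returns are accepted within 30 days of purchase."

print(faithfulness(answer, context) >= 0.5)  # most answer tokens come from context
print(similarity(answer, reference) > 0.3)   # answer is close to the gold reference
```

Because these checks are cheap and deterministic, they can run over every output in production, not just a pre-release sample.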

"Evals Are Surprisingly Often All You Need"

From the Field

Greg Brockman, OpenAI co-founder, put it bluntly: "Evals are surprisingly often all you need." Not more data. Not a bigger model. Not a fancier architecture. Just the discipline of measuring what matters and iterating on what you find.

This insight cuts against the instinct most teams have. When an AI system underperforms, the default reaction is to reach for a more powerful model, add more training data, or redesign the prompt from scratch. But often the real problem is simpler: you don't know what "good" looks like, so you can't tell whether your changes actually helped.

Evals give you that definition. They turn "I think it's working better" into "faithfulness improved from 0.82 to 0.91 after the prompt change, while safety scores held steady." They make improvement measurable.
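That before/after comparison can be encoded directly as a release gate. A minimal sketch, assuming you have per-metric scores from two eval runs; the metric names, numbers, and tolerance are invented for illustration:

```python
# Sketch of a release gate comparing a candidate build's eval scores
# against the previous baseline. Metric names, scores, and the
# tolerance are illustrative.

def gate(baseline: dict, candidate: dict, tolerance: float = 0.02) -> bool:
    """Pass only if no metric regresses by more than the tolerance."""
    return all(candidate[m] >= baseline[m] - tolerance for m in baseline)

baseline = {"faithfulness": 0.82, "safety": 0.97, "relevance": 0.88}
candidate = {"faithfulness": 0.91, "safety": 0.96, "relevance": 0.89}

print(gate(baseline, candidate))  # True: faithfulness improved, safety held steady
```

The asymmetry is deliberate: improvements pass freely, but any metric that slips beyond the tolerance blocks the release.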

The Shift: From "Does It Work?" to "Can I Trust It?"

The deeper question behind evals isn't "is this output correct?" It's "can I trust this system to keep producing good outputs as conditions change?"

This is a fundamentally different question. Correctness is a snapshot. Trust is a trajectory. It requires:

  • Continuous measurement—not a one-time check before launch
  • Business-relevant metrics—not abstract accuracy scores
  • Failure awareness—knowing what kinds of errors matter most
  • Drift detection—catching degradation before users report it
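The last item in that list can start out very simple: compare a rolling window of recent eval scores against the launch baseline. A minimal sketch with invented numbers and thresholds:

```python
# Minimal drift check, assuming a daily eval score series. Flags
# when the recent-window mean drops a set margin below the launch
# baseline. Window size, margin, and data are illustrative.

def drifted(scores: list[float], baseline: float,
            window: int = 7, margin: float = 0.05) -> bool:
    """True if the mean of the last `window` scores fell below
    baseline - margin."""
    if len(scores) < window:
        return False
    recent = scores[-window:]
    return sum(recent) / window < baseline - margin

daily = [0.90, 0.91, 0.89, 0.88, 0.84, 0.83, 0.82, 0.81, 0.80, 0.79]
print(drifted(daily, baseline=0.90))  # True: quality degraded without any errors
```

Note that no individual day here throws an error or looks alarming on its own; only the aggregate trend reveals the degradation, which is exactly the failure mode described above.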

That's what this handbook is about. Not how to run a benchmark. How to build the ongoing discipline of knowing whether your AI system deserves the trust you're placing in it.

In the next section, we'll look at why this matters more than most teams realize—and what happens when you skip it.