Frequently Asked Questions

Common questions about evaluating AI systems in production.


Quick Answer

This page answers common questions about evaluation setup, ownership, cost, and rollout. Start small, focus on risk, and iterate with real data.

TL;DR

  • Start with one high-risk workflow and a small labeled set.
  • Define a rubric and automate what you can.
  • Scale once you have clear signals and owners.

FAQ Highlights

How do I start evals with limited time?

Pick a single high-impact workflow, define a clear rubric, and evaluate 50 to 200 examples before release.
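
As a concrete illustration, a minimal rubric-driven eval can be a handful of weighted pass/fail criteria applied to a small labeled set. The sketch below is one way to do that; the criteria names, weights, and sample judgments are hypothetical placeholders, not a prescribed rubric.

```python
# Minimal rubric-driven eval sketch; criteria, weights, and data are illustrative assumptions.
RUBRIC = {
    "factually_correct": 0.5,
    "follows_policy": 0.3,
    "clear_and_concise": 0.2,
}

def rubric_score(judgments: dict) -> float:
    """Weighted score for one example; judgments map criterion -> pass (True) / fail (False)."""
    return sum(weight for name, weight in RUBRIC.items() if judgments.get(name, False))

# In practice these judgments come from human labelers or an automated judge
# applied to the 50-200 examples drawn from the chosen workflow.
labeled = [
    {"factually_correct": True, "follows_policy": True, "clear_and_concise": False},
    {"factually_correct": True, "follows_policy": False, "clear_and_concise": True},
]
scores = [rubric_score(j) for j in labeled]
print(f"mean rubric score: {sum(scores) / len(scores):.2f}")
```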

Who should own evals?

A product or eval lead typically owns the rubric and KPIs, with engineering and support closing the loop.

How expensive is evaluation?

Early stages are low cost and mostly labeling time; costs rise only when you scale automated judging.

Basics

Why can't I just use accuracy?

Accuracy treats all errors equally, but in production, a legal compliance error is far more costly than a minor phrasing issue. Consequence weighting surfaces errors that actually matter.
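
To make the contrast concrete, here is a minimal sketch of consequence-weighted scoring next to plain accuracy. The error categories and severity values are illustrative assumptions; use whatever scale reflects your actual risk.

```python
# Plain accuracy vs. consequence-weighted scoring (severity values are illustrative assumptions).
SEVERITY = {
    "none": 0.0,         # correct output
    "phrasing": 0.1,     # minor wording issue
    "factual": 1.0,      # wrong answer
    "compliance": 10.0,  # legal/compliance violation
}

def accuracy(errors: list) -> float:
    """Fraction of outputs with no error -- treats all error types the same."""
    return sum(e == "none" for e in errors) / len(errors)

def weighted_error(errors: list) -> float:
    """Mean severity per output -- surfaces the errors that actually matter."""
    return sum(SEVERITY[e] for e in errors) / len(errors)

errors = ["none"] * 95 + ["phrasing"] * 4 + ["compliance"]
print(f"accuracy: {accuracy(errors):.2%}")               # 95% looks healthy
print(f"weighted error: {weighted_error(errors):.2f}")   # the single compliance error dominates
```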

How many test cases do I need?

Start with 50-100 cases covering your critical scenarios. For production, aim for 500+ with solid edge-case coverage. Quality matters more than quantity.

Production

How do I know if my model is degrading?

Monitor for drift: track query distributions and human escalation rates. Set alerts for unexpected shifts.
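
One lightweight way to implement this is to compare a current traffic window against a reference window and alert on large shifts. The sketch below uses a population stability index over query categories plus an escalation-rate check; the category names, rates, and thresholds are assumptions to tune for your system.

```python
import math

def psi(reference: dict, current: dict, eps: float = 1e-6) -> float:
    """Population stability index between two category distributions (proportions summing to 1)."""
    categories = set(reference) | set(current)
    return sum(
        (current.get(c, 0.0) - reference.get(c, 0.0))
        * math.log((current.get(c, 0.0) + eps) / (reference.get(c, 0.0) + eps))
        for c in categories
    )

# Hypothetical weekly snapshots of query-category proportions and escalation rates.
reference = {"billing": 0.40, "shipping": 0.35, "returns": 0.25}
current = {"billing": 0.20, "shipping": 0.30, "returns": 0.50}

if psi(reference, current) > 0.2:  # common rule-of-thumb threshold for "significant shift"
    print("ALERT: query distribution has shifted")

escalation_baseline, escalation_current = 0.05, 0.09
if escalation_current > 1.5 * escalation_baseline:  # illustrative alerting rule
    print("ALERT: human escalation rate is elevated")
```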

When should humans review AI outputs?

When confidence falls below threshold, in high-stakes domains, and randomly for auditing. See Human-in-the-Loop.
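
A simple routing policy that captures all three triggers might look like the sketch below; the confidence threshold, domain list, and audit rate are illustrative assumptions to tune for your risk tolerance.

```python
import random

HIGH_STAKES_DOMAINS = {"legal", "medical", "finance"}  # illustrative list
CONFIDENCE_THRESHOLD = 0.7                             # tune to your risk tolerance
AUDIT_RATE = 0.02                                      # random sample reviewed for auditing

def needs_human_review(confidence: float, domain: str) -> bool:
    """Route to a human when confidence is low, the domain is high stakes, or the case is sampled for audit."""
    if confidence < CONFIDENCE_THRESHOLD:
        return True
    if domain in HIGH_STAKES_DOMAINS:
        return True
    return random.random() < AUDIT_RATE

print(needs_human_review(confidence=0.55, domain="shipping"))  # True: low confidence
print(needs_human_review(confidence=0.95, domain="legal"))     # True: high-stakes domain
```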