Quick Answer
This page answers common questions about evaluation setup, ownership, cost, and rollout. Start small, focus on risk, and iterate with real data.
TL;DR
- Start with one high-risk workflow and a small labeled set.
- Define a rubric and automate what you can.
- Scale once you have clear signals and owners.
FAQ Highlights
How do I start evals with limited time?
Pick a single high-impact workflow, define a clear rubric, and evaluate 50-200 examples before release.
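A minimal sketch of what scoring that first pass can look like, assuming model outputs have already been hand-graded against the rubric and saved as JSONL; the file name, rubric criteria, and record shape are illustrative, not a prescribed format:

```python
import json

# Illustrative rubric; replace with pass/fail criteria for your own workflow.
RUBRIC = ["answers the question", "no factual errors", "correct tone and format"]

def score(graded_path: str = "graded_outputs.jsonl") -> None:
    """Tally per-criterion pass rates over hand-graded model outputs.

    Each line is assumed to look like:
    {"query": "...", "output": "...", "grades": {"answers the question": true, ...}}
    """
    totals = {criterion: 0 for criterion in RUBRIC}
    n = 0
    with open(graded_path) as f:
        for line in f:
            record = json.loads(line)
            n += 1
            for criterion in RUBRIC:
                # True counts as 1, so this accumulates pass counts.
                totals[criterion] += record["grades"].get(criterion, False)
    for criterion in RUBRIC:
        print(f"{criterion}: {totals[criterion]}/{n} ({totals[criterion] / n:.0%})")

if __name__ == "__main__":
    score()
```

Even at 50 examples, a per-criterion breakdown like this tells you more than a single aggregate score: it shows which rubric dimension is failing.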
Who should own evals?
A product or eval lead typically owns the rubric and KPIs, with engineering and support closing the loop.
How expensive is evaluation?
Early stages are cheap: the main cost is labeling time. Costs rise only when you scale automated judging.
Basics
Why can't I just use accuracy?
Accuracy treats all errors equally, but in production, a legal compliance error is far more costly than a minor phrasing issue. Consequence weighting surfaces errors that actually matter.
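One way to implement consequence weighting is a severity-weighted error score in place of flat accuracy. This is a minimal sketch; the error categories and weights below are made up for illustration and should come from your own risk assessment:

```python
# Illustrative severity weights; calibrate to the real cost of each error type.
SEVERITY_WEIGHTS = {
    "legal_compliance": 10.0,
    "factual_error": 5.0,
    "phrasing": 0.5,
}

def weighted_error_score(errors: list[str]) -> float:
    """Sum severity weights so one compliance miss outweighs many phrasing nits."""
    return sum(SEVERITY_WEIGHTS.get(error, 1.0) for error in errors)

# One legal error (10.0) scores worse than three phrasing issues (1.5):
assert weighted_error_score(["legal_compliance"]) > weighted_error_score(["phrasing"] * 3)
```

Under flat accuracy, both of those responses would count as "one or more errors"; the weighted score is what lets the compliance failure dominate.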
How many test cases do I need?
Start with 50-100 cases covering critical scenarios. For production, aim for 500+ with good edge-case coverage. Quality matters more than quantity.
Production
How do I know if my model is degrading?
Monitor for drift: track query distributions and human escalation rates. Set alerts for unexpected shifts.
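A sketch of the distribution-tracking half, using the Population Stability Index (PSI) over bucketed queries. The bucket labels, window, and 0.25 alert threshold here are illustrative conventions, not fixed rules:

```python
import math
from collections import Counter

def psi(baseline: list[str], current: list[str]) -> float:
    """Population Stability Index between two samples of query buckets
    (e.g., intent labels). Rule of thumb: <0.1 stable, 0.1-0.25 watch,
    >0.25 investigate.
    """
    b_counts, c_counts = Counter(baseline), Counter(current)
    total = 0.0
    for bucket in set(baseline) | set(current):
        # A small floor avoids log/division blowups for buckets unseen on one side.
        p = max(b_counts[bucket] / len(baseline), 1e-6)
        q = max(c_counts[bucket] / len(current), 1e-6)
        total += (q - p) * math.log(q / p)
    return total

# Example: alert when this week's query mix drifts from the launch baseline.
baseline = ["billing"] * 70 + ["shipping"] * 30
this_week = ["billing"] * 40 + ["shipping"] * 60
if psi(baseline, this_week) > 0.25:
    print("ALERT: query distribution drift")
```

Pair a distribution check like this with the simpler signal of escalation rate: a rising share of conversations handed to humans is often the earliest sign of degradation.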
When should humans review AI outputs?
When confidence falls below a threshold, in high-stakes domains, and on a random sample for ongoing auditing. See Human-in-the-Loop.
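A minimal routing sketch combining all three triggers; the threshold, domain list, and audit rate are placeholders to tune per workflow:

```python
import random

CONFIDENCE_THRESHOLD = 0.8      # illustrative; tune per workflow
HIGH_STAKES_DOMAINS = {"legal", "medical", "finance"}
AUDIT_RATE = 0.02               # fraction of traffic randomly sampled for audit

def needs_human_review(confidence: float, domain: str) -> bool:
    """Route to a human when any escalation condition holds."""
    return (
        confidence < CONFIDENCE_THRESHOLD
        or domain in HIGH_STAKES_DOMAINS
        or random.random() < AUDIT_RATE
    )
```

The random audit term matters even when confidence is high: it is what keeps your reviewers sampling the cases the model is sure about, which is where silent failures hide.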