The 5 Levels
Most teams sit at Level 1 or 2. The goal is not perfection but moving up one level at a time, sustainably.
1 Ad Hoc — "We eyeball it"
Evaluation happens informally. Someone spot-checks a few outputs before release.
- No formal test set exists
- Quality assessed by "vibes": a team member reads a few outputs
- No metrics tracked over time
- Failures discovered by customers, not evaluations
- No defined owner for evaluation
2 Defined — "We have a test set"
A golden set exists, basic metrics are tracked, but evaluation is manual and sporadic.
- Golden set of 50-200 examples maintained
- 1-3 metrics defined (e.g., accuracy, latency, faithfulness)
- Evaluation runs before major releases
- Results shared in release notes or Slack
- One team member informally owns eval
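A Level 2 setup can be as small as one script. The sketch below assumes a golden set of prompt/expected-answer pairs and a single exact-match accuracy metric; `GoldenExample`, `exact_match_accuracy`, and `toy_model` are hypothetical names used for illustration, and a real system would likely need a more forgiving comparison than exact match.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class GoldenExample:
    """One hand-verified example in the golden set (hypothetical schema)."""
    prompt: str
    expected: str


def exact_match_accuracy(
    golden_set: Sequence[GoldenExample],
    generate: Callable[[str], str],
) -> float:
    """Fraction of examples whose normalized output equals the expected answer."""
    if not golden_set:
        raise ValueError("golden set is empty")
    hits = sum(
        generate(ex.prompt).strip().lower() == ex.expected.strip().lower()
        for ex in golden_set
    )
    return hits / len(golden_set)


def toy_model(prompt: str) -> str:
    """Stand-in for the system under test; replace with a real model call."""
    answers = {"capital of France?": "Paris", "2 + 2?": "4"}
    return answers.get(prompt, "unknown")


golden = [
    GoldenExample("capital of France?", "paris"),
    GoldenExample("2 + 2?", "4"),
    GoldenExample("largest planet?", "Jupiter"),
]

accuracy = exact_match_accuracy(golden, toy_model)
print(f"accuracy: {accuracy:.2f}")  # 2 of 3 correct -> accuracy: 0.67
```

Logging this one number at each release is enough to start a trend line, which is the main thing Level 1 lacks.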
3 Automated — "Evals run in CI"
Evaluations are automated and integrated into the development workflow.
- Eval pipeline runs on every PR or nightly
- LLM-as-Judge calibrated against human ratings (≥85% agreement)
- Regression tests for known failures
- Results dashboard accessible to the team
- Formal eval owner with defined responsibilities
- Release gates based on eval metrics
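Two of the Level 3 criteria reduce to simple checks. The sketch below assumes "agreement" means raw percent agreement between judge and human labels on the same items (a chance-corrected statistic such as Cohen's kappa is a stricter alternative) and that a release gate is a set of per-metric minimums. The function names, metric names, and thresholds other than the 85% figure are illustrative assumptions.

```python
from typing import Dict, List, Sequence


def agreement_rate(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Share of items where the LLM judge and the human rater gave the same label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not judge_labels:
        raise ValueError("no labels to compare")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def release_gate(metrics: Dict[str, float], minimums: Dict[str, float]) -> List[str]:
    """Return a description of every metric below its minimum; empty means pass."""
    failures = []
    for name, minimum in minimums.items():
        value = metrics.get(name)
        if value is None or value < minimum:
            failures.append(f"{name}: {value} < {minimum}")
    return failures


# Toy calibration sample: the judge disagrees with the human on one item of ten.
human = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "pass", "pass", "pass", "pass", "fail", "pass", "pass"]

agreement = agreement_rate(judge, human)
judge_is_calibrated = agreement >= 0.85  # threshold taken from the criterion above
print(f"judge agreement: {agreement:.2f}, calibrated: {judge_is_calibrated}")

# Assumed metric names and minimums, for illustration only.
failures = release_gate(
    metrics={"accuracy": 0.91, "faithfulness": 0.82},
    minimums={"accuracy": 0.90, "faithfulness": 0.85},
)
print("gate failures:", failures)
```

In a CI job, a non-empty `failures` list would typically translate into a non-zero exit code so that the PR check fails.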
4 Measured — "We catch drift before users feel it"
Production monitoring closes the loop between deployment and evaluation.
- Query distribution drift monitored in real time
- Embedding drift alerts configured
- Consequence-weighted scoring prioritizes what matters
- Human escalation rate tracked as a leading indicator
- Human corrections feed back into golden set
- Weekly eval reports shared with leadership
- Canary deployments for new model versions
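As a rough illustration of two of these signals, the sketch below measures query distribution drift as the Jensen-Shannon divergence between baseline and live query categories, and computes a consequence-weighted pass rate in which failures on costly categories count for more. The category names, weights, and the 0.1 alert threshold are assumptions chosen for the example, not recommended values; embedding drift is not covered here.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def distribution(labels: Sequence[str], categories: Sequence[str]) -> List[float]:
    """Relative frequency of each category, in the order given by `categories`."""
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in categories]


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint distributions."""
    def kl(a: Sequence[float], b: Sequence[float]) -> float:
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def consequence_weighted_score(
    results: Sequence[Tuple[str, bool]], weights: Dict[str, float]
) -> float:
    """Pass rate where each result counts in proportion to its category's weight."""
    total = sum(weights[category] for category, _ in results)
    earned = sum(weights[category] for category, passed in results if passed)
    return earned / total


categories = ["billing", "howto", "refund"]
baseline = ["billing"] * 50 + ["howto"] * 40 + ["refund"] * 10
live = ["billing"] * 20 + ["howto"] * 30 + ["refund"] * 50

drift = js_divergence(distribution(baseline, categories), distribution(live, categories))
DRIFT_THRESHOLD = 0.1  # assumed alert threshold
drift_alert = drift > DRIFT_THRESHOLD
print(f"drift: {drift:.3f}, alert: {drift_alert}")

# A refund mistake is assumed to cost five times as much as a how-to mistake.
weights = {"refund": 5.0, "billing": 3.0, "howto": 1.0}
results = [("refund", False), ("howto", True), ("howto", True), ("billing", True)]
weighted = consequence_weighted_score(results, weights)
print(f"weighted score: {weighted:.2f}")  # 0.50, versus an unweighted pass rate of 0.75
```

The gap between the weighted and unweighted scores is the point of the exercise: one failure on a high-consequence category outweighs several passes on low-consequence ones.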
5 Optimized — "Evals drive product decisions"
Evaluation is a strategic capability, not just quality assurance.
- Eval metrics directly tied to revenue, NPS, and support costs
- A/B testing uses eval scores as guardrails
- Golden sets auto-refresh with production data
- Cross-functional eval review board (PM, Eng, Legal, Support)
- Eval insights drive product roadmap decisions
- Compliance and regulatory evals automated
- Organization-wide eval standards and shared tooling
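Using eval scores as A/B guardrails can be sketched as a decision rule: a treatment ships only if it improves the business metric and does not lower the eval score by more than an agreed tolerance. The field names, the 0.02 tolerance, and the three outcomes below are illustrative assumptions, and the sketch deliberately ignores statistical significance, which a real experiment would need to check.

```python
from dataclasses import dataclass


@dataclass
class VariantResult:
    """Aggregated results for one arm of an experiment (hypothetical schema)."""
    conversion_rate: float  # business metric being optimized
    eval_score: float       # quality metric used as the guardrail


def guardrail_decision(
    control: VariantResult,
    treatment: VariantResult,
    max_eval_drop: float = 0.02,
) -> str:
    """Return "block", "ship", or "hold" for the treatment arm."""
    # The guardrail is checked first: a quality regression blocks the rollout
    # no matter how good the business metric looks.
    if treatment.eval_score < control.eval_score - max_eval_drop:
        return "block"
    if treatment.conversion_rate > control.conversion_rate:
        return "ship"
    return "hold"


control = VariantResult(conversion_rate=0.120, eval_score=0.90)
risky = VariantResult(conversion_rate=0.135, eval_score=0.86)
safe = VariantResult(conversion_rate=0.135, eval_score=0.89)

print(guardrail_decision(control, risky))  # block
print(guardrail_decision(control, safe))   # ship
```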
Quick Self-Assessment
Answer these questions to identify your current level:
Assessment Questions
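One way to score a self-assessment like this is to treat the levels as cumulative: your current level is the highest one whose criteria, and those of every level below it, you meet. The sketch below assumes one yes/no question per level; the questions are illustrative paraphrases of the level criteria above, not an official questionnaire, and `current_level` is a hypothetical helper.

```python
from typing import Dict

# (level unlocked by a "yes", question); wording is an illustrative assumption.
QUESTIONS = [
    (2, "Do you maintain a golden set and track at least one metric?"),
    (3, "Do evals run automatically on every PR or nightly and gate releases?"),
    (4, "Do you monitor drift in production and feed corrections back into the golden set?"),
    (5, "Are eval metrics tied to business outcomes and used in product decisions?"),
]


def current_level(answers: Dict[int, bool]) -> int:
    """Highest level reached without skipping one; unanswered questions count as "no"."""
    level = 1
    for target, _question in QUESTIONS:
        if not answers.get(target, False):
            break
        level = target
    return level


# A team with CI evals but no production monitoring lands at Level 3,
# even if it answers "yes" to the Level 5 question.
print(current_level({2: True, 3: True, 4: False, 5: True}))  # 3
```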
Maturity Roadmap Template
| Current Level | Target Level | Key Actions | Timeline | Owner |
|---|---|---|---|---|
| Level 1 | Level 2 | Build golden set, define 1 metric, assign eval owner | 2 weeks | PM |
| Level 2 | Level 3 | Automate pipeline, add LLM judge, set release gates | 1 month | Eng Lead |
| Level 3 | Level 4 | Add drift monitoring, consequence weighting, human-in-the-loop (HITL) review | 2 months | ML Eng |