## Quick Answer
Reliable evals need curated data and stable baselines. This page shows how to build golden sets, set thresholds, and refresh them as production data drifts.
## TL;DR
- Sample real queries and label them with a clear rubric.
- Track baseline metrics and define release thresholds.
- Refresh datasets on a cadence tied to drift and product change.
## FAQ
### How big should a golden set be?
Start with 50 to 100 examples for core flows, then grow toward a few hundred as you add slices for edge cases and high-risk intents.
### How often should I refresh eval data?
At least monthly for dynamic systems, and immediately after major policy or product changes.
### What is a baseline?
A baseline is the current best-known performance used for comparison. It anchors regression checks and release gates.
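A release gate against a baseline can be sketched in a few lines. This is a minimal illustration, not a specific framework's API; the metric names and the regression tolerance are example values you would tune yourself.

```python
# Minimal release-gate sketch: compare a candidate run's metrics against a
# stored baseline and block the release if any metric regresses too far.
# BASELINE values and MAX_REGRESSION are illustrative, not recommendations.
BASELINE = {"faithfulness": 0.92, "similarity": 0.88}
MAX_REGRESSION = 0.02  # tolerate small noise, block real regressions

def passes_release_gate(candidate: dict, baseline: dict = BASELINE,
                        max_regression: float = MAX_REGRESSION) -> bool:
    """True if no metric drops more than max_regression below its baseline."""
    return all(
        candidate.get(metric, 0.0) >= score - max_regression
        for metric, score in baseline.items()
    )
```

A candidate scoring `{"faithfulness": 0.93, "similarity": 0.87}` passes (each metric is within tolerance), while one scoring `{"faithfulness": 0.85, "similarity": 0.90}` is blocked.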
## What a Golden Set Is (and Isn't)
A Golden Set is a curated baseline of representative inputs and expected outcomes. It is not a full map of everything the system can answer.
| Golden Set Is | Golden Set Isn't |
|---|---|
| A fixed baseline for regression testing | A complete list of every possible question |
| Focused on critical user journeys | A single snapshot that never changes |
| Owned and versioned internally | A customer-facing promise of coverage |
## Why Golden Sets Rot
Most golden sets become useless within months. Three forces cause decay:
### 1. Staleness
Your product changes, policies update, and new features launch. Questions that were critical 6 months ago may no longer be relevant. Expected answers reference outdated information.
### 2. Distribution Shift
Early users ask different questions than mature users. Your golden set reflects the queries from launch, not the queries from today. Month 3 users ask edge cases your month 1 set doesn't cover.
### 3. Scope Creep
Teams add test cases ad-hoc without retiring old ones. The set grows to 500+ cases, runs take forever, and nobody knows which cases actually matter.
### The Rot Symptom
When your eval suite passes consistently but users still complain, your golden set has rotted. You're testing the past while users live in the present.
## Building a Golden Set from Customer Reality
Start with 50 to 100 cases, sourced and scoped as follows:
### From customer questions to an internal golden set
PMs collect representative questions using an intake template. Engineering turns those into a golden set for evals.
- Start with top workflows: 10-20 questions that map to your most common user intents.
- Include edge cases: high-risk or policy-sensitive asks.
- Pull from reality: sales demos, support tickets, and production logs (anonymized).
- Define success: for each question, specify what a "good" answer must include or avoid.
```json
[
  {
    "input": "I want to return my order #12345",
    "context": ["Return Policy: Returns allowed within 30 days..."],
    "expected_output": "I can help with that. Is the item in its original condition?",
    "tags": ["support", "returns"],
    "tier": "high"
  }
]
```
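Before any eval run, it is worth validating that every case carries the fields the run expects, so one malformed entry fails loudly instead of silently skewing scores. A minimal sketch; the field names mirror the example entry above, and this is not any specific eval framework's API:

```python
# Minimal validation sketch for golden-set entries. Field names mirror the
# JSON example above; adjust REQUIRED_FIELDS to your own schema.
REQUIRED_FIELDS = {"input", "expected_output", "tags", "tier"}

def validate_golden_set(cases: list[dict]) -> list[dict]:
    """Raise on the first malformed case; return the cases unchanged if valid."""
    for i, case in enumerate(cases):
        missing = REQUIRED_FIELDS - case.keys()
        if missing:
            raise ValueError(f"case {i} is missing fields: {sorted(missing)}")
    return cases
```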
## Synthetic vs Real Data
Both have a place. Neither is sufficient alone.
| Aspect | Real Data | Synthetic Data |
|---|---|---|
| Distribution | Matches actual usage | May miss real patterns |
| Edge cases | Hard to get enough volume | Can generate at scale |
| Adversarial | Limited malicious examples | Easy to create attacks |
| Privacy | Requires anonymization | No PII concerns |
| Cost | Expensive to label | Cheap to generate |
### The Hybrid Approach
Use real data for your core golden set (50-100 cases that define success). Use synthetic data to stress-test edges: adversarial inputs, rare intents, and safety scenarios you can't source from production.
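To keep the two sources from blurring, report scores per slice rather than as one blended number, so a synthetic stress failure never masks (or inflates) the core regression signal. A small sketch; the `source` field is an assumption, tagged on each case when it is created:

```python
from collections import defaultdict

def scores_by_source(results: list[dict]) -> dict[str, float]:
    """Average a metric separately per slice (e.g. "real" vs "synthetic"),
    so core-set regressions stay visible next to synthetic stress results."""
    totals: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for r in results:
        totals[r["source"]] += r["score"]
        counts[r["source"]] += 1
    return {src: totals[src] / counts[src] for src in totals}
```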
## Selecting Metrics
Avoid "kitchen sink" metrics. Choose 2-3 that actually map to business value.
| Metric | What it measures | When to use |
|---|---|---|
| Faithfulness | Does the answer come only from the provided context? | RAG systems (prevents hallucinations). |
| Semantic Similarity | Is the meaning close to the Golden Answer? (embedding distance) | Q&A where accurate phrasing matters. |
| Tool Call Safety | Did the agent call `delete_db` with the correct parameters? | Agentic/action-based systems. |
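Semantic similarity is typically computed as cosine similarity between embedding vectors. A minimal sketch; `embed` is a placeholder for whatever embedding model you use, not a real library call:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def semantic_similarity(actual: str, expected: str, embed) -> float:
    """Score how close the model's answer is to the golden answer in meaning.
    `embed` is assumed: any function mapping text to a vector of floats."""
    return cosine_similarity(embed(actual), embed(expected))
```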
## Refresh Strategies Without Chasing Noise
Golden sets are living artifacts. But you can't refresh constantly, or you'll lose the baseline.
### Quarterly Review Cadence
- Add new cases when new workflows or intents emerge.
- Retire cases when they no longer reflect real usage.
- Re-label when expected answers are outdated.
- Version every change so you can compare across time.
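Versioning can be as simple as a deterministic content hash logged with every eval run, so a score change can always be attributed to the model or to the data, never ambiguously to both. A sketch:

```python
import hashlib
import json

def golden_set_version(cases: list[dict]) -> str:
    """Deterministic content hash of a golden set. Log it alongside every
    eval run so results are only compared across identical datasets."""
    canonical = json.dumps(cases, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
```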
### Signals That Trigger Refresh
- New product feature or policy change.
- Drift alert from production monitoring.
- Support escalations in a new category.
- Eval scores plateau while user complaints rise.
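One cheap drift signal is comparing the intent distribution of the golden set against a recent sample of production traffic. A sketch using total variation distance; the intent labels themselves are an assumption (e.g. produced by your router or tagging pipeline), and any alerting threshold would need tuning:

```python
from collections import Counter

def intent_drift(golden_intents: list[str], recent_intents: list[str]) -> float:
    """Total variation distance between two intent distributions:
    0.0 = identical mix, 1.0 = completely disjoint. A rising value suggests
    the golden set no longer mirrors production traffic."""
    g, r = Counter(golden_intents), Counter(recent_intents)
    g_total, r_total = sum(g.values()), sum(r.values())
    intents = set(g) | set(r)
    return 0.5 * sum(abs(g[i] / g_total - r[i] / r_total) for i in intents)
```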
## The Evaluation Loop
Run your evals as part of your CI/CD pipeline, not just locally on your laptop.
```python
for example in golden_dataset:
    # 1. Generate
    actual_output = model.generate(example["input"])

    # 2. Evaluate (parallelize this in production)
    faith_score = measure_faithfulness(actual_output, example["context"])
    sim_score = measure_similarity(actual_output, example["expected_output"])

    # 3. Log
    logger.log({
        "input": example["input"],
        "scores": {"faithfulness": faith_score, "similarity": sim_score},
    })
```
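In CI, the loop's per-case scores are usually reduced to a pass/fail exit code so the pipeline can block a bad release automatically. A minimal sketch; the threshold is an example value, not a recommendation:

```python
# Illustrative CI gate: aggregate per-case scores and fail the pipeline
# when the suite average falls below a fixed threshold.
def gate(scores: list[float], threshold: float = 0.85) -> int:
    """Return a process exit code: 0 passes CI, 1 blocks the release."""
    mean = sum(scores) / len(scores)
    return 0 if mean >= threshold else 1

# In a CI step you would call something like: sys.exit(gate(faithfulness_scores))
```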
For real-world context, see Query Drift and Embedding Drift.