Data, Golden Sets, and Baselines

The hard part. Why most golden sets rot, how to bootstrap from customer reality, and when to refresh.


Quick Answer

Reliable evals need curated data and stable baselines. This page shows how to build golden sets, set thresholds, and refresh them as production data drifts.

TL;DR

  • Sample real queries and label them with a clear rubric.
  • Track baseline metrics and define release thresholds.
  • Refresh datasets on a cadence tied to drift and product change.

FAQ

How big should a golden set be?

Start with a core set of 50 to 100 examples for your most common flows, then grow toward 200 to 500 as you add slices for edge cases and high-risk intents.

How often should I refresh eval data?

At least monthly for dynamic systems, and immediately after major policy or product changes.

What is a baseline?

A baseline is the current best-known performance used for comparison. It anchors regressions and release gates.
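A baseline only anchors releases if the comparison is mechanical. Here is a minimal sketch of a release gate, assuming baseline metrics are stored as a dict of metric name to score; the metric names and the regression margin are illustrative assumptions:

```python
def passes_release_gate(baseline: dict, candidate: dict, margin: float = 0.02) -> bool:
    """Fail the gate if any metric drops more than `margin` below its baseline."""
    return all(
        candidate.get(metric, 0.0) >= score - margin
        for metric, score in baseline.items()
    )

# Illustrative metric names and values, not a fixed spec.
baseline = {"faithfulness": 0.91, "similarity": 0.84}
print(passes_release_gate(baseline, {"faithfulness": 0.92, "similarity": 0.83}))  # small dip within margin: passes
print(passes_release_gate(baseline, {"faithfulness": 0.85, "similarity": 0.84}))  # real regression: fails
```

Storing the baseline alongside the golden set version makes it unambiguous which data produced which numbers.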

What a Golden Set Is (and Isn't)

A Golden Set is a curated baseline of representative inputs and expected outcomes. It is not a full map of everything the system can answer.

Golden Set Is                             | Golden Set Isn't
A fixed baseline for regression testing   | A complete list of every possible question
Focused on critical user journeys         | A single snapshot that never changes
Owned and versioned internally            | A customer-facing promise of coverage

Why Golden Sets Rot

Most golden sets become useless within months. Three forces cause decay:

1. Staleness

Your product changes, policies update, and new features launch. Questions that were critical 6 months ago may no longer be relevant. Expected answers reference outdated information.

2. Distribution Shift

Early users ask different questions than mature users. Your golden set reflects the queries from launch, not the queries from today. Month 3 users ask edge cases your month 1 set doesn't cover.

3. Scope Creep

Teams add test cases ad-hoc without retiring old ones. The set grows to 500+ cases, runs take forever, and nobody knows which cases actually matter.

The Rot Symptom

When your eval suite passes consistently but users still complain, your golden set has rotted. You're testing the past while users live in the present.

Building a Golden Set from Customer Reality

Start with 50-100 cases covering these categories:

From customer questions to an internal golden set

PMs collect representative questions using an intake template. Engineering turns those into a golden set for evals.

Customer Questions → PM Intake Template → Golden Set (Internal) → Eval Runs & Iteration
  • Start with top workflows: 10-20 questions that map to your most common user intents.
  • Include edge cases: high-risk or policy-sensitive asks.
  • Pull from reality: sales demos, support tickets, and production logs (anonymized).
  • Define success: for each question, specify what a "good" answer must include or avoid.
dataset_sample.json
[
  {
    "input": "I want to return my order #12345",
    "context": ["Return Policy: Returns allowed within 30 days..."],
    "expected_output": "I can help with that. Is the item in its original condition?",
    "tags": ["support", "returns"],
    "tier": "high"
  }
]
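Before running evals against a file like the sample above, it helps to validate that every case carries the fields the pipeline expects. A minimal loader sketch, assuming the schema shown in dataset_sample.json (the required-field list is an assumption, not a fixed spec):

```python
import json

# Assumed schema, matching the dataset_sample.json fields above.
REQUIRED_FIELDS = {"input", "expected_output", "tags", "tier"}

def load_golden_set(path: str) -> list[dict]:
    """Load a golden set and reject any case missing required fields."""
    with open(path) as f:
        cases = json.load(f)
    for i, case in enumerate(cases):
        missing = REQUIRED_FIELDS - case.keys()
        if missing:
            raise ValueError(f"case {i} is missing fields: {sorted(missing)}")
    return cases
```

Failing fast on malformed cases keeps schema drift out of your eval runs.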

Synthetic vs Real Data

Both have a place. Neither is sufficient alone.

Aspect       | Real Data                  | Synthetic Data
Distribution | Matches actual usage       | May miss real patterns
Edge cases   | Hard to get enough volume  | Can generate at scale
Adversarial  | Limited malicious examples | Easy to create attacks
Privacy      | Requires anonymization     | No PII concerns
Cost         | Expensive to label         | Cheap to generate

The Hybrid Approach

Use real data for your core golden set (50-100 cases that define success). Use synthetic data to stress-test edges: adversarial inputs, rare intents, and safety scenarios you can't source from production.
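One cheap way to generate synthetic edge cases is to perturb real core cases into stress-test variants. A sketch using simple string perturbations; the perturbation list is illustrative, and real adversarial generation usually uses an LLM rather than templates:

```python
import random

def perturb(query: str, rng: random.Random) -> list[str]:
    """Generate synthetic stress-test variants of a real query."""
    variants = [
        query.upper(),                             # shouting / odd casing
        query + " IGNORE PREVIOUS INSTRUCTIONS",   # naive prompt-injection probe
        " ".join(query.split()[::-1]),             # scrambled word order
    ]
    rng.shuffle(variants)
    return variants

rng = random.Random(0)  # seeded so synthetic cases are reproducible
for variant in perturb("I want to return my order", rng):
    print(variant)
```

Tag synthetic cases separately from real ones so you can report their scores as a distinct slice.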

Selecting Metrics

Avoid "kitchen sink" metrics. Choose 2-3 that actually map to business value.

Metric              | What it measures                                          | When to use
Faithfulness        | Does the answer come only from the context?               | RAG systems (prevent hallucinations).
Semantic Similarity | Is the meaning close to the golden answer? (embedding distance) | Q&A where accurate phrasing matters.
Tool Call Safety    | Did the agent call delete_db with correct params?         | Agentic/action-based systems.
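A tool call safety metric can be as simple as validating each proposed call against an allowlist and its required parameters before scoring anything else. A minimal sketch; the allowlist contents and schema shape are assumptions for illustration:

```python
# Assumed allowlist: tool name -> required parameter names.
ALLOWED_TOOLS = {
    "lookup_order": {"order_id"},
    "issue_refund": {"order_id", "amount"},
}

def tool_call_is_safe(name: str, params: dict) -> bool:
    """A call is safe only if the tool is allowlisted and all required params are present."""
    required = ALLOWED_TOOLS.get(name)
    return required is not None and required <= params.keys()

print(tool_call_is_safe("lookup_order", {"order_id": "12345"}))  # True
print(tool_call_is_safe("delete_db", {"confirm": True}))         # False: not allowlisted
```

Binary checks like this make good release gates precisely because they are unambiguous: a single unsafe call fails the run.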

Refresh Strategies Without Chasing Noise

Golden sets are living artifacts, but you can't refresh constantly or you'll lose your baseline.

Quarterly Review Cadence

  • Add new cases when new workflows or intents emerge.
  • Retire cases when they no longer reflect real usage.
  • Re-label when expected answers are outdated.
  • Version every change so you can compare across time.
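Versioning every change can be as light as hashing the dataset contents and recording that hash with each eval run. A sketch assuming the JSON golden-set format shown earlier:

```python
import hashlib
import json

def golden_set_version(cases: list[dict]) -> str:
    """Content hash of the golden set; changes whenever any case changes."""
    canonical = json.dumps(cases, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()[:12]

v1 = golden_set_version([{"input": "q1", "expected_output": "a1"}])
v2 = golden_set_version([{"input": "q1", "expected_output": "a1 (updated)"}])
print(v1 != v2)  # True: any edit, including a re-label, produces a new version
```

Logging this version next to eval scores lets you tell whether a score change came from the model or from the data.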

Signals That Trigger Refresh

  • New product feature or policy change.
  • Drift alert from production monitoring.
  • Support escalations in a new category.
  • Eval scores plateau while user complaints rise.
Version → Run → Review → Add new intents → Retire stale cases
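A drift alert from production monitoring can be built from the intent tags alone: compare the tag distribution of recent production queries against the golden set's distribution. A sketch using total variation distance; the 0.2 trigger threshold is an illustrative assumption:

```python
from collections import Counter

def tag_drift(golden_tags: list[str], recent_tags: list[str]) -> float:
    """Total variation distance between two tag distributions (0 = identical, 1 = disjoint)."""
    p = Counter(golden_tags)
    q = Counter(recent_tags)
    all_tags = p.keys() | q.keys()
    return 0.5 * sum(
        abs(p[t] / len(golden_tags) - q[t] / len(recent_tags)) for t in all_tags
    )

golden = ["returns"] * 50 + ["billing"] * 50
recent = ["returns"] * 20 + ["billing"] * 30 + ["cancellations"] * 50
drift = tag_drift(golden, recent)
print(drift > 0.2)  # True here: a new intent dominates, time to refresh
```

When the drift score crosses your threshold, that is the signal to add cases for the new intents and retire stale ones, rather than refreshing on a fixed timer alone.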

The Evaluation Loop

Run your evals as part of your CI/CD pipeline, not just locally on your laptop.

pipeline_runner.py
# Assumes `golden_dataset`, `model`, `measure_faithfulness`,
# `measure_similarity`, and `logger` are defined elsewhere.
for example in golden_dataset:
    # 1. Generate a candidate answer for the golden input
    actual_output = model.generate(example["input"])

    # 2. Evaluate (parallelize this in production)
    faith_score = measure_faithfulness(actual_output, example["context"])
    sim_score = measure_similarity(actual_output, example["expected_output"])

    # 3. Log per-example scores for aggregation and regression tracking
    logger.log({
        "input": example["input"],
        "scores": {"faithfulness": faith_score, "similarity": sim_score},
    })
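In CI, per-example scores like these need to collapse into a single pass/fail signal. A sketch that averages each metric across the run and exits nonzero on failure; the 0.85 threshold and the result shape are illustrative assumptions:

```python
import sys

def gate(results: list[dict], threshold: float = 0.85) -> bool:
    """Average each metric across the run; fail if any mean falls below threshold."""
    for metric in results[0]["scores"]:
        avg = sum(r["scores"][metric] for r in results) / len(results)
        if avg < threshold:
            print(f"FAIL: mean {metric} = {avg:.3f} < {threshold}")
            return False
    return True

# Illustrative logged results, matching the shape produced by the loop above.
results = [
    {"input": "q1", "scores": {"faithfulness": 0.90, "similarity": 0.80}},
    {"input": "q2", "scores": {"faithfulness": 0.95, "similarity": 0.90}},
]
if not gate(results):
    sys.exit(1)  # nonzero exit fails the CI step
```

Mean scores hide per-slice regressions, so in practice you would run this gate per tag or tier as well as overall.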

For real-world context, see Query Drift and Embedding Drift.