Case Study Archetype: The "Holiday Season" Drift

A composite case study showing how a 92% accurate RAG system dropped to 68% overnight.


Quick Answer

This composite case study shows how holiday query drift collapsed retrieval quality and how evals isolated the intent cluster and guided fixes.

TL;DR

  • Intent mix shifted toward gift returns and policy edge cases.
  • Retrieval mapped holiday policy queries to irrelevant API docs, tanking adherence.
  • Evals drove seasonal routing and policy-first retrieval fixes.

FAQ

Where did the drift occur?

Drift concentrated in holiday-specific intents like gift receipts and no-payment returns.

Which metrics flagged the incident?

Policy adherence and context precision dropped sharply, alongside a spike in escalations and refunds.

What changed after evals?

The system added seasonal intent routing, lexical fallback, and expanded golden sets for holiday queries.

About this case study

  • Composite archetype: Synthesized from multiple production deployments to illustrate real-world eval workflows.
  • Data: Numbers are illustrative and anonymized to show how evals surfaced the issue and quantified impact.
  • System: Retail support RAG assistant handling order, return, and policy questions.

System Snapshot

  • Traffic: ~1.2M sessions/month, 64% self-serve deflection target.
  • Stack: hybrid retrieval (BM25 + embeddings), top-6 chunks, policy-first prompt.
  • Knowledge base: 18k policy and catalog articles, updated weekly.
  • Evaluation: weekly golden set (1,800 queries) + 5% shadow eval on live traffic.
  • Failure cost: incorrect return guidance triggers refunds, chargebacks, and CS escalations.
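The hybrid retrieval in the snapshot (BM25 + embeddings, top-6 chunks) can be sketched with reciprocal rank fusion, one common way to combine a lexical and a dense ranking. This is an illustrative sketch only; the function name, the k constant, and the document IDs are assumptions, not the production implementation.

```python
# Sketch of hybrid retrieval via reciprocal rank fusion (RRF).
# Doc IDs and rankings are made up for illustration.

def rrf_fuse(bm25_ranking, embedding_ranking, k=60, top_n=6):
    """Fuse two ranked lists of doc IDs; higher RRF score = better."""
    scores = {}
    for ranking in (bm25_ranking, embedding_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every doc it ranks.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

bm25 = ["policy_returns", "policy_gift_receipt", "api_receipts", "catalog_faq"]
dense = ["api_receipts", "api_webhooks", "policy_returns", "catalog_faq"]
fused = rrf_fuse(bm25, dense)
print(fused)
```

A design note relevant to the incident: when the dense ranking goes wrong (as it did for "gift receipt" queries), RRF still lets the lexical ranking pull policy documents back into the top results, which is the same intuition behind the lexical fallback added later.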

The Timeline

  • Oct 18: normal ops, 92% accuracy
  • Nov 24: Black Friday, 85%
  • Nov 25: incident, 68%; alerts fire, drift discovered
  • Nov 27: fix deployed, 94%

Accuracy collapsed in under 24 hours as the query mix flipped from order status to gift returns, exchanges, and holiday policy edge cases.

Where the Drift Happened (and Why It Mattered)

Drift was not global; it was concentrated in a new intent cluster. The semantic retriever mapped “gift receipt” and “no original payment method” to developer docs about API receipt objects, starving the LLM of policy context.

Intent Cluster                        Share (Oct)   Share (Black Friday)   Impact
Order tracking                        58%           22%                    Medium
Returns & exchanges                   12%           34%                    High
Gift receipts / no payment method     3%            18%                    Critical
Promo / refund policy edge cases      7%            14%                    High
Fraud / chargeback                    2%            7%                     Critical

Query embedding drift score (Jensen-Shannon divergence on intent clusters) jumped from 0.08 to 0.42, crossing the 0.25 alert threshold.
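A minimal sketch of how a drift score like this can be computed, assuming intent-share distributions as input. The toy shares below are taken from the table above plus a catch-all "other" bucket, and only illustrate the shape of the calculation; they will not reproduce the production 0.42 figure, which came from embedding-cluster distributions.

```python
# Illustrative Jensen-Shannon divergence between two intent distributions.
import math

def js_divergence(p, q):
    """JS divergence (base 2, bounded in [0, 1]) between discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Order tracking, returns, gift receipts, promo edge cases, fraud, other.
october = [0.58, 0.12, 0.03, 0.07, 0.02, 0.18]
black_friday = [0.22, 0.34, 0.18, 0.14, 0.07, 0.05]

score = js_divergence(october, black_friday)
ALERT_THRESHOLD = 0.25  # threshold from the case study
print(f"drift={score:.2f}, alert={score > ALERT_THRESHOLD}")
```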

Eval Design That Caught It

  • Shadow eval sampled 5% of live traffic and scored with a policy adherence rubric.
  • Severity weighting: refund and fraud intents carried 5x penalty vs order tracking.
  • Retrieval checks: context precision >= 0.70 required for “policy” intents.
  • Alerts triggered on any 10+ point drop in adherence or 2x escalation rate.
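The severity weighting and alert rules above can be sketched as follows; the intent names, weight values other than the stated 5x, and record format are assumptions for illustration.

```python
# Severity-weighted adherence scoring and the alert rule from the bullets:
# refund/fraud carry 5x penalty, alert on a 10+ point drop or 2x escalations.

SEVERITY = {"order_tracking": 1.0, "returns": 3.0, "refund": 5.0, "fraud": 5.0}

def weighted_adherence(records):
    """records: (intent, adherent: bool) pairs from the shadow eval sample."""
    total = sum(SEVERITY.get(intent, 1.0) for intent, _ in records)
    earned = sum(SEVERITY.get(intent, 1.0) for intent, ok in records if ok)
    return 100.0 * earned / total

def should_alert(baseline_pct, current_pct, baseline_esc, current_esc):
    # Fire on a 10+ point adherence drop OR a 2x escalation-rate spike.
    return (baseline_pct - current_pct >= 10.0) or (current_esc >= 2 * baseline_esc)
```

With these rules, the incident numbers (adherence 88% to 41%, escalations 9% to 24%) trip both conditions.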

Measurement Methodology (How This Would Be Measured)

  • Weekly golden set + shadow eval on 5% of live traffic, stratified by intent.
  • Policy adherence scored with a rubric: correct policy, correct exception handling, citations present.
  • Retrieval metrics computed on labeled relevance: precision@k and recall@k by intent slice.
  • Operational impact derived from support logs: escalations, refunds, and CSAT deltas.
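The retrieval metrics in the methodology, precision@k and recall@k sliced by intent, can be sketched like this. The data shapes (retrieved ID lists, labeled relevant-ID sets per query) are assumptions about how the judgments would be stored.

```python
# Per-intent precision@k / recall@k over labeled relevance judgments.
from collections import defaultdict

def precision_at_k(retrieved, relevant, k):
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k):
    if not relevant:
        return 0.0
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)

def by_intent_slice(examples, k=6):
    """examples: list of (intent, retrieved_ids, relevant_id_set) tuples.
    Returns {intent: (mean precision@k, mean recall@k)}."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for intent, retrieved, relevant in examples:
        s = sums[intent]
        s[0] += precision_at_k(retrieved, relevant, k)
        s[1] += recall_at_k(retrieved, relevant, k)
        s[2] += 1
    return {i: (p / n, r / n) for i, (p, r, n) in sums.items()}
```

Slicing by intent is what makes the drift visible: a global context-precision average can look acceptable while the "gift receipts" slice has already collapsed.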

Impact Dashboard

[Image: mock dashboard showing KPIs and intent mix during the query drift incident. Illustrative dashboard, synthetic data.]

Metric               Baseline (Oct)   Drift (Nov 25)   After Fix (Nov 27)   Change during drift
Answer Accuracy      92%              68%              94%                  -24 pts
Policy Adherence     88%              41%              90%                  -47 pts
Context Precision    0.74             0.38             0.76                 -0.36
Escalation Rate      9%               24%              8%                   2.7x increase
Refund Exceptions    1.2%             4.8%             n/a                  4x increase
CSAT (Post-Chat)     4.6              3.2              n/a                  -1.4

What Changed Because of Evals

  1. Added a seasonal intent detector to route “gift receipt” and “no payment method” queries to a dedicated policy pack.
  2. Introduced hybrid retrieval with a lexical fallback for policy keywords.
  3. Inserted a “policy citation required” guardrail; responses without citations auto-escalated.
  4. Expanded the golden set with 400 holiday queries and added a “refund exception” category.
  5. Lowered alert thresholds during known seasonal shifts.
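Fixes 1 and 3 can be sketched together: a keyword-based seasonal intent detector that routes matching queries to a dedicated policy pack, and a citation guardrail that auto-escalates uncited answers. Every name here (patterns, pack label, citation format) is a hypothetical illustration, not the production code.

```python
# Hypothetical seasonal intent router + policy-citation guardrail.
import re

SEASONAL_PATTERNS = [
    r"\bgift receipt\b",
    r"\bno (original )?payment method\b",
    r"\bholiday return\b",
]

def route(query):
    """Send seasonal policy queries to a dedicated policy pack."""
    if any(re.search(p, query.lower()) for p in SEASONAL_PATTERNS):
        return "holiday_policy_pack"
    return "default_retrieval"

def enforce_citations(answer):
    """Guardrail: responses without a policy citation auto-escalate."""
    if not re.search(r"\[policy:[\w-]+\]", answer):
        return {"action": "escalate", "reason": "missing policy citation"}
    return {"action": "respond"}
```
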

Key takeaway

Drift wasn’t about model quality — it was about intent mix. Evals isolated the failure cluster, making the fix targeted and fast.