Case Study Archetype: The "Holiday Season" Drift

A composite case study showing how a 92% accurate RAG system dropped to 68% overnight.


Quick Answer

This composite case study shows how holiday query drift collapsed retrieval quality and how evals isolated the intent cluster and guided fixes.

TL;DR

  • Intent mix shifted toward gift returns and policy edge cases.
  • Retrieval mapped holiday policy queries to irrelevant API docs, tanking adherence.
  • Evals drove seasonal routing and policy-first retrieval fixes.

FAQ

Where did the drift occur?

Drift concentrated in holiday-specific intents like gift receipts and no-payment returns.

Which metrics flagged the incident?

Policy adherence and context precision dropped sharply, alongside a spike in escalations and refunds.

What changed after evals?

The system added seasonal intent routing, lexical fallback, and expanded golden sets for holiday queries.

About this case study

  • Composite archetype: Synthesized from multiple production deployments to illustrate real-world eval workflows.
  • Data: Numbers are illustrative and anonymized to show how evals surfaced the issue and quantified impact.
  • System: Retail support RAG assistant handling order, return, and policy questions.

System Snapshot

  • Traffic: ~1.2M sessions/month, 64% self-serve deflection target.
  • Stack: hybrid retrieval (BM25 + embeddings), top-6 chunks, policy-first prompt.
  • Knowledge base: 18k policy and catalog articles, updated weekly.
  • Evaluation: weekly golden set (1,800 queries) + 5% shadow eval on live traffic.
  • Failure cost: incorrect return guidance triggers refunds, chargebacks, and CS escalations.
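The hybrid retrieval in the snapshot (BM25 + embeddings, top-6 chunks) can be sketched with reciprocal rank fusion, one common way to combine a lexical and a dense ranking. This is an illustrative sketch only; the function name, the k constant, and the document IDs are assumptions, not the production implementation.

```python
# Sketch of hybrid retrieval via reciprocal rank fusion (RRF).
# Doc IDs and rankings are made up for illustration.

def rrf_fuse(bm25_ranking, embedding_ranking, k=60, top_n=6):
    """Fuse two ranked lists of doc IDs; higher RRF score = better."""
    scores = {}
    for ranking in (bm25_ranking, embedding_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every doc it ranks.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

bm25 = ["policy_returns", "policy_gift_receipt", "api_receipts", "catalog_faq"]
dense = ["api_receipts", "api_webhooks", "policy_returns", "catalog_faq"]
fused = rrf_fuse(bm25, dense)
print(fused)
```

A design note relevant to the incident: when the dense ranking goes wrong (as it did for "gift receipt" queries), RRF still lets the lexical ranking pull policy documents back into the top results, which is the same intuition behind the lexical fallback added later.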

The Timeline

  • Oct 18: normal ops, 92% accuracy
  • Nov 24: Black Friday, 85%
  • Nov 25: incident, 68%; alerts fire, drift discovered
  • Nov 27: fix deployed, 94%

Accuracy collapsed in under 24 hours as the query mix flipped from order status to gift returns, exchanges, and holiday policy edge cases.

Where the Drift Happened (and Why It Mattered)

Drift was not global; it was concentrated in a new intent cluster. The semantic retriever mapped “gift receipt” and “no original payment method” to developer docs about API receipt objects, starving the LLM of policy context.

Intent Cluster                        Share (Oct)   Share (Black Friday)   Impact
Order tracking                        58%           22%                    Medium
Returns & exchanges                   12%           34%                    High
Gift receipts / no payment method     3%            18%                    Critical
Promo / refund policy edge cases      7%            14%                    High
Fraud / chargeback                    2%            7%                     Critical

Query embedding drift score (Jensen-Shannon divergence on intent clusters) jumped from 0.08 to 0.42, crossing the 0.25 alert threshold.
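A minimal sketch of how a drift score like this can be computed, assuming intent-share distributions as input. The toy shares below are taken from the table above plus a catch-all "other" bucket, and only illustrate the shape of the calculation; they will not reproduce the production 0.42 figure, which came from embedding-cluster distributions.

```python
# Illustrative Jensen-Shannon divergence between two intent distributions.
import math

def js_divergence(p, q):
    """JS divergence (base 2, bounded in [0, 1]) between discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Order tracking, returns, gift receipts, promo edge cases, fraud, other.
october = [0.58, 0.12, 0.03, 0.07, 0.02, 0.18]
black_friday = [0.22, 0.34, 0.18, 0.14, 0.07, 0.05]

score = js_divergence(october, black_friday)
ALERT_THRESHOLD = 0.25  # threshold from the case study
print(f"drift={score:.2f}, alert={score > ALERT_THRESHOLD}")
```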

Eval Design That Caught It

  • Shadow eval sampled 5% of live traffic and scored with a policy adherence rubric.
  • Severity weighting: refund and fraud intents carried 5x penalty vs order tracking.
  • Retrieval checks: context precision >= 0.70 required for “policy” intents.
  • Alerts triggered on any 10+ point drop in adherence or 2x escalation rate.
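The severity weighting and alert rules above can be sketched as follows; the intent names, weight values other than the stated 5x, and record format are assumptions for illustration.

```python
# Severity-weighted adherence scoring and the alert rule from the bullets:
# refund/fraud carry 5x penalty, alert on a 10+ point drop or 2x escalations.

SEVERITY = {"order_tracking": 1.0, "returns": 3.0, "refund": 5.0, "fraud": 5.0}

def weighted_adherence(records):
    """records: (intent, adherent: bool) pairs from the shadow eval sample."""
    total = sum(SEVERITY.get(intent, 1.0) for intent, _ in records)
    earned = sum(SEVERITY.get(intent, 1.0) for intent, ok in records if ok)
    return 100.0 * earned / total

def should_alert(baseline_pct, current_pct, baseline_esc, current_esc):
    # Fire on a 10+ point adherence drop OR a 2x escalation-rate spike.
    return (baseline_pct - current_pct >= 10.0) or (current_esc >= 2 * baseline_esc)
```

With these rules, the incident numbers (adherence 88% to 41%, escalations 9% to 24%) trip both conditions.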

Measurement Methodology (How This Would Be Measured)

  • Weekly golden set + shadow eval on 5% of live traffic, stratified by intent.
  • Policy adherence scored with a rubric: correct policy, correct exception handling, citations present.
  • Retrieval metrics computed on labeled relevance: precision@k and recall@k by intent slice.
  • Operational impact derived from support logs: escalations, refunds, and CSAT deltas.
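The retrieval metrics in the methodology, precision@k and recall@k sliced by intent, can be sketched like this. The data shapes (retrieved ID lists, labeled relevant-ID sets per query) are assumptions about how the judgments would be stored.

```python
# Per-intent precision@k / recall@k over labeled relevance judgments.
from collections import defaultdict

def precision_at_k(retrieved, relevant, k):
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k):
    if not relevant:
        return 0.0
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)

def by_intent_slice(examples, k=6):
    """examples: list of (intent, retrieved_ids, relevant_id_set) tuples.
    Returns {intent: (mean precision@k, mean recall@k)}."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for intent, retrieved, relevant in examples:
        s = sums[intent]
        s[0] += precision_at_k(retrieved, relevant, k)
        s[1] += recall_at_k(retrieved, relevant, k)
        s[2] += 1
    return {i: (p / n, r / n) for i, (p, r, n) in sums.items()}
```

Slicing by intent is what makes the drift visible: a global context-precision average can look acceptable while the "gift receipts" slice has already collapsed.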

Impact Dashboard

[Image: mock dashboard showing KPIs and intent mix during the query drift incident. Illustrative dashboard, synthetic data.]

Metric               Baseline (Oct)   Drift (Nov 25)   After Fix (Nov 27)   Change during drift
Answer Accuracy      92%              68%              94%                  -24 pts
Policy Adherence     88%              41%              90%                  -47 pts
Context Precision    0.74             0.38             0.76                 -0.36
Escalation Rate      9%               24%              8%                   2.7x increase
Refund Exceptions    1.2%             4.8%             n/a                  4x increase
CSAT (Post-Chat)     4.6              3.2              n/a                  -1.4

What Changed Because of Evals

  1. Added a seasonal intent detector to route “gift receipt” and “no payment method” queries to a dedicated policy pack.
  2. Introduced hybrid retrieval with a lexical fallback for policy keywords.
  3. Inserted a “policy citation required” guardrail; responses without citations auto-escalated.
  4. Expanded the golden set with 400 holiday queries and added a “refund exception” category.
  5. Lowered alert thresholds during known seasonal shifts.
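Fixes 1 and 3 can be sketched together: a keyword-based seasonal intent detector that routes matching queries to a dedicated policy pack, and a citation guardrail that auto-escalates uncited answers. Every name here (patterns, pack label, citation format) is a hypothetical illustration, not the production code.

```python
# Hypothetical seasonal intent router + policy-citation guardrail.
import re

SEASONAL_PATTERNS = [
    r"\bgift receipt\b",
    r"\bno (original )?payment method\b",
    r"\bholiday return\b",
]

def route(query):
    """Send seasonal policy queries to a dedicated policy pack."""
    if any(re.search(p, query.lower()) for p in SEASONAL_PATTERNS):
        return "holiday_policy_pack"
    return "default_retrieval"

def enforce_citations(answer):
    """Guardrail: responses without a policy citation auto-escalate."""
    if not re.search(r"\[policy:[\w-]+\]", answer):
        return {"action": "escalate", "reason": "missing policy citation"}
    return {"action": "respond"}
```
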

Key takeaway

Drift wasn’t about model quality — it was about intent mix. Evals isolated the failure cluster, making the fix targeted and fast.