Case Study Archetype: The Silent Killer — Embedding Drift

A composite case study where the world changed, but the vectors stayed the same.


Quick Answer

This composite case study shows how tech-domain terminology drift broke retrieval and how re-embedding restored performance.

TL;DR

  • New acronyms pushed tech queries out-of-distribution.
  • Recall dropped while other domains stayed stable.
  • Re-embedding and lexical fallbacks restored recall.

FAQ

How do you detect embedding drift?

Monitor centroid shift, OOD ratios, and retrieval metrics like Recall@10 by domain slice.

Why was drift domain-specific?

Tech terms evolved faster than the embedding model vocabulary, while other domains stayed stable.

What fixed the issue?

Re-embedding the corpus with a newer model and adding lexical fallbacks for emerging terms.

About this case study

  • Composite archetype: Synthesized from multiple production deployments to illustrate real-world eval workflows.
  • Data: Numbers are illustrative and anonymized to show drift impact and remediation.
  • System: Multi-tenant enterprise search across finance, healthcare, tech, and retail.

System Snapshot

  • Index: 60M passages, 4 languages, 12 verticals.
  • Embedding model: 2021-vintage general encoder, updated quarterly.
  • Evaluation: monthly retrieval benchmark (1,000 queries per domain).
  • Primary metric: Recall@10 with a 0.70 minimum per domain.
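The primary metric above can be sketched as a per-domain Recall@10 computation. This is a minimal illustration, not the production pipeline; all function and variable names are assumptions.

```python
# Sketch: average Recall@10 per domain slice, assuming each benchmark
# query carries its domain label, ranked result IDs, and judged-relevant IDs.
from collections import defaultdict


def recall_at_k(retrieved, relevant, k=10):
    """Fraction of judged-relevant docs found in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def recall_by_domain(results, k=10):
    """results: iterable of (domain, retrieved_ids, relevant_ids) tuples."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for domain, retrieved, relevant in results:
        sums[domain] += recall_at_k(retrieved, relevant, k)
        counts[domain] += 1
    return {d: sums[d] / counts[d] for d in sums}
```

Slicing the average by domain (rather than reporting one global number) is what makes tech-specific drift visible at all.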

Drift by Domain

Retrieval Performance Drop (3 Months)

Domain       Drop
Finance      5%
Healthcare   2%
Tech         22%
Retail       4%

Drift was isolated to tech queries. New model names and acronyms (“GPT-4”, “Llama-3”, “QLoRA”) were effectively out-of-vocabulary for the 2021 encoder, so their embeddings landed in uninformative regions and nearest-neighbor retrieval collapsed.

Where the Drift Happened

Embedding drift was measured as the month-over-month cosine distance between query centroids. Tech query centroids shifted from 0.08 to 0.34 (alert threshold 0.20), and Recall@10 dropped 22 points, while every other domain stayed within 5 points of baseline.

Metric (Tech)          Baseline   Drift Month   Threshold
Recall@10              0.78       0.56          0.70
MRR                    0.61       0.44          0.55
Query centroid shift   0.08       0.34          0.20
OOD query ratio        9%         28%           15%
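The two drift signals in the table, centroid shift and OOD query ratio, could be computed along these lines. The thresholds come from the case study; the similarity floor and all names are illustrative assumptions.

```python
# Sketch of the two embedding-drift signals tracked in the case study.
import numpy as np


def centroid_shift(prev_embs: np.ndarray, curr_embs: np.ndarray) -> float:
    """Cosine distance between mean query embeddings of two months."""
    a = prev_embs.mean(axis=0)
    b = curr_embs.mean(axis=0)
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - float(cos)


def ood_ratio(query_embs: np.ndarray, index_embs: np.ndarray,
              sim_floor: float = 0.5) -> float:
    """Share of queries whose best cosine match in the index is below sim_floor.

    sim_floor is an assumed cutoff; in practice it would be calibrated
    against judged in-distribution queries.
    """
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = index_embs / np.linalg.norm(index_embs, axis=1, keepdims=True)
    best = (q @ d.T).max(axis=1)
    return float((best < sim_floor).mean())
```

A monthly job would compare these values against the 0.20 and 15% thresholds and page the owning team when either is exceeded.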

Retrieval Health Dashboard

Illustrative dashboard (synthetic data) showing embedding drift impact by domain.

Metric               Baseline → Drift Month   Change
Tech Recall@10       0.78 → 0.56              -22 pts
Finance Recall@10    0.81 → 0.76              -5 pts
Tech OOD Ratio       9% → 28%                 3.1x
Top-1 Doc Match      63% → 39%                -24 pts
Search Escalations   4.4% → 11.8%             +7.4 pts
Latency (p95)        720ms → 980ms            +260ms

Measurement Methodology (How This Would Be Measured)

  • Monthly benchmark set per domain with relevance judgments from SMEs.
  • Drift measured by centroid shift and OOD query ratio on embedding space.
  • Release gate: no domain may fall below Recall@10 = 0.70.
  • Escalation metrics pulled from search fallback logs and user feedback surveys.
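The release gate in the methodology above reduces to a simple check: no domain slice may fall below the Recall@10 floor. A minimal sketch, with the 0.70 floor taken from the case study and everything else assumed:

```python
# Hypothetical release gate: block the release if any domain's
# Recall@10 falls below the per-domain floor.
RECALL_FLOOR = 0.70


def release_gate(recall_by_domain: dict, floor: float = RECALL_FLOOR):
    """Return (passed, failing_domains) for a monthly benchmark run.

    recall_by_domain maps domain name -> average Recall@10.
    """
    failing = {d: r for d, r in recall_by_domain.items() if r < floor}
    return (not failing, failing)
```

In the drift month this gate would have failed on the tech slice alone, which is exactly the signal a global average hides.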

What Changed Because of Evals

  1. Re-embedded the tech corpus with a newer encoder tuned on recent terminology.
  2. Added a lexical fallback (BM25) for new product names and acronyms.
  3. Created a “tech drift” slice (300 queries) that must pass before release.
  4. Added a rolling alias dictionary for emerging terms.

Metric (Tech)     Drift Month   After Fix
Recall@10         0.56          0.82
MRR               0.44          0.65
OOD Query Ratio   28%           12%
Escalation Rate   11.8%         4.1%

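Remediation steps 2 and 4 (lexical fallback plus a rolling alias dictionary) could be wired together as below. This is a sketch under assumed interfaces: the alias entries, the score floor, and the idea that both search functions return (doc_id, score) pairs are all illustrative, and a real fallback would use a BM25 implementation rather than these stubs.

```python
# Sketch: alias expansion for emerging terms, plus a lexical fallback
# when the dense retriever's top score is weak.
ALIASES = {  # rolling alias dictionary for emerging terms (illustrative)
    "gpt-4": ["gpt4", "gpt 4"],
    "qlora": ["quantized lora"],
}


def expand_query(query: str) -> list:
    """Return the query plus alias variants for any emerging terms it contains."""
    lowered = query.lower()
    variants = [query]
    for term, aliases in ALIASES.items():
        if term in lowered:
            variants += [lowered.replace(term, alias) for alias in aliases]
    return variants


def retrieve(query, dense_search, bm25_search, min_dense_score=0.5):
    """Dense-first retrieval with a lexical fallback for weak matches.

    Both search functions are assumed to return (doc_id, score) pairs,
    best first; min_dense_score is an assumed confidence floor.
    """
    results = dense_search(query)
    if not results or results[0][1] < min_dense_score:
        results = bm25_search(expand_query(query))
    return results
```

The fallback only fires on low-confidence dense results, so stable domains keep their original latency profile while OOV-heavy tech queries get exact lexical matching.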
Key takeaway

Embedding drift is often domain-specific. Without domain-sliced evals, the system looked “healthy” while a critical segment was collapsing.