Case Study Archetype: The Silent Killer — Embedding Drift

A composite case study where the world changed, but the vectors stayed the same.


Quick Answer

This composite case study shows how tech-domain terminology drift broke retrieval and how re-embedding restored performance.

TL;DR

  • New acronyms pushed tech queries out-of-distribution.
  • Recall dropped while other domains stayed stable.
  • Re-embedding and lexical fallbacks restored recall.

FAQ

How do you detect embedding drift?

Monitor centroid shift, OOD ratios, and retrieval metrics like Recall@10 by domain slice.

Why was drift domain-specific?

Tech terms evolved faster than the embedding model vocabulary, while other domains stayed stable.

What fixed the issue?

Re-embedding the corpus with a newer model and adding lexical fallbacks for emerging terms.

About this case study

  • Composite archetype: Synthesized from multiple production deployments to illustrate real-world eval workflows.
  • Data: Numbers are illustrative and anonymized to show drift impact and remediation.
  • System: Multi-tenant enterprise search across finance, healthcare, tech, and retail.

System Snapshot

  • Index: 60M passages, 4 languages, 12 verticals.
  • Embedding model: 2021-vintage general encoder, updated quarterly.
  • Evaluation: monthly retrieval benchmark (1,000 queries per domain).
  • Primary metric: Recall@10 with a 0.70 minimum per domain.
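The primary metric above can be sketched as a per-domain Recall@10 computation. This is a minimal illustration, not the production pipeline; all function and variable names are assumptions.

```python
# Sketch: average Recall@10 per domain slice, assuming each benchmark
# query carries its domain label, ranked result IDs, and judged-relevant IDs.
from collections import defaultdict


def recall_at_k(retrieved, relevant, k=10):
    """Fraction of judged-relevant docs found in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def recall_by_domain(results, k=10):
    """results: iterable of (domain, retrieved_ids, relevant_ids) tuples."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for domain, retrieved, relevant in results:
        sums[domain] += recall_at_k(retrieved, relevant, k)
        counts[domain] += 1
    return {d: sums[d] / counts[d] for d in sums}
```

Slicing the average by domain (rather than reporting one global number) is what makes tech-specific drift visible at all.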

Drift by Domain

Retrieval Performance Drop (3 Months)

Domain       Drop
Finance      5%
Healthcare   2%
Tech         22%
Retail       4%

Drift was isolated to tech queries. New model names and acronyms (“GPT-4”, “Llama-3”, “QLoRA”) were effectively out-of-vocabulary for the 2021 encoder, so their embeddings landed in uninformative regions and nearest-neighbor retrieval collapsed.

Where the Drift Happened

Embedding drift was measured as the month-over-month cosine distance between query centroids. Tech query centroids shifted from 0.08 to 0.34 (alert threshold 0.20), and Recall@10 dropped 22 points, while every other domain stayed within 5 points of baseline.

Metric (Tech)          Baseline   Drift Month   Threshold
Recall@10              0.78       0.56          0.70
MRR                    0.61       0.44          0.55
Query centroid shift   0.08       0.34          0.20
OOD query ratio        9%         28%           15%
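The two drift signals in the table, centroid shift and OOD query ratio, could be computed along these lines. The thresholds come from the case study; the similarity floor and all names are illustrative assumptions.

```python
# Sketch of the two embedding-drift signals tracked in the case study.
import numpy as np


def centroid_shift(prev_embs: np.ndarray, curr_embs: np.ndarray) -> float:
    """Cosine distance between mean query embeddings of two months."""
    a = prev_embs.mean(axis=0)
    b = curr_embs.mean(axis=0)
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - float(cos)


def ood_ratio(query_embs: np.ndarray, index_embs: np.ndarray,
              sim_floor: float = 0.5) -> float:
    """Share of queries whose best cosine match in the index is below sim_floor.

    sim_floor is an assumed cutoff; in practice it would be calibrated
    against judged in-distribution queries.
    """
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = index_embs / np.linalg.norm(index_embs, axis=1, keepdims=True)
    best = (q @ d.T).max(axis=1)
    return float((best < sim_floor).mean())
```

A monthly job would compare these values against the 0.20 and 15% thresholds and page the owning team when either is exceeded.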

Retrieval Health Dashboard

Illustrative dashboard (synthetic data) showing embedding drift impact by domain.

Metric               Baseline → Drift Month   Change
Tech Recall@10       0.78 → 0.56              -22 pts
Finance Recall@10    0.81 → 0.76              -5 pts
Tech OOD Ratio       9% → 28%                 3.1x
Top-1 Doc Match      63% → 39%                -24 pts
Search Escalations   4.4% → 11.8%             +7.4 pts
Latency (p95)        720ms → 980ms            +260ms

Measurement Methodology (How This Would Be Measured)

  • Monthly benchmark set per domain with relevance judgments from SMEs.
  • Drift measured by centroid shift and OOD query ratio on embedding space.
  • Release gate: no domain may fall below Recall@10 = 0.70.
  • Escalation metrics pulled from search fallback logs and user feedback surveys.
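The release gate in the methodology above reduces to a simple check: no domain slice may fall below the Recall@10 floor. A minimal sketch, with the 0.70 floor taken from the case study and everything else assumed:

```python
# Hypothetical release gate: block the release if any domain's
# Recall@10 falls below the per-domain floor.
RECALL_FLOOR = 0.70


def release_gate(recall_by_domain: dict, floor: float = RECALL_FLOOR):
    """Return (passed, failing_domains) for a monthly benchmark run.

    recall_by_domain maps domain name -> average Recall@10.
    """
    failing = {d: r for d, r in recall_by_domain.items() if r < floor}
    return (not failing, failing)
```

In the drift month this gate would have failed on the tech slice alone, which is exactly the signal a global average hides.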

What Changed Because of Evals

  1. Re-embedded the tech corpus with a newer encoder tuned on recent terminology.
  2. Added a lexical fallback (BM25) for new product names and acronyms.
  3. Created a “tech drift” slice (300 queries) that must pass before release.
  4. Added a rolling alias dictionary for emerging terms.

Metric (Tech)     Drift Month   After Fix
Recall@10         0.56          0.82
MRR               0.44          0.65
OOD Query Ratio   28%           12%
Escalation Rate   11.8%         4.1%

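Remediation steps 2 and 4 (lexical fallback plus a rolling alias dictionary) could be wired together as below. This is a sketch under assumed interfaces: the alias entries, the score floor, and the idea that both search functions return (doc_id, score) pairs are all illustrative, and a real fallback would use a BM25 implementation rather than these stubs.

```python
# Sketch: alias expansion for emerging terms, plus a lexical fallback
# when the dense retriever's top score is weak.
ALIASES = {  # rolling alias dictionary for emerging terms (illustrative)
    "gpt-4": ["gpt4", "gpt 4"],
    "qlora": ["quantized lora"],
}


def expand_query(query: str) -> list:
    """Return the query plus alias variants for any emerging terms it contains."""
    lowered = query.lower()
    variants = [query]
    for term, aliases in ALIASES.items():
        if term in lowered:
            variants += [lowered.replace(term, alias) for alias in aliases]
    return variants


def retrieve(query, dense_search, bm25_search, min_dense_score=0.5):
    """Dense-first retrieval with a lexical fallback for weak matches.

    Both search functions are assumed to return (doc_id, score) pairs,
    best first; min_dense_score is an assumed confidence floor.
    """
    results = dense_search(query)
    if not results or results[0][1] < min_dense_score:
        results = bm25_search(expand_query(query))
    return results
```

The fallback only fires on low-confidence dense results, so stable domains keep their original latency profile while OOV-heavy tech queries get exact lexical matching.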
Key takeaway

Embedding drift is often domain-specific. Without domain-sliced evals, the system looked “healthy” while a critical segment was collapsing.