Quick Answer
This composite case study shows how tech-domain terminology drift broke retrieval and how re-embedding restored performance.
TL;DR
- New acronyms pushed tech queries out-of-distribution.
- Tech Recall@10 dropped 22 points while other domains stayed stable.
- Re-embedding and lexical fallbacks restored recall.
FAQ
How do you detect embedding drift?
Monitor centroid shift, OOD ratios, and retrieval metrics like Recall@10 by domain slice.
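One of these signals, the out-of-distribution (OOD) query ratio, can be sketched in a few lines. This is an illustrative implementation, not the production system's code; the 0.5 distance threshold and pure-Python vector math are assumptions for the sketch.

```python
import math

def cosine_distance(a, b):
    """1 minus cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def ood_ratio(query_embs, domain_centroid, threshold=0.5):
    """Fraction of query embeddings farther than `threshold` from the
    domain's baseline centroid. A rising ratio flags drift in that slice."""
    flagged = sum(1 for q in query_embs
                  if cosine_distance(q, domain_centroid) > threshold)
    return flagged / len(query_embs)
```

In practice the centroid would be the mean embedding of a baseline month's queries for one domain slice, recomputed per slice so a tech-only shift is not averaged away.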
Why was drift domain-specific?
Tech terms evolved faster than the embedding model vocabulary, while other domains stayed stable.
What fixed the issue?
Re-embedding the corpus with a newer model and adding lexical fallbacks for emerging terms.
About this case study
- Composite archetype: Synthesized from multiple production deployments to illustrate real-world eval workflows.
- Data: Numbers are illustrative and anonymized to show drift impact and remediation.
- System: Multi-tenant enterprise search across finance, healthcare, tech, and retail.
System Snapshot
- Index: 60M passages, 4 languages, 12 verticals.
- Embedding model: 2021-vintage general encoder, updated quarterly.
- Evaluation: monthly retrieval benchmark (1,000 queries per domain).
- Primary metric: Recall@10 with a 0.70 minimum per domain.
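The primary metric can be computed as a hit rate: the share of benchmark queries with at least one judged-relevant passage in the top k results. This is a minimal sketch of that definition; the case study does not specify its exact recall formula, so treat this as one common interpretation.

```python
def recall_at_k(retrieved, relevant, k=10):
    """Hit-rate Recall@k: fraction of queries whose top-k retrieved list
    contains at least one relevant passage.

    retrieved: {query_id: ranked list of passage ids}
    relevant:  {query_id: set of judged-relevant passage ids}
    """
    hits = 0
    for qid, ranked in retrieved.items():
        if relevant.get(qid, set()) & set(ranked[:k]):
            hits += 1
    return hits / len(retrieved)
```

Computed per domain slice (1,000 queries each), this is the value gated against the 0.70 floor.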
Drift by Domain
Drift was isolated to tech queries. New acronyms (“GPT-4”, “Llama-3”, “QLoRA”) and product names were effectively out-of-vocabulary for the 2021 encoder, collapsing their nearest-neighbor structure in the embedding space.
Where the Drift Happened
Embedding drift was measured as the month-over-month cosine distance between query centroids. Tech query centroids shifted from 0.08 to 0.34 (alert threshold 0.20), and Recall@10 dropped 22 points, while other domains stayed within 5 points of baseline.
| Metric (Tech) | Baseline | Drift Month | Threshold |
|---|---|---|---|
| Recall@10 | 0.78 | 0.56 | 0.70 |
| MRR | 0.61 | 0.44 | 0.55 |
| Query centroid shift | 0.08 | 0.34 | 0.20 |
| OOD query ratio | 9% | 28% | 15% |
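The centroid-shift row above can be reproduced with a small sketch: average each month's query embeddings into a centroid, then take the cosine distance between the two centroids. The pure-Python math is illustrative; a production pipeline would vectorize this.

```python
import math

def centroid(vectors):
    """Element-wise mean of a list of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def centroid_shift(prev_month_embs, this_month_embs):
    """Month-over-month cosine distance between query centroids.
    A jump past the alert threshold (0.20 here) signals drift."""
    a, b = centroid(prev_month_embs), centroid(this_month_embs)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)
```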
Retrieval Health Dashboard
Illustrative dashboard (synthetic data).
Measurement Methodology (How This Would Be Measured)
- Monthly benchmark set per domain with relevance judgments from SMEs.
- Drift measured by centroid shift and OOD query ratio in embedding space.
- Release gate: no domain may fall below Recall@10 = 0.70.
- Escalation metrics pulled from search fallback logs and user feedback surveys.
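The release gate in the list above reduces to a per-domain floor check. A minimal sketch, assuming recall scores arrive as a domain-to-score mapping:

```python
def release_gate(recall_by_domain, floor=0.70):
    """Return (passed, failing_domains): the release is blocked if any
    domain slice falls below the Recall@10 floor."""
    failing = {d: r for d, r in recall_by_domain.items() if r < floor}
    return (not failing, failing)
```

Gating per slice is the point: a global average over twelve verticals would have hidden the tech collapse entirely.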
What Changed Because of Evals
- Re-embedded the tech corpus with a newer encoder tuned on recent terminology.
- Added a lexical fallback (BM25) for new product names and acronyms.
- Created a “tech drift” slice (300 queries) that must pass before release.
- Added a rolling alias dictionary for emerging terms.
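The fallback and alias-dictionary changes can be sketched as a simple router: expand the query through the rolling alias dictionary, then send queries containing terms unknown to the encoder to lexical (BM25-style) retrieval instead of the dense index. The routing rule, vocabulary check, and `dense_search`/`lexical_search` callables are hypothetical simplifications, not the deployed design.

```python
def expand_with_aliases(query_terms, alias_dict):
    """Rolling alias dictionary: append known equivalents of emerging terms."""
    expanded = list(query_terms)
    for t in query_terms:
        expanded.extend(alias_dict.get(t.lower(), []))
    return expanded

def route_query(query_terms, encoder_vocab, dense_search, lexical_search):
    """Fall back to lexical retrieval when the query contains terms the
    embedding model has never seen; otherwise use the dense index."""
    if any(t.lower() not in encoder_vocab for t in query_terms):
        return lexical_search(query_terms)
    return dense_search(query_terms)
```

A production variant would likely blend dense and lexical scores rather than hard-route, but the sketch shows why OOV acronyms stop collapsing recall: they hit exact-match retrieval instead of a stale embedding space.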
| Metric (Tech) | Drift Month | After Fix |
|---|---|---|
| Recall@10 | 0.56 | 0.82 |
| MRR | 0.44 | 0.65 |
| OOD Query Ratio | 28% | 12% |
| Escalation Rate | 11.8% | 4.1% |
Embedding drift is often domain-specific. Without domain-sliced evals, the system looked “healthy” while a critical segment was collapsing.