6a. Drift Monitoring

Production data drifts constantly. Your evals must drift with it. Detect shifts before they impact users.


Quick Answer

Drift monitoring detects shifts in query distribution and embedding space before they cause quality regressions.

TL;DR

  • Track cluster distribution shifts and similarity trends.
  • Alert on drift thresholds tied to risk.
  • Trigger retraining, guardrails, or human review.

FAQ

What is a drift score?

A drift score measures how much the current query distribution diverges from a baseline, often with Jensen-Shannon divergence.
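As a minimal sketch of that definition, Jensen-Shannon divergence between two cluster distributions can be computed with SciPy; the distributions here are illustrative, not taken from a real system:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

baseline_dist = np.array([0.5, 0.3, 0.2])  # share of queries per cluster (baseline window)
current_dist = np.array([0.3, 0.3, 0.4])   # share of queries per cluster (today)

# jensenshannon returns the JS *distance* (square root of the divergence),
# bounded and symmetric, which makes it a convenient drift score
score = float(jensenshannon(baseline_dist, current_dist))
print(round(score, 3))
```

A score of 0 means the distributions are identical; larger values mean users are asking measurably different kinds of questions.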

How often should drift be checked?

Daily for high-volume systems; weekly for lower-volume systems or stable domains.

What actions follow an alert?

Investigate the slice, update the dataset, and decide between retrieval fixes, retraining, or escalation.

Types of Drift in AI Systems

Drift is the silent killer of production AI. Your model works great on day one, then slowly degrades as the world changes around it.

Query Distribution Drift

Users start asking different questions than your training data anticipated. Common in RAG systems.

Example: Month 3 users ask for edge cases instead of basic refunds.

Embedding / Semantic Drift

Your embedding model's understanding of terms doesn't match new content. Common in multi-tenant systems.

Example: "Inspection" meaning shifts across client domains.

1. Detecting Query Distribution Drift

The key insight: you can't just compare today's queries to yesterday's. You need to compare the distribution of semantic clusters.

drift_detection.py
import numpy as np
from scipy.spatial.distance import jensenshannon

def compute_drift_score(baseline: dict, new_queries: list[str]) -> float:
    # `baseline` carries the fitted artifacts: {"encoder", "model", "dist"}
    # 1. Encode new queries with the same encoder used for the baseline
    new_embeddings = baseline["encoder"].encode(new_queries)

    # 2. Assign each query to its nearest baseline cluster
    new_clusters = baseline["model"].predict(new_embeddings)
    new_dist = np.bincount(new_clusters, minlength=len(baseline["dist"])) / len(new_clusters)

    # 3. Jensen-Shannon divergence between baseline and current distributions
    return jensenshannon(baseline["dist"], new_dist)
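Building the baseline itself is straightforward. Here is a hedged sketch using scikit-learn's KMeans over pre-computed query embeddings; the cluster count and the random embeddings standing in for real queries are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_baseline(embeddings: np.ndarray, n_clusters: int = 8) -> dict:
    # Fit clusters on a trusted window of historical query embeddings
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(embeddings)
    # Baseline cluster distribution: fraction of queries per cluster
    dist = np.bincount(model.labels_, minlength=n_clusters) / len(model.labels_)
    return {"model": model, "dist": dist}

# Illustrative only: random vectors stand in for real query embeddings
rng = np.random.default_rng(0)
baseline = build_baseline(rng.normal(size=(500, 32)))
print(baseline["dist"].sum())  # the distribution sums to 1.0
```

Rebuild the baseline only from a vetted window (e.g. a month of reviewed traffic), otherwise drift in the baseline itself silently resets your alerts.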

2. Detecting Embedding Drift

Monitor retrieval confidence. A gradual decline in average Top-K cosine similarity indicates the embedding model is losing its grasp on the domain.

Flow: Query → Embedding Model → Vector Store. Queries with similarity scores < 0.7 are flagged for review; a declining similarity trend triggers retraining.
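One way to operationalize this flow, sketched with illustrative names and thresholds: log the mean top-K retrieval similarity per day, then flag a sustained decline with a simple linear fit.

```python
import numpy as np

def similarity_trend(daily_mean_topk_sims: list[float]) -> float:
    """Slope of mean top-K cosine similarity over time (negative = declining)."""
    y = np.asarray(daily_mean_topk_sims)
    x = np.arange(len(y))
    slope, _intercept = np.polyfit(x, y, 1)
    return float(slope)

def needs_retraining(daily_sims: list[float], slope_threshold: float = -0.005) -> bool:
    # A decline steeper than the threshold triggers a retraining alert;
    # the threshold value is an assumption to tune per system
    return similarity_trend(daily_sims) < slope_threshold

sims = [0.82, 0.81, 0.79, 0.78, 0.76, 0.75, 0.73]  # illustrative one-week window
print(needs_retraining(sims))  # True: similarity is trending down
```

A linear fit over a window is deliberately crude; it smooths day-to-day noise while still catching the gradual decline that single-day comparisons miss.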

Alerting & Action

  • Escalate low-confidence queries to human support.
  • Feed human corrections back into the knowledge base.
  • Trigger retraining alerts when drift score > 0.15.
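The bullets above can be wired into a small dispatcher. The 0.7 similarity floor and the 0.15 drift threshold come from this section; the function and action names are illustrative:

```python
def route_alerts(drift_score: float, mean_topk_sim: float) -> list[str]:
    """Map monitoring signals to the actions listed above."""
    actions = []
    if mean_topk_sim < 0.7:
        # Low retrieval confidence: hand the query to a human
        actions.append("escalate_to_human_support")
    if drift_score > 0.15:
        # Query distribution has shifted: raise a retraining alert
        actions.append("trigger_retraining_alert")
    if not actions:
        actions.append("no_action")
    return actions

print(route_alerts(drift_score=0.18, mean_topk_sim=0.65))
```

Human corrections produced by the escalation path should flow back into the knowledge base, which closes the loop this section describes.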