AI systems drift silently. Providers update models without notice; your prompts accumulate complexity; user input patterns shift. Each drift type affects quality differently and requires different detection. This post is the taxonomy of drift and specific detection mechanisms for each type.

Three drift types

Model drift: provider updates silently, behavior shifts. Prompt drift: system prompts grow, context accumulates. Data drift: user input patterns shift.

Model drift

Providers update models. Sometimes announced, sometimes quietly. Behavior can shift even with same model string.

Detection. Run golden eval set hourly. Compare pass rate over time. Sudden drop = likely model change.

Mitigation. Where possible, pin model versions explicitly. Most providers expose dated versions (e.g., claude-opus-4-20260101). Use these for stability.

Testing before migration. When provider announces new model version, test it against golden evals before switching. Measure impact; decide to migrate or hold.

Prompt drift

System prompts accumulate. Each feature addition, each edge case, each customer request adds context. Soon system prompt is 3000 tokens when it was 300.

Cost impact. Token growth translates to cost growth. Also latency; larger prompts slower to process.

Quality impact. Long system prompts sometimes dilute instructions; model attention divides among too many things.

Detection. Token-per-request metric over time. Sudden jumps correlate with deploys; gradual growth indicates drift.

Mitigation. Quarterly prompt audits. Remove deprecated instructions. Consolidate. Test quality impact of changes.

Data drift

User input distribution shifts. Users find new use cases; use patterns evolve; bot activity changes composition.

Quality implications. Model trained on one distribution; quality degrades on shifted distribution.

Detection. Input feature distribution monitoring. Sample and analyze periodically. Semantic clustering shows pattern changes.

Mitigation. Retrain or fine-tune on newer data. Update RAG indexes. Add new use cases to eval set.

RAG-specific drift

Index staleness. Retrieval pulls from indexed content; when content updates, index may lag.

Retrieval quality metric. What percent of queries retrieve relevant results? Track over time.

Document distribution. New document types being indexed; taxonomy evolves.

Embedding model drift. If embedding model updates, indexed embeddings need regenerating; otherwise retrieval quality degrades.

Cache drift

Cached responses from old model version. User hits cache, gets old-model response. Feels stale.

Cache invalidation policy. Clear cache on model update; or TTL cache entries; or don't cache across model versions.

See prompt cache warming post for related caching patterns.

Monitoring infrastructure

Baseline metrics. What does normal look like? Quality, cost, latency, user engagement over trailing period.

Anomaly detection. Statistical or ML-based detection of metrics out of expected range.

Alerting thresholds. Tuned carefully. Too sensitive = alert fatigue; too lax = missed drift.

Dashboards. Drift-specific views. Token trends; input distribution; retrieval quality; eval pass rates.

Response patterns

Model drift detected. Test against other model versions; decide to pin or migrate; communicate changes if user-visible.

Prompt drift detected. Audit prompt; consolidate; A/B test replacement; monitor.

Data drift detected. Update training or fine-tuning; update RAG index; expand eval set with new use cases.

Organizational practices

Quarterly drift review. Dedicated meeting. Review metrics; identify ongoing drift; plan mitigations.

Drift ownership. Model drift: AI platform team. Prompt drift: product teams. Data drift: ML/data team. Clear responsibilities.

Incident review. When drift causes user-visible problems, postmortem to improve detection and response.

AI drift detection: catching silent model changes

Model drift

Prompt drift

Data drift

RAG-specific drift

Cache drift

Monitoring infrastructure

Response patterns

Organizational practices

Continue the thread.

AI quality monitoring in production

Why evaluation infrastructure matters more than prompts

Want to talk about this?