Quality monitoring for AI systems catches regressions before customers do. Continuous evals, user feedback signals, implicit behavior signals — each tells a different part of the story. This post is the monitoring architecture we use to maintain quality in production AI systems and the alert patterns that actually fire meaningfully.
Continuous evals
Golden eval set. 100-1000 test cases representing real use. Run hourly or more frequently against production.
Regression detection. Any drop in eval pass rate triggers investigation. 5%+ drop pages on-call.
Eval set evolution. Add cases as new failure modes emerge. Retire cases that never fail (no longer informative).
See eval infra post for deeper dive on eval infrastructure.
User feedback signals
Thumbs up/down on AI outputs. Low-friction feedback; produces volume. Minimal context per signal; aggregate patterns matter.
Ratings on interactions. Multi-point (1-5 star) captures more nuance. Lower response rate than thumbs.
Edit/regenerate signals. User edits AI output or regenerates — implicit dissatisfaction with first output.
Complaint tickets. Explicit complaints to support about AI quality. Low volume, high signal.
Aggregation and trending. Daily, weekly, monthly rollups. Alert on significant drops.
Implicit signals
Task completion rate. Users complete the task they started. Drop in completion indicates quality or UX problem.
Session abandonment. Users leave mid-task. Where? Why? Heatmaps and funnel analysis surface patterns.
Retry/rephrase frequency. User tries the same or similar query multiple times. Often means first response wasn't useful.
Conversion rate. For commercial features, did the AI interaction lead to conversion (purchase, signup, upgrade)?
Alert design
Eval regression >5% = page. Requires immediate investigation. Model provider may have updated; prompt may need fixing.
User feedback trend downward for 2+ days = ticket. Slower signal; warrants investigation but not emergency.
Implicit signals trending negative = dashboard review weekly. Background signal; acts in aggregate.
Calibration. Alert thresholds tuned over time. Too sensitive = alert fatigue; too lax = missed regressions.
Root cause analysis
Provider model update. Most common cause of sudden eval regression. Check provider changelogs; test with pinned previous version.
Prompt change. Your prompt edits can cause quality changes. Version control prompts; canary new versions.
Data distribution shift. User input patterns changing. Ones model was trained on less common now.
RAG quality. For retrieval-augmented systems, retrieved content quality impacts generation quality. Index drift?
Cache corruption. Cached responses from old model version can contaminate current. Invalidation hygiene matters.
Tools
Observability platforms (Datadog, Honeycomb, New Relic) with custom metrics for AI quality.
Specialized AI observability (LangSmith, Arize, Humanloop, Braintrust). Purpose-built for LLM quality.
Eval platforms. LangSmith, Braintrust, Humanloop again. Continuous eval integration.
Custom dashboards. Often supplement or replace vendors. Team ownership and iteration speed arguments for building in-house.
Response patterns
Alert fires → investigate causes → root cause found → fix deployed → monitor recovery.
Rollback ready. Pinned previous model version; previous prompt version accessible. Fast rollback capability essential.
Postmortem process. Even for resolved quality issues. Learn; improve; reduce recurrence.
Customer communication. For major quality incidents, customer-facing comms warranted. Status page update, blog post for significant events.