eazyware
Engineering·September 18, 2023·11 min read

AI quality monitoring in production

Continuous eval, user feedback signals, quality regression detection. The monitoring patterns that catch quality problems before customers do.

Kushal R.
Engineering lead

Quality monitoring for AI systems catches regressions before customers do. Continuous evals, user feedback signals, and implicit behavior signals each tell a different part of the story. This post describes the monitoring architecture we use to maintain quality in production AI systems, and the alert patterns that fire when something needs attention rather than on noise.

Quality signals
[Figure: AI quality monitoring signals. Panels: Continuous evals; User feedback; Implicit signals; Alerts and response.]
Continuous evals: golden set hourly, regression alerts. User feedback: thumbs, complaints, edit signals. Implicit: completion rate, abandonment, retries.

Continuous evals

Golden eval set. 100-1000 test cases representing real use. Run hourly or more frequently against production.

Regression detection. Any drop in eval pass rate triggers investigation. 5%+ drop pages on-call.
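The check described above can be sketched as a comparison of pass rates between a baseline run and the current run. This is a minimal illustration, not our production code: the names `check_regression` and `RegressionResult` are hypothetical, and it assumes the 5% threshold means an absolute drop of five percentage points (a relative drop would need a different comparison).

```python
from dataclasses import dataclass


@dataclass
class RegressionResult:
    baseline_rate: float
    current_rate: float
    drop: float   # absolute drop in pass rate (0.05 == 5 points)
    action: str   # "ok", "investigate", or "page"


def pass_rate(results: list[bool]) -> float:
    """Fraction of golden-set cases that passed."""
    if not results:
        raise ValueError("eval run produced no results")
    return sum(results) / len(results)


def check_regression(baseline: list[bool], current: list[bool],
                     page_threshold: float = 0.05) -> RegressionResult:
    """Compare the current run to a baseline run of the same golden set."""
    base, cur = pass_rate(baseline), pass_rate(current)
    # Round so float noise does not hide a drop that sits exactly on the threshold.
    drop = round(base - cur, 6)
    if drop >= page_threshold:
        action = "page"          # assumed: 5+ point drop pages on-call
    elif drop > 0:
        action = "investigate"   # any drop triggers investigation
    else:
        action = "ok"
    return RegressionResult(base, cur, drop, action)
```

In practice the baseline would likely be a trailing window of recent runs rather than a single run, so that one noisy hour does not become the reference point.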

Eval set evolution. Add cases as new failure modes emerge. Retire cases that never fail (no longer informative).
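One way to find retirement candidates, sketched under assumptions: `history` is a hypothetical mapping from case ID to that case's pass/fail results across past runs, and `min_runs` is an arbitrary guard so that recently added cases are not retired before they have had a chance to fail.

```python
def retirement_candidates(history: dict[str, list[bool]],
                          min_runs: int = 500) -> list[str]:
    """Return IDs of cases that have enough history and have never failed."""
    return sorted(
        case_id
        for case_id, results in history.items()
        if len(results) >= min_runs and all(results)
    )
```

Treat the output as a list for human review, not an automatic delete: a case that never fails may still be guarding against a failure mode worth keeping covered.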

See our eval infrastructure post for a deeper dive.

User feedback signals

Thumbs up/down on AI outputs. Low-friction feedback; produces volume. Minimal context per signal; aggregate patterns matter.

Ratings on interactions. Multi-point (1-5 star) captures more nuance. Lower response rate than thumbs.

Edit/regenerate signals. User edits AI output or regenerates — implicit dissatisfaction with first output.

Complaint tickets. Explicit complaints to support about AI quality. Low volume, high signal.

Aggregation and trending. Daily, weekly, monthly rollups. Alert on significant drops.
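A daily rollup of thumbs feedback can be sketched as follows. The event shape `(day, thumbs_up)` and both function names are assumptions for illustration. The second function counts consecutive day-over-day declines, which is one simple way to express a "trending downward for N days" rule.

```python
from collections import defaultdict
from datetime import date


def daily_positive_rate(events: list[tuple[date, bool]]) -> dict[date, float]:
    """Roll (day, thumbs_up) events up into a per-day thumbs-up rate."""
    ups: dict[date, int] = defaultdict(int)
    totals: dict[date, int] = defaultdict(int)
    for day, thumbs_up in events:
        totals[day] += 1
        ups[day] += int(thumbs_up)
    return {day: ups[day] / totals[day] for day in sorted(totals)}


def declining_days(rates: dict[date, float]) -> int:
    """Number of consecutive day-over-day declines ending at the latest day."""
    values = [rates[day] for day in sorted(rates)]
    streak = 0
    # Walk backwards from the most recent pair of days.
    for prev, cur in zip(reversed(values[:-1]), reversed(values[1:])):
        if cur < prev:
            streak += 1
        else:
            break
    return streak
```

A production version would also want a minimum sample size per day, since a thumbs-up rate computed from a handful of votes is mostly noise.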

Implicit signals

Task completion rate. Users complete the task they started. Drop in completion indicates quality or UX problem.

Session abandonment. Users leave mid-task. Where? Why? Heatmaps and funnel analysis surface patterns.
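Completion and abandonment can both be computed from per-session step logs. The sketch below assumes each session is recorded as an ordered list of step names and that one step (`final_step`) marks task completion; the step names and function names are hypothetical.

```python
from collections import Counter


def completion_rate(sessions: list[list[str]], final_step: str) -> float:
    """Fraction of non-empty sessions that reached the final step."""
    started = [s for s in sessions if s]
    if not started:
        return 0.0
    return sum(final_step in s for s in started) / len(started)


def abandonment_by_step(sessions: list[list[str]],
                        final_step: str) -> dict[str, float]:
    """Among sessions that never completed, the share that ended at each step."""
    abandoned = [s for s in sessions if s and final_step not in s]
    if not abandoned:
        return {}
    last_steps = Counter(s[-1] for s in abandoned)
    return {step: n / len(abandoned) for step, n in last_steps.most_common()}
```

The step with the largest share in `abandonment_by_step` answers the "where?" question; the "why?" still needs session review or user research.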

Retry/rephrase frequency. User tries the same or similar query multiple times. Often means first response wasn't useful.
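A rough way to detect rephrases is string similarity between consecutive queries in a session. This sketch uses `difflib.SequenceMatcher` from the Python standard library; the 0.6 threshold and the function names are assumptions, and embedding-based similarity would likely catch paraphrases that character matching misses.

```python
from difflib import SequenceMatcher


def is_rephrase(prev_query: str, next_query: str, threshold: float = 0.6) -> bool:
    """True when two consecutive queries are textually similar."""
    a, b = prev_query.lower().strip(), next_query.lower().strip()
    return SequenceMatcher(None, a, b).ratio() >= threshold


def rephrase_rate(session_queries: list[str], threshold: float = 0.6) -> float:
    """Fraction of consecutive query pairs in a session that look like rephrases."""
    pairs = list(zip(session_queries, session_queries[1:]))
    if not pairs:
        return 0.0
    hits = sum(is_rephrase(a, b, threshold) for a, b in pairs)
    return hits / len(pairs)
```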

Conversion rate. For commercial features, did the AI interaction lead to conversion (purchase, signup, upgrade)?

Alert design

Eval regression >5% = page. Requires immediate investigation. Model provider may have updated; prompt may need fixing.

User feedback trend downward for 2+ days = ticket. Slower signal; warrants investigation but not emergency.

Implicit signals trending negative = dashboard review weekly. Background signal; act on it in aggregate, not on individual data points.

Calibration. Alert thresholds tuned over time. Too sensitive = alert fatigue; too lax = missed regressions.
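The three tiers above can be written down as a small routing function with the thresholds exposed as parameters, which makes calibration a config change rather than a code change. The names here are hypothetical, and the sketch treats the eval threshold as inclusive (a drop of exactly five points pages).

```python
from enum import Enum


class Response(Enum):
    PAGE = "page on-call"
    TICKET = "open ticket"
    WEEKLY_REVIEW = "weekly dashboard review"
    NONE = "no action"


def route_alert(eval_drop: float,
                feedback_declining_days: int,
                implicit_trend_negative: bool,
                *,
                page_drop: float = 0.05,
                ticket_days: int = 2) -> Response:
    """Pick the most urgent response that any signal warrants."""
    if eval_drop >= page_drop:
        return Response.PAGE
    if feedback_declining_days >= ticket_days:
        return Response.TICKET
    if implicit_trend_negative:
        return Response.WEEKLY_REVIEW
    return Response.NONE
```

Checking the most urgent tier first means a single incident produces one page rather than a page, a ticket, and a dashboard flag.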

Root cause analysis

Provider model update. Most common cause of sudden eval regression. Check provider changelogs; test with pinned previous version.

Prompt change. Your prompt edits can cause quality changes. Version control prompts; canary new versions.
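Canarying a prompt version can be as simple as deterministic bucketing on a stable identifier, so each user sees a consistent version for the duration of the rollout. A minimal sketch, assuming prompts are addressed by version strings and users by an ID; `prompt_version_for` and its parameters are hypothetical names.

```python
import hashlib


def prompt_version_for(user_id: str, stable: str, canary: str,
                       canary_percent: int = 5) -> str:
    """Route a fixed percentage of users to the canary prompt version."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Map the hash to a bucket in [0, 100); the same user always gets the same bucket.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return canary if bucket < canary_percent else stable
```

Logging the chosen version alongside every eval result and feedback event is what makes the canary useful: without it, a quality drop cannot be attributed to the new prompt.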

Data distribution shift. User input patterns change over time. The kinds of inputs the model was trained on become less common.

RAG quality. For retrieval-augmented systems, retrieved content quality impacts generation quality. Index drift?

Cache corruption. Cached responses from an old model version can contaminate current results. Invalidation hygiene matters.
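One common way to get invalidation for free is to include the model and prompt versions in the cache key, so a version change naturally misses the old entries. This is a sketch under that assumption; the key fields and the function name `cache_key` are illustrative.

```python
import hashlib
import json


def cache_key(model_version: str, prompt_version: str, user_input: str) -> str:
    """Build a cache key that changes whenever the model or prompt version changes."""
    payload = json.dumps(
        {"model": model_version, "prompt": prompt_version, "input": user_input},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

This only helps when the model version is actually known. If a provider updates a model behind an unchanged alias, the key does not change, which is another argument for pinning explicit versions.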

Tools

Observability platforms (Datadog, Honeycomb, New Relic) with custom metrics for AI quality.

Specialized AI observability (LangSmith, Arize, Humanloop, Braintrust). Purpose-built for LLM quality.

Eval platforms. LangSmith, Braintrust, Humanloop again. Continuous eval integration.

Custom dashboards. Often supplement or replace vendors. Team ownership and iteration speed are the main arguments for building in-house.

Response patterns

Alert fires → investigate causes → root cause found → fix deployed → monitor recovery.

Rollback ready. Pinned previous model version; previous prompt version accessible. Fast rollback capability essential.
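Rollback readiness mostly comes down to treating the model version and prompt version as one pinned, versioned unit. A minimal in-memory sketch, assuming a release is just that pair; `Release` and `ReleaseHistory` are hypothetical names, and a real system would persist this history rather than hold it in a list.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Release:
    model_version: str    # pinned provider model identifier
    prompt_version: str   # version-controlled prompt identifier


@dataclass
class ReleaseHistory:
    releases: list[Release] = field(default_factory=list)

    def deploy(self, release: Release) -> None:
        self.releases.append(release)

    @property
    def active(self) -> Release:
        if not self.releases:
            raise RuntimeError("nothing deployed yet")
        return self.releases[-1]

    def rollback(self) -> Release:
        """Drop the active release and return to the previous one."""
        if len(self.releases) < 2:
            raise RuntimeError("no previous release to roll back to")
        self.releases.pop()
        return self.active
```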

Postmortem process. Even for resolved quality issues. Learn; improve; reduce recurrence.

Customer communication. For major quality incidents, customer-facing comms warranted. Status page update, blog post for significant events.

Read next
Why evaluation infrastructure matters more than prompts
AI drift detection: catching silent model changes
AI usage analytics: what to measure, how to act on it
Tags: quality, monitoring, evals