eazyware
Engineering·October 23, 2023·12 min read

SRE patterns for AI workloads

SLOs for probabilistic outputs, error budgets for quality regression, capacity planning for bursty AI workloads. SRE adapted for AI.

Kushal R.
Engineering lead

SRE (site reliability engineering) principles apply to AI workloads, but they need AI-specific adaptations: SLOs for probabilistic outputs, error budgets for quality regression, capacity planning for bursty inference, and graceful degradation across model paths. This post covers how SRE craft adapts to AI systems in 2026.

Figure (SLO types): SLOs for AI systems. Availability: request success rate, 99.5% typical for AI, retries and fallbacks count. Latency: TTFT, p95, p99, per-endpoint SLOs, end-to-end streaming. Quality: eval pass rate, user feedback signals, regression alerts. What's AI-specific: quality SLOs (burn error budget when evals regress), provider-dependent availability (your SLO inherits theirs), and fallback model paths as a first-class SLO element.

SLO categories for AI

Availability. Request success rate. 99.5% is typical for AI (lower than enterprise SaaS, because provider dependencies contribute). Whether retries and fallbacks count against the SLO depends on how you measure, so pick a convention and document it.
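
A minimal sketch of how the measurement convention changes the number, assuming a hypothetical per-request record with `ok` and `served_by_fallback` fields (these names are illustrative, not from any library). The same traffic meets a 99.5% target under one convention and misses it under the other.

```python
from dataclasses import dataclass


@dataclass
class RequestOutcome:
    ok: bool                  # did the user end up with a usable response?
    served_by_fallback: bool  # True if a non-primary model path answered


def availability(outcomes, count_fallback_as_success=True):
    """Fraction of requests counted as good under the chosen convention."""
    if not outcomes:
        return 1.0  # assumption: no traffic is treated as meeting the SLO
    good = 0
    for o in outcomes:
        if not o.ok:
            continue
        if o.served_by_fallback and not count_fallback_as_success:
            continue
        good += 1
    return good / len(outcomes)


# Illustrative traffic: 990 primary successes, 6 fallback successes, 4 failures.
outcomes = (
    [RequestOutcome(ok=True, served_by_fallback=False)] * 990
    + [RequestOutcome(ok=True, served_by_fallback=True)] * 6
    + [RequestOutcome(ok=False, served_by_fallback=False)] * 4
)
lenient = availability(outcomes)
strict = availability(outcomes, count_fallback_as_success=False)
```

Here `lenient` is 996/1000 and `strict` is 990/1000, so the choice of convention decides whether the 99.5% target is met.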

Latency. TTFT (time-to-first-token), p95, p99 per endpoint. Different features have different latency budgets. See latency budgeting post.
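
As a rough illustration of per-endpoint latency SLOs, the sketch below computes nearest-rank p95 and p99 over TTFT samples and compares p95 against a per-endpoint budget. The endpoint names, sample values, and budgets are made up for the example.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def ttft_report(samples, budgets_ms):
    """Group (endpoint, ttft_ms) samples and check p95 against each endpoint's budget."""
    by_endpoint = defaultdict(list)
    for endpoint, ttft_ms in samples:
        by_endpoint[endpoint].append(ttft_ms)
    report = {}
    for endpoint, values in by_endpoint.items():
        p95 = percentile(values, 95)
        budget = budgets_ms.get(endpoint)
        report[endpoint] = {
            "p95": p95,
            "p99": percentile(values, 99),
            # Endpoints without a budget are reported but never flagged.
            "within_budget": budget is None or p95 <= budget,
        }
    return report


# Hypothetical samples: 100 chat requests at 1..100 ms, 3 summarize requests.
samples = [("chat", ms) for ms in range(1, 101)]
samples += [("summarize", ms) for ms in (200, 300, 400)]
budgets = {"chat": 90, "summarize": 500}
report = ttft_report(samples, budgets)
```

In production the percentiles would normally come from a metrics backend rather than raw lists; the point is only that each endpoint is judged against its own budget.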

Quality. Unique to AI. Eval pass rate on a continuously run benchmark set. User feedback signals (thumbs, complaints). Model regression alerts.

Cost. Not traditionally an SLO, but AI systems benefit from treating cost as a measured reliability dimension. Sustained cost above target indicates a problem worth investigating.

Quality SLOs: AI-specific

Eval-based SLO. A continuous eval suite runs against the production model. Example target: 90% pass rate on eval set X.

When quality degrades. Model provider updated the model; prompts drifted; data distribution shifted; cache got corrupted. Many causes, same symptom.

Burn rate alerts. A quality SLO that burns its budget quickly triggers an alert. Like availability burn rate, but for model quality.
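
A sketch of the burn-rate arithmetic applied to eval results. Burn rate is the observed failure fraction divided by the failure fraction the SLO allows, so with a 90% pass-rate target a window with 25% failures burns at roughly 2.5x. The two-window check and the 2.0 threshold are assumptions chosen for illustration, not values from this post.

```python
def burn_rate(failed, total, slo_target):
    """Observed failure fraction divided by the failure fraction the SLO allows."""
    if total == 0:
        return 0.0
    budget = 1.0 - slo_target
    return (failed / total) / budget


def should_alert(short_window, long_window, slo_target=0.90, threshold=2.0):
    """Alert only when both windows burn fast, so a brief blip does not page anyone.

    Each window is a (failed_evals, total_evals) tuple.
    """
    short = burn_rate(short_window[0], short_window[1], slo_target)
    long_ = burn_rate(long_window[0], long_window[1], slo_target)
    return short >= threshold and long_ >= threshold


# 30 of the last 100 evals failed, and 220 of the last 1000.
sustained = should_alert((30, 100), (220, 1000))
# Same short-window spike, but the long window is close to normal.
blip = should_alert((30, 100), (120, 1000))
```

With these inputs `sustained` is true and `blip` is false: the short window looks the same in both cases, but only the first shows the regression persisting.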

Error budget policy. When quality SLO burns through budget, what happens? Deployment freeze until investigation; rollback to previous model; heightened monitoring.

Graceful degradation

Model fallback paths. Primary (Opus) down → Sonnet. Sonnet down → Haiku. Haiku down → cached responses or error message.
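
The chain above can be sketched as an ordered list of model calls that is walked until one succeeds. `ProviderError`, the stand-in model functions, and the dictionary cache are all hypothetical; a real client would raise its own exception types.

```python
class ProviderError(Exception):
    """Raised by a model call when the provider is unavailable (hypothetical)."""


def call_with_fallback(prompt, model_chain, cache=None):
    """Try each (name, call) pair in order, then the cache, then an error message."""
    for name, call in model_chain:
        try:
            return {"source": name, "text": call(prompt)}
        except ProviderError:
            continue  # this path is down; try the next one
    if cache is not None and prompt in cache:
        return {"source": "cache", "text": cache[prompt]}
    return {"source": "error", "text": "The assistant is temporarily unavailable."}


def down(prompt):
    """Stand-in for a model whose provider is currently failing."""
    raise ProviderError("simulated outage")


def haiku(prompt):
    """Stand-in for the smallest model, which is still up."""
    return f"haiku answer to: {prompt}"


chain = [("opus", down), ("sonnet", down), ("haiku", haiku)]
result = call_with_fallback("reset my password", chain)

all_down = [("opus", down), ("sonnet", down), ("haiku", down)]
cached = call_with_fallback("reset my password", all_down, cache={"reset my password": "cached steps"})
failed = call_with_fallback("reset my password", all_down)
```

Returning the `source` alongside the text makes it possible to count how often each path serves traffic, which is what an SLO on degradation needs.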

Feature degradation. When the full AI path is unavailable, degrade the feature gracefully: structured mode when chat is unavailable; canned responses when generation is unavailable.

Customer communication during degradation. Be transparent about what is and isn't working; it helps users work around limitations.

First-class SLO element. Graceful degradation should be tested, measured, and alerted on, not assumed.

Incident response

Severity levels. SEV-1 (major outage), SEV-2 (significant degradation), SEV-3 (minor issue). Escalation criteria should be clear for each level.

Incident commander. A single person coordinates during the incident: not one of the engineers debugging, but someone orchestrating.

Communication channels. Dedicated incident channel (Slack, Teams). Status page updates. Customer comms in pre-authored templates.

Postmortems. Blameless, timelined, with action items. See on-call post.

Capacity planning

Forecast traffic. Historical patterns, seasonal factors, business forecasts. Model scenarios (best case, worst case, base).

Token budget. Per-user or per-feature token targets; helps forecast costs and capacity.

Reserved vs burst capacity. Provider commitments lock in discounts; burst capacity costs more. Right mix depends on traffic variability.
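
A toy cost model for the reserved-versus-burst trade-off. It assumes reserved capacity is billed every hour whether used or not and that overflow is billed at a higher burst price; the prices and traffic profiles are invented for illustration.

```python
def capacity_cost(hourly_tokens, reserved_per_hour, reserved_price, burst_price):
    """Total cost of a traffic profile for a given hourly reservation.

    Assumption: the reservation is paid for every hour regardless of use,
    and any tokens above it are billed at the burst price.
    """
    reserved_cost = reserved_per_hour * reserved_price * len(hourly_tokens)
    overflow = sum(max(0, t - reserved_per_hour) for t in hourly_tokens)
    return reserved_cost + overflow * burst_price


def best_reservation(hourly_tokens, candidates, reserved_price, burst_price):
    """Pick the candidate reservation with the lowest total cost."""
    return min(
        candidates,
        key=lambda r: capacity_cost(hourly_tokens, r, reserved_price, burst_price),
    )


# Two hypothetical days with the same total volume (240M tokens), in millions per hour.
steady = [10] * 24
bursty = [2] * 20 + [50] * 4
candidates = [0, 2, 10, 50]

# Hypothetical prices per million tokens: reserved 1, burst 2.
steady_choice = best_reservation(steady, candidates, reserved_price=1, burst_price=2)
bursty_choice = best_reservation(bursty, candidates, reserved_price=1, burst_price=2)
```

Under these made-up prices the steady profile is cheapest with a reservation of 10 and the bursty one with a reservation of 2, even though both days carry the same total volume.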

See capacity planning post for detailed framework.

Chaos engineering for AI

Intentional failure injection. Simulate provider outages in staging. Verify fallback paths work.
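
One way to simulate a provider outage in a staging test is to wrap the model call so that it fails with a configurable probability, then assert that the fallback path answered. This is a sketch under assumptions: `InjectedOutage`, `with_fault_injection`, and the lambda stand-ins are hypothetical names, not part of any chaos-engineering tool.

```python
import random


class InjectedOutage(Exception):
    """Simulated provider failure (hypothetical exception type)."""


def with_fault_injection(call, failure_rate, rng):
    """Wrap a model call so that it fails with the given probability."""
    def wrapped(prompt):
        if rng.random() < failure_rate:
            raise InjectedOutage("injected provider outage")
        return call(prompt)
    return wrapped


def answer(prompt, primary, fallback):
    """Return (path, text), using the fallback when the primary call fails."""
    try:
        return "primary", primary(prompt)
    except InjectedOutage:
        return "fallback", fallback(prompt)


rng = random.Random(42)  # seeded so the drill is reproducible

# Full outage: every primary call fails, so the fallback must serve.
broken_primary = with_fault_injection(lambda p: "full answer", failure_rate=1.0, rng=rng)
outage_path, outage_text = answer("hello", broken_primary, lambda p: "degraded answer")

# No injection: the primary path should serve as usual.
healthy_primary = with_fault_injection(lambda p: "full answer", failure_rate=0.0, rng=rng)
healthy_path, healthy_text = answer("hello", healthy_primary, lambda p: "degraded answer")
```

Intermediate failure rates are useful for checking partial outages, where the share of requests served by the fallback should track the injected rate.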

Game days. Team practices incident response for AI-specific scenarios. Model provider outage, cost explosion, quality regression.

Disaster recovery drills. Annual full exercises; quarterly component tests. See DR post.

Observability

Distributed tracing. Across tool calls, retries, cache hits, model invocations. Understand actual request flow.

Model-specific metrics. Which model served each request? What latency, cost, cache hit? Feeds capacity and cost analysis.

Structured logging. Every inference with request ID, user ID, model version, prompt version. Debugging requires this.
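
A minimal sketch of one JSON log line per inference using only the Python standard library. The field names follow the list above; the example values and the extra `latency_ms` field are illustrative assumptions.

```python
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("inference")


def log_inference(user_id, model_version, prompt_version, latency_ms, request_id=None):
    """Emit one JSON line per inference and return the record that was logged."""
    record = {
        "event": "inference",
        "request_id": request_id or str(uuid.uuid4()),
        "user_id": user_id,
        "model_version": model_version,
        "prompt_version": prompt_version,
        "latency_ms": latency_ms,
    }
    # sort_keys keeps the field order stable, which helps when diffing log lines.
    logger.info(json.dumps(record, sort_keys=True))
    return record


record = log_inference(
    user_id="u-123",
    model_version="model-2024-01",  # hypothetical version string
    prompt_version="support-v7",    # hypothetical prompt identifier
    latency_ms=412,
    request_id="req-001",
)
```

Logging the prompt version next to the model version is what lets a quality regression be traced to one or the other.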

Dashboards. Executive dashboard (business metrics, cost, SLO status); engineering dashboard (detailed metrics, alerts, traces).

Change management

Canary deployments. Small percentage of traffic to new version; monitor; roll forward or back.
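
The monitor-then-decide step could look roughly like this: compare the canary's error rate with the baseline's and return a decision. The minimum sample size and the allowed error-rate increase are assumed thresholds for the example; a real rollout would likely also compare latency and eval pass rate.

```python
def canary_decision(baseline, canary, max_error_increase=0.01, min_requests=500):
    """Decide whether to wait, roll back, or promote a canary.

    baseline and canary are (errors, total_requests) tuples.
    """
    base_errors, base_total = baseline
    canary_errors, canary_total = canary
    if canary_total < min_requests or base_total == 0:
        return "wait"  # not enough traffic to judge yet
    base_rate = base_errors / base_total
    canary_rate = canary_errors / canary_total
    if canary_rate - base_rate > max_error_increase:
        return "rollback"
    return "promote"


too_early = canary_decision(baseline=(10, 10_000), canary=(1, 100))
regressed = canary_decision(baseline=(10, 10_000), canary=(30, 1_000))
healthy = canary_decision(baseline=(10, 10_000), canary=(2, 1_000))
```

The three calls return "wait", "rollback", and "promote" respectively.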

Feature flags. Toggle AI features per user, percentage, segment. Gradual rollout reduces risk.
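
A sketch of percentage and segment targeting, assuming users are assigned to stable buckets by hashing the flag name and user ID. The function names and the segment label are hypothetical; a flag service would normally handle this.

```python
import hashlib


def bucket(user_id, flag_name):
    """Map a user to a stable bucket in [0, 100) for a given flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(user_id, flag_name, rollout_percent, allow_segments=(), user_segment=None):
    """Enable for allow-listed segments, otherwise for the first rollout_percent buckets."""
    if user_segment is not None and user_segment in allow_segments:
        return True
    return bucket(user_id, flag_name) < rollout_percent


users = [f"user-{i}" for i in range(1000)]
enabled_at_10 = {u for u in users if is_enabled(u, "ai-summary", 10)}
enabled_at_50 = {u for u in users if is_enabled(u, "ai-summary", 50)}
```

Because the bucket depends only on the flag name and user ID, raising the percentage from 10 to 50 keeps every user who was already enabled, so nobody flips back and forth during a gradual rollout.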

Change freezes during high-risk periods. Black Friday, financial quarter close, major events.

Model updates treated as changes. New model = change management process = canary + rollback plan.

Team structure

Dedicated SRE for AI. Some companies embed SRE in product teams; others centralize. Either works if coordination is good.

Shared on-call across AI and traditional services. Engineers on rotation need broad capability.

Platform engineering layer. Shared infrastructure, tooling, patterns. Individual teams don't reinvent basics.
