eazyware
Engineering·August 21, 2023·10 min read

Chaos engineering for AI systems

Failure injection for provider outages, model quality regression, cost anomalies. Chaos practices that build confidence in AI system resilience.

Kushal R.
Engineering lead

Chaos engineering for AI systems goes beyond random process killing. AI-specific scenarios (provider outage, quality regression injection, cost explosion simulation) need deliberate design. Game days and postmortems build team capability for real incidents. This post covers AI-specific chaos practices and the scenarios worth rehearsing.

Figure: Chaos scenarios for AI
Provider outage: Anthropic / OpenAI down; failover to secondary; monitor graceful degradation.
Quality regression: inject bad model response; verify detection and rollback; alerting fire test.
Cost explosion: simulate 10x traffic; rate limits engage; alerts fire, no runaway.
Game day practices: quarterly in staging, annual in production with safeguards; rotate which team member handles response to build broad capability; postmortem each game day to surface gaps in runbooks and automation.
Provider outage: force failover. Quality regression: inject bad responses. Cost explosion: simulate traffic spike; verify rate limits.

Provider outage scenarios

Block the provider endpoint in staging or a small prod slice. Observe: do requests fall back to the secondary? Does response quality hold up on the fallback?

Measure fallback latency. Time from primary failure to secondary serving traffic. Should be seconds, not minutes.

Customer visibility. Do users see raw error messages, or graceful degradation? Only a test under real failure conditions answers this.

Runbook validation. Does on-call know what to do? A game day exposes the gaps.
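A failover drill of this kind can be rehearsed in miniature before touching real traffic. The sketch below is a minimal illustration, not a production client: `FailoverClient`, `ProviderDown`, `make_flaky`, and `run_drill` are hypothetical names, and the providers are stand-in functions rather than real SDK calls. It blocks the primary, checks that the secondary serves the request, and records the time from the first observed primary failure to the first successful fallback response.

```python
import time
from typing import Callable, Optional


class ProviderDown(Exception):
    """Raised when the (simulated) provider endpoint is unreachable."""


class FailoverClient:
    """Tries the primary provider first, then falls back to the secondary."""

    def __init__(self, primary: Callable[[str], str], secondary: Callable[[str], str]):
        self.primary = primary
        self.secondary = secondary
        self.first_failure_at: Optional[float] = None
        self.fallback_latency_s: Optional[float] = None

    def complete(self, prompt: str) -> str:
        try:
            return self.primary(prompt)
        except ProviderDown:
            # Remember when the primary was first seen failing.
            if self.first_failure_at is None:
                self.first_failure_at = time.monotonic()
            reply = self.secondary(prompt)
            # Time from first primary failure to first secondary success.
            if self.fallback_latency_s is None:
                self.fallback_latency_s = time.monotonic() - self.first_failure_at
            return reply


def make_flaky(provider: Callable[[str], str], is_blocked: Callable[[], bool]):
    """Wrap a provider so a chaos switch can make it fail on demand."""
    def call(prompt: str) -> str:
        if is_blocked():
            raise ProviderDown("primary blocked by chaos drill")
        return provider(prompt)
    return call


def run_drill():
    """Block the primary, send a request, unblock, and report what served it."""
    blocked = {"on": False}
    primary = make_flaky(lambda p: f"primary:{p}", lambda: blocked["on"])
    secondary = lambda p: f"secondary:{p}"
    client = FailoverClient(primary, secondary)

    before = client.complete("hello")   # primary healthy
    blocked["on"] = True
    during = client.complete("hello")   # primary blocked, secondary should serve
    blocked["on"] = False
    after = client.complete("hello")    # primary restored
    return before, during, after, client.fallback_latency_s
```

In a real drill the primary would fail by timing out rather than raising instantly, so the measured fallback latency would be dominated by the client's timeout and retry settings. That is exactly the number worth checking against the "seconds, not minutes" target.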

Quality regression injection

Inject deliberately degraded responses into a small percentage of traffic. Observe: does eval monitoring detect them? Does alerting fire? How quickly?

Simulate prompt drift. Deploy deliberately worse prompt version; observe detection timing.

Simulate model version change. Route some traffic to different model; observe quality metric changes.

Validates monitoring. If the injected regression doesn't trigger alerts, monitoring needs work.
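To make "does alerting fire, and how quickly" concrete, here is a small simulation sketch. Everything in it is an assumption for illustration: `inject_degraded`, `score`, `RollingQualityAlert`, and `requests_until_alert` are hypothetical names, the eval is a toy check (non-empty answer), and the alert is a plain rolling mean compared against a threshold. A real setup would plug in its own eval and alerting pipeline.

```python
import random
from collections import deque
from typing import Optional


def inject_degraded(response: str, rate: float, rng: random.Random) -> str:
    """With probability `rate`, replace the response with a deliberately bad one."""
    if rng.random() < rate:
        return ""  # an empty answer stands in for a degraded model output
    return response


def score(response: str) -> float:
    """Toy eval: 1.0 for a non-empty answer, 0.0 otherwise."""
    return 1.0 if response.strip() else 0.0


class RollingQualityAlert:
    """Fires when the mean score over a full window drops below the threshold."""

    def __init__(self, window: int, threshold: float):
        self.scores = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value: float) -> bool:
        self.scores.append(value)
        window_full = len(self.scores) == self.scores.maxlen
        mean = sum(self.scores) / len(self.scores)
        return window_full and mean < self.threshold


def requests_until_alert(rate: float, window: int = 50, threshold: float = 0.9,
                         max_requests: int = 5000, seed: int = 7) -> Optional[int]:
    """Return how many requests pass before the alert fires, or None if it never does."""
    rng = random.Random(seed)
    alert = RollingQualityAlert(window, threshold)
    for i in range(1, max_requests + 1):
        reply = inject_degraded("a healthy answer", rate, rng)
        if alert.observe(score(reply)):
            return i
    return None
```

Running `requests_until_alert` at several injection rates shows where detection stops working. With these assumed defaults, a low injection rate keeps the rolling mean above the threshold most of the time, so the alert may fire late or not at all, which is the kind of monitoring gap the exercise is meant to expose.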

Cost explosion simulation

Simulate 10x traffic burst. Verify rate limits engage; no runaway spending.

Simulate unusual request patterns. Long context inputs, high-frequency requests from single user. Verify abuse detection and rate limiting.

Cost alerts. Verify that cost anomaly detection fires during the simulated spike.

Budget guardrails. Verify automatic throttling or shutoff when costs exceed budget.
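The guardrail checks above can be sketched as a small simulation. This is an illustration under stated assumptions, not a billing integration: `BudgetGuard` and `simulate_burst` are hypothetical names, costs are tracked in integer cents to avoid floating-point drift, the warning threshold of 80% is an assumed default, and per-request cost is a flat estimate.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class BudgetGuard:
    """Admits requests until the budget is spent; records warning and shutoff alerts."""
    budget_cents: int
    alert_percent: int = 80            # assumed: warn at 80% of budget
    spent_cents: int = 0
    alerts: List[str] = field(default_factory=list)

    def admit(self, estimated_cost_cents: int) -> bool:
        """Return True if the request may proceed, False if it would exceed the budget."""
        if self.spent_cents + estimated_cost_cents > self.budget_cents:
            if "shutoff" not in self.alerts:
                self.alerts.append("shutoff")
            return False
        self.spent_cents += estimated_cost_cents
        crossed = self.spent_cents * 100 >= self.alert_percent * self.budget_cents
        if crossed and "warning" not in self.alerts:
            self.alerts.append("warning")
        return True


def simulate_burst(guard: BudgetGuard, baseline_requests: int, multiplier: int,
                   cost_per_request_cents: int) -> Tuple[int, int]:
    """Send `baseline_requests * multiplier` requests; return (admitted, rejected)."""
    admitted = rejected = 0
    for _ in range(baseline_requests * multiplier):
        if guard.admit(cost_per_request_cents):
            admitted += 1
        else:
            rejected += 1
    return admitted, rejected
```

For example, `simulate_burst(BudgetGuard(budget_cents=1000), 100, 10, 5)` models a 10x burst over a baseline of 100 requests at 5 cents each. The pass criteria for the drill are that spend never exceeds the budget and that the warning alert precedes the shutoff.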

Game day practice

Scheduled exercises. Quarterly in staging; annual in production with safeguards.

Scenarios developed in advance. Scripted; reviewed by leadership; communicated to team.

Rotate responder. Different team member handles response each game day. Builds broad capability; no single point of failure.

Facilitator role. Someone manages the game day; not the responder. Observes; coaches; ensures learning captured.

Time-boxed. 2-4 hours typical. Realistic without consuming too much team time.

Postmortem each exercise

What worked. Automation that fired correctly; team response that was efficient.

What didn't. Detection gaps; runbook errors; communication breakdowns.

Action items. Specific fixes with owners and deadlines.

Knowledge sharing. Learnings shared broadly beyond participants.

Organizational buy-in

Chaos engineering requires leadership support. Without a clear mandate, the risk of disruption creates friction.

Communicate before exercises. Team, dependent teams, customer-facing teams. No surprises.

Start small. Early game days in low-risk environments. Build trust and capability before larger exercises.

Quantify value. Show incidents avoided or detected faster due to chaos practice.

Tooling

Chaos Monkey-style tools for AI. Not off-the-shelf; typically custom. They inject failure modes specific to AI systems.

Traffic shaping. Tools like Toxiproxy inject latency, errors, packet loss. Useful for provider failure simulation.
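As a rough illustration of driving Toxiproxy from a drill script, the sketch below only builds the HTTP requests and leaves sending them to the caller. The endpoint paths, the default API port 8474, and the toxic field names follow Toxiproxy's HTTP API as I understand it, so verify them against the version you run. The proxy name `llm_primary` and the upstream host are hypothetical, and the helper function names are my own.

```python
import json
import urllib.request

TOXIPROXY_API = "http://localhost:8474"  # Toxiproxy's default API address


def _post(url: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_proxy_request(name: str, listen: str, upstream: str) -> urllib.request.Request:
    """Request that creates a proxy in front of a provider endpoint."""
    body = {"name": name, "listen": listen, "upstream": upstream, "enabled": True}
    return _post(f"{TOXIPROXY_API}/proxies", body)


def latency_toxic_request(proxy: str, latency_ms: int, jitter_ms: int = 0) -> urllib.request.Request:
    """Request that adds latency to responses flowing back through the proxy."""
    body = {
        "type": "latency",
        "stream": "downstream",
        "toxicity": 1.0,
        "attributes": {"latency": latency_ms, "jitter": jitter_ms},
    }
    return _post(f"{TOXIPROXY_API}/proxies/{proxy}/toxics", body)


def timeout_toxic_request(proxy: str, timeout_ms: int) -> urllib.request.Request:
    """Request that makes connections through the proxy stall, simulating an outage."""
    body = {"type": "timeout", "attributes": {"timeout": timeout_ms}}
    return _post(f"{TOXIPROXY_API}/proxies/{proxy}/toxics", body)


# Hypothetical drill setup: the upstream host is a placeholder, not a real provider.
setup = create_proxy_request("llm_primary", "127.0.0.1:8443", "api.primary-provider.example:443")
slow = latency_toxic_request("llm_primary", latency_ms=2000, jitter_ms=200)
```

Each request can be sent with `urllib.request.urlopen(req)` once a Toxiproxy server is running and the application's provider client points at the proxy's listen address. Keeping construction separate from sending makes the drill script easy to review before anything is injected.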

Observability integration. Chaos scenarios must integrate with monitoring. Observe while chaos occurs.
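One lightweight way to tie a chaos run to monitoring is to emit explicit start and end markers that dashboards and postmortems can line up against metrics. The sketch below is an assumption about how that might look, not a specific tool's API: `chaos_window` is a hypothetical helper that writes structured records to a standard `logging` logger and to an in-memory list.

```python
import json
import logging
import time
import uuid
from contextlib import contextmanager

log = logging.getLogger("chaos")


def _emit(events: list, record: dict) -> None:
    """Keep the record in memory and write it as a structured log line."""
    events.append(record)
    log.info(json.dumps(record))


@contextmanager
def chaos_window(scenario: str, events: list):
    """Mark the start and end of a chaos scenario, even if the drill raises."""
    run_id = uuid.uuid4().hex
    start = time.time()
    _emit(events, {"event": "chaos_start", "scenario": scenario,
                   "run_id": run_id, "ts": start})
    try:
        yield run_id
    finally:
        end = time.time()
        _emit(events, {"event": "chaos_end", "scenario": scenario,
                       "run_id": run_id, "ts": end, "duration_s": end - start})
```

Wrapping the injection step in `with chaos_window("provider_outage", events):` gives every alert and latency spike during the drill a `run_id` and time range to be matched against, which makes the "what worked, what didn't" review much easier to ground in data.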

DR testing complements chaos. DR is planned failover; chaos is unplanned failure injection. See the DR post (AI disaster recovery: model servers, vector stores, and data).

Load testing and chaos are adjacent. Load test finds capacity limits; chaos injects failures. Combine for comprehensive testing.

Incident response training. Game days are fire drills for your team; they surface the same kinds of gaps that real incidents do.
