Chaos engineering for AI systems goes beyond random process killing. AI-specific scenarios (provider outage, quality regression injection, cost explosion simulation) need deliberate design. Game days and postmortems build team capability for real incidents. This post covers AI-specific chaos practices and the scenarios worth rehearsing.
Provider outage scenarios
Block the provider endpoint in staging or a small production slice. Observe: do requests fall back to the secondary provider? Does quality hold up?
Measure fallback latency. Time from primary failure to secondary serving traffic. Should be seconds, not minutes.
Customer visibility. Do users see raw error messages or graceful degradation? This can only be verified while the failure is actually happening.
Runbook validation. Does on-call know what to do? A game day exposes the gaps.
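The fallback checks above can be rehearsed in miniature before touching real traffic. What follows is a minimal sketch, not a production client: `FallbackClient`, `ProviderDown`, and the stub providers are hypothetical names, and the chaos switch is a plain boolean standing in for a blocked endpoint. It records the time between the first primary failure and the first reply served by the secondary.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


class ProviderDown(Exception):
    """Raised by a provider stub while the chaos switch blocks it."""


@dataclass
class FallbackClient:
    primary: Callable[[str], str]
    secondary: Callable[[str], str]
    clock: Callable[[], float] = time.monotonic  # injectable for tests
    first_failure_at: Optional[float] = None
    fallback_latency_s: Optional[float] = None

    def complete(self, prompt: str) -> str:
        try:
            return self.primary(prompt)
        except ProviderDown:
            # Remember when the primary was first seen failing.
            if self.first_failure_at is None:
                self.first_failure_at = self.clock()
        reply = self.secondary(prompt)
        # Record how long it took until the secondary first served a reply.
        if self.fallback_latency_s is None:
            self.fallback_latency_s = self.clock() - self.first_failure_at
        return reply


def make_primary(chaos: dict) -> Callable[[str], str]:
    """Hypothetical primary provider stub controlled by a chaos flag."""
    def primary(prompt: str) -> str:
        if chaos["block_primary"]:
            raise ProviderDown("primary endpoint blocked by game day")
        return "primary: " + prompt
    return primary


def secondary(prompt: str) -> str:
    """Hypothetical secondary provider stub."""
    return "secondary: " + prompt


chaos = {"block_primary": False}
client = FallbackClient(make_primary(chaos), secondary)
before = client.complete("hello")   # served by the primary
chaos["block_primary"] = True       # start the outage
after = client.complete("hello")    # should be served by the secondary
```

In a real exercise the same measurement would come from request logs or traces rather than an in-process timer, but the question is the same: how long after the first failure does the secondary start answering?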
Quality regression injection
Inject deliberately degraded responses into a small percentage of traffic. Observe: does eval monitoring detect them? Does alerting fire? How quickly?
Simulate prompt drift. Deploy a deliberately worse prompt version; observe how long detection takes.
Simulate a model version change. Route some traffic to a different model; observe how quality metrics change.
Validates monitoring. If chaos doesn't trigger alerts, monitoring needs work.
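The injection experiment above can be prototyped offline to get a feel for detection timing. This is a sketch under loud assumptions: the "degradation" is simply a truncated reply, the eval is a toy length check, and `inject_degradation`, `RollingEvalMonitor`, and `run_experiment` are hypothetical names rather than part of any monitoring product. The monitor alerts when the mean score over a rolling window drops below a threshold and reports how many requests it took.

```python
import random
from collections import deque
from typing import Callable, Deque, Optional


def inject_degradation(respond: Callable[[str], str], rate: float,
                       rng: random.Random) -> Callable[[str], str]:
    """Wrap a responder so that a fraction `rate` of replies is degraded."""
    def wrapped(prompt: str) -> str:
        reply = respond(prompt)
        if rng.random() < rate:
            return reply[: len(reply) // 4]  # truncation stands in for a regression
        return reply
    return wrapped


class RollingEvalMonitor:
    """Alerts when the mean eval score over a full window falls below threshold."""

    def __init__(self, window: int, threshold: float) -> None:
        self.scores: Deque[float] = deque(maxlen=window)
        self.threshold = threshold
        self.seen = 0
        self.alert_at_request: Optional[int] = None

    def record(self, score: float) -> None:
        self.seen += 1
        self.scores.append(score)
        window_full = len(self.scores) == self.scores.maxlen
        if window_full and self.alert_at_request is None:
            if sum(self.scores) / len(self.scores) < self.threshold:
                self.alert_at_request = self.seen


def run_experiment(rate: float, requests: int = 500, seed: int = 0) -> Optional[int]:
    """Return the request count at which the alert fired, or None if it never did."""
    rng = random.Random(seed)

    def healthy(prompt: str) -> str:
        return "a complete and helpful answer to: " + prompt

    def score(reply: str) -> float:
        return 1.0 if len(reply) >= 20 else 0.0  # toy eval, assumption only

    chaotic = inject_degradation(healthy, rate, rng)
    monitor = RollingEvalMonitor(window=50, threshold=0.9)
    for i in range(requests):
        monitor.record(score(chaotic(f"question {i}")))
    return monitor.alert_at_request


no_chaos = run_experiment(rate=0.0)
full_chaos = run_experiment(rate=1.0)
```

Running it at intermediate rates (for example 5% or 20%) shows whether a given window and threshold would catch a partial regression at all. If a rate never produces an alert, that is the finding.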
Cost explosion simulation
Simulate a 10x traffic burst. Verify that rate limits engage and spending does not run away.
Simulate unusual request patterns. Long-context inputs, high-frequency requests from a single user. Verify abuse detection and rate limiting.
Cost alerts. Verify that cost anomaly detection actually fires during the simulated burst.
Budget guardrails. Verify automatic throttling or shutoff when costs exceed budget.
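A budget guardrail can be exercised against a simulated burst before relying on it in production. The sketch below is illustrative only: `BudgetGuard`, `simulate_burst`, the per-request cost, and the 80% alert point are assumptions, and amounts are kept in integer cents to avoid floating-point drift. It admits requests until the budget is spent, raises an alert flag on the way, and rejects the rest.

```python
from dataclasses import dataclass


@dataclass
class BudgetGuard:
    budget_cents: int
    alert_fraction: float = 0.8   # assumed alert point: 80% of budget
    spent_cents: int = 0
    rejected: int = 0
    alerted: bool = False

    def admit(self, est_cost_cents: int) -> bool:
        """Admit a request only if it keeps spend within budget."""
        if self.spent_cents + est_cost_cents > self.budget_cents:
            self.rejected += 1      # hard shutoff: request is not sent
            return False
        self.spent_cents += est_cost_cents
        if not self.alerted and self.spent_cents >= self.alert_fraction * self.budget_cents:
            self.alerted = True     # stand-in for paging or a cost alert
        return True


def simulate_burst(guard: BudgetGuard, baseline_rps: int, multiplier: int,
                   seconds: int, cost_cents: int) -> BudgetGuard:
    """Replay `seconds` of traffic at `multiplier` times the baseline rate."""
    for _ in range(seconds):
        for _ in range(baseline_rps * multiplier):
            guard.admit(cost_cents)
    return guard


# 10x burst for one minute against a 10 USD budget at 2 cents per request.
guard = simulate_burst(BudgetGuard(budget_cents=1000),
                       baseline_rps=10, multiplier=10, seconds=60, cost_cents=2)
```

The numbers to inspect afterwards are `guard.spent_cents` (it should never exceed the budget), `guard.alerted`, and `guard.rejected`. A real guard would also need a reset period and a decision about which traffic to shed first.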
Game day practice
Scheduled exercises. Quarterly in staging; annually in production with safeguards.
Scenarios developed in advance. Scripted; reviewed by leadership; communicated to team.
Rotate the responder. A different team member handles the response each game day. Builds broad capability; no single point of failure.
Facilitator role. Someone other than the responder manages the game day. Observes; coaches; ensures learnings are captured.
Time-boxed. 2-4 hours typical. Realistic without consuming too much team time.
Postmortem after each exercise
What worked. Automation that fired correctly; team response that was efficient.
What didn't. Detection gaps; runbook errors; communication breakdowns.
Action items. Specific fixes with owners and deadlines.
Knowledge sharing. Learnings shared broadly beyond participants.
Organizational buy-in
Chaos engineering requires leadership support. Without a clear mandate, the risk of disruption creates friction.
Communicate before exercises. Team, dependent teams, customer-facing teams. No surprises.
Start small. Early game days in low-risk environments. Build trust and capability before larger exercises.
Quantify value. Show incidents avoided or detected faster due to chaos practice.
Tooling
Chaos Monkey-style tools for AI. Rarely available off the shelf; typically custom-built to inject AI-specific failure modes.
Traffic shaping. Tools like Toxiproxy inject latency, timeouts, and connection resets at the TCP level; packet loss needs a lower-level tool such as tc netem. Useful for provider failure simulation.
Observability integration. Chaos scenarios must integrate with monitoring. Observe while chaos occurs.
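As one way to set up the traffic-shaping experiment described above, the sketch below builds the JSON bodies for Toxiproxy's HTTP admin API and posts them with the standard library. The proxy name, listen address, and upstream host (`api.provider.example`) are placeholders, and the sketch assumes a Toxiproxy server is already running on its default admin port, 8474. Nothing is sent until `start_latency_experiment()` is called.

```python
import json
import urllib.request
from typing import Any, Dict

TOXIPROXY_API = "http://localhost:8474"  # Toxiproxy's default admin address


def proxy_payload(name: str, listen: str, upstream: str) -> Dict[str, Any]:
    """Body for POST /proxies: a TCP proxy in front of the provider."""
    return {"name": name, "listen": listen, "upstream": upstream, "enabled": True}


def latency_toxic(latency_ms: int, jitter_ms: int = 0) -> Dict[str, Any]:
    """Body for POST /proxies/{name}/toxics: delay data from the provider."""
    return {
        "name": "provider_latency",
        "type": "latency",
        "stream": "downstream",
        "toxicity": 1.0,  # apply to every connection
        "attributes": {"latency": latency_ms, "jitter": jitter_ms},
    }


def post(path: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Send one JSON request to the Toxiproxy admin API."""
    req = urllib.request.Request(
        TOXIPROXY_API + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)


def start_latency_experiment() -> None:
    """Create the proxy, then add 3 s of latency (placeholder values)."""
    post("/proxies", proxy_payload("llm_provider", "127.0.0.1:8666",
                                   "api.provider.example:443"))
    post("/proxies/llm_provider/toxics", latency_toxic(latency_ms=3000, jitter_ms=500))


example_proxy = proxy_payload("llm_provider", "127.0.0.1:8666", "api.provider.example:443")
example_toxic = latency_toxic(latency_ms=3000, jitter_ms=500)
```

The application under test then has to be pointed at the proxy's listen address instead of the provider. Because Toxiproxy forwards raw TCP, TLS clients still validate against the provider's hostname, so how that redirection is arranged depends on the client and environment.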
Related practices
DR testing complements chaos. DR is planned failover; chaos is unplanned failure injection. See DR post.
Load testing and chaos are adjacent. Load test finds capacity limits; chaos injects failures. Combine for comprehensive testing.
Incident response training. Game days are fire drills for your team; they surface the same kinds of gaps that real incidents do.