Disaster recovery (DR) for AI systems means adapting standard DR patterns to AI-specific components: model serving, vector stores, training data, eval infrastructure, and provider dependencies. This post lays out a DR framework for AI workloads: what to replicate, how often, with what recovery time objective (RTO) and recovery point objective (RPO) targets, and the testing cadence that keeps DR real.
Model serving DR
Multi-region deployment. Primary and secondary regions active. Traffic failover automated on primary outage.
Provider fallback. Anthropic primary, OpenAI secondary (or vice versa). Separate providers have largely independent failure modes, so one can cover for the other's outage.
RTO target: minutes. User-facing serving must restore quickly. Automated failover essential.
RPO. Generally not applicable to stateless serving. In-flight sessions may be lost, which is acceptable for most AI use cases.
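The provider fallback described above can be sketched as a priority-ordered loop. This is a minimal illustration, not a production client: `complete_with_fallback`, `AllProvidersFailed`, and the `(name, call)` provider shape are hypothetical names chosen for this example, and each `call` stands in for whatever SDK wrapper you already have.

```python
from typing import Callable, Sequence, Tuple


class AllProvidersFailed(Exception):
    """Raised when every configured provider call has failed."""


# A provider is a (name, call) pair; `call` takes a prompt and returns text.
# Both the shape and the names are assumptions made for this sketch.
Provider = Tuple[str, Callable[[str], str]]


def complete_with_fallback(prompt: str, providers: Sequence[Provider]) -> Tuple[str, str]:
    """Try providers in priority order and return (provider_name, completion)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real client would catch specific timeout/5xx errors
            errors.append(f"{name}: {exc}")
    # Every provider failed: surface all errors so the incident log is complete.
    raise AllProvidersFailed("; ".join(errors))
```

Returning the provider name alongside the completion lets callers log which provider served each request, which is useful when prompts or output quality differ between providers.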
Vector store DR
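Cross-region replication. Pinecone, Weaviate, and Qdrant each offer replication or backup features, but cross-region support varies by product and plan, so confirm what yours provides. Configure it during initial setup, not during a disaster.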
Backup snapshots. Daily snapshots of vector indices and metadata. Retain 30-90 days typically.
RPO target: minutes to hours. Users can tolerate some loss of recently-indexed content in disaster; total loss not acceptable.
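RTO target: hours for a full rebuild, minutes for failover. With a replicated secondary region, failover takes minutes. Without one, recovery means restoring from a snapshot or reindexing from source; which is faster depends on corpus size and embedding throughput, so time both in a drill.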
Test restoration periodically. Backups that don't restore are worse than no backups, because they create false confidence.
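A restoration test can be as simple as comparing the restored index against a small manifest saved at snapshot time. The sketch below assumes you record a vector count and a handful of sample vectors with each snapshot; `verify_restore` and the `fetch` callable are hypothetical and stand in for your vector store's own lookup call.

```python
from typing import Callable, Dict, List, Optional, Sequence


def verify_restore(
    expected_count: int,
    expected_samples: Dict[str, Sequence[float]],
    restored_count: int,
    fetch: Callable[[str], Optional[Sequence[float]]],
    tol: float = 1e-6,
) -> List[str]:
    """Return a list of problems found; an empty list means the restore passed."""
    problems = []
    if restored_count != expected_count:
        problems.append(f"count mismatch: expected {expected_count}, got {restored_count}")
    for vec_id, expected in expected_samples.items():
        actual = fetch(vec_id)  # hypothetical lookup; returns None when the ID is absent
        if actual is None:
            problems.append(f"missing vector: {vec_id}")
        elif len(actual) != len(expected) or any(
            abs(a - e) > tol for a, e in zip(actual, expected)
        ):
            problems.append(f"vector differs: {vec_id}")
    return problems
```

A check like this does not prove the whole index is intact, but it catches empty, truncated, or wrong-version restores cheaply.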
Training data DR
Object storage with versioning. S3, GCS, Azure Blob all support versioning and replication. Enable both.
Dataset lineage tracking. Which dataset version trained which model? Track so you can reconstruct.
RTO target: hours to days. Training data loss is recoverable if lineage is preserved; models can be retrained.
RPO target: minutes to hours depending on data sensitivity. Replication lag determines actual RPO.
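Lineage tracking can start as one record per training run that pins the exact object versions used. The sketch below assumes your object store exposes a version ID per object (as versioned buckets do); `dataset_fingerprint` and `lineage_record` are hypothetical helpers, and the record layout is illustrative only.

```python
import hashlib
import json
from typing import Dict


def dataset_fingerprint(object_versions: Dict[str, str]) -> str:
    """Hash a mapping of object key -> storage version ID, independent of key order."""
    canonical = json.dumps(sorted(object_versions.items()), separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def lineage_record(model_version: str, object_versions: Dict[str, str]) -> dict:
    """Build a record tying a model version to the exact dataset it was trained on."""
    return {
        "model_version": model_version,
        "dataset_fingerprint": dataset_fingerprint(object_versions),
        "object_versions": dict(object_versions),  # copy so later edits don't alter the record
    }
```

Storing the records alongside the models (and backing them up the same way) means a retrain after data loss can request the same object versions rather than whatever is current.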
Eval infrastructure DR
Eval sets are intellectual property. Keep them version controlled, backed up, and replicated.
Eval harness code. Git-based; standard DR approach.
Historical eval results. Track over time; supports regression detection. Back up to object storage.
Provider risk considerations
Your DR plans depend on provider DR. Review provider SLAs and stated DR capabilities.
Multi-provider for critical workloads. Don't rely on a single provider for mission-critical AI.
Provider switching cost. API differences, prompt differences, quality differences. Minimize lock-in where possible.
Contractual protections. SLA credits are compensation, not DR. Large contracts can negotiate higher SLAs and stronger financial protection.
DR testing
Quarterly: failover drills in staging and a subset of production. Validate that the automation works; surface issues.
Annual: full regional failover exercise. Real traffic to secondary region for a period. Validates DR readiness comprehensively.
Game days. Team practices incident response for disaster scenarios. Builds muscle memory.
Measure RTO/RPO against targets. Adjust architecture when actuals miss targets.
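Measuring actuals against targets only needs three timestamps from each drill. The sketch below is one way to compute them; `DrillTimeline` and `evaluate_drill` are hypothetical names, and it assumes achieved RPO is the gap between the last write that reached the secondary and the start of the outage.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DrillTimeline:
    outage_start: datetime           # when the primary was taken down
    service_restored: datetime       # when the secondary served traffic successfully
    last_replicated_write: datetime  # newest write present on the secondary


def evaluate_drill(t: DrillTimeline, rto_target: timedelta, rpo_target: timedelta) -> dict:
    """Compare achieved RTO/RPO from a drill against the stated targets."""
    actual_rto = t.service_restored - t.outage_start
    # Clamp at zero: replication may have caught up after the outage began.
    actual_rpo = max(t.outage_start - t.last_replicated_write, timedelta(0))
    return {
        "actual_rto": actual_rto,
        "actual_rpo": actual_rpo,
        "rto_met": actual_rto <= rto_target,
        "rpo_met": actual_rpo <= rpo_target,
    }
```

Recording these results per drill gives a trend line, so a slow drift toward the target shows up before a real outage exposes it.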
Incident scenarios to plan for
Region-level outage (cloud provider). Failover to secondary region.
Model provider extended outage. Failover to secondary provider.
Data corruption. Restore from backup; validate data integrity.
Catastrophic model quality regression. Roll back to previous model version.
Security breach affecting data. Forensic investigation; restoration of affected data; customer communication.
Communication during disaster
Status page. Real-time customer communication. Honest about impact and timeline.
Internal war room. Executive involvement; cross-functional coordination. Incident command structure.
Customer escalations. Proactive outreach to enterprise customers; support team briefed to handle inbound.
Postmortem publication. For major incidents, public postmortem builds trust. Candid about cause and remediation.
Cost considerations
DR infrastructure has ongoing cost. Multi-region roughly doubles the cost of every component you replicate. Weigh that against the business impact of an outage.
Tiered DR. Critical workloads multi-region; others single-region with backup. Don't over-invest in DR for low-criticality systems.
Testing cost. DR testing consumes capacity. Budget for this.
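Tiered DR is easier to enforce when the tiers are written down as data rather than tribal knowledge. The sketch below encodes the two tiers described above; the tier names, the `DRPolicy` fields, and the drill cadences assigned to each tier are assumptions for illustration, not recommendations from a standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DRPolicy:
    multi_region: bool   # run an active secondary region
    backups: bool        # take and retain backups
    drill_cadence: str   # how often failover or restore is exercised


# Hypothetical tiers: critical workloads go multi-region,
# everything else stays single-region with backups.
POLICIES = {
    "critical": DRPolicy(multi_region=True, backups=True, drill_cadence="quarterly"),
    "standard": DRPolicy(multi_region=False, backups=True, drill_cadence="annual"),
}


def policy_for(tier: str) -> DRPolicy:
    """Look up the DR policy for a workload's criticality tier."""
    try:
        return POLICIES[tier]
    except KeyError:
        # Fail loudly so an unclassified workload is never silently left unprotected.
        raise ValueError(f"unknown criticality tier: {tier!r}") from None
```

Rejecting unknown tiers forces every workload to be classified explicitly, which is where most of the cost trade-off gets decided.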