Disaster recovery (DR) for AI systems means adapting standard DR patterns to AI-specific components: model serving, vector stores, training data, eval infrastructure, and provider dependencies. This post lays out a DR framework for AI workloads: what to replicate, how often, with what recovery time objective (RTO) and recovery point objective (RPO) targets, and the testing cadence that keeps DR real.

Critical components

Model serving: multi-region deployment, provider fallback, RTO of minutes. Vector stores: cross-region replication, backup snapshots, RPO of minutes to hours. Training data: object storage versioning and replication, RTO of hours to days.

Model serving DR

Multi-region deployment. Primary and secondary regions active. Traffic failover automated on primary outage.

Provider fallback. Anthropic primary, OpenAI secondary (or vice versa). Different failure modes across providers.
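A minimal sketch of provider fallback, assuming each provider is wrapped in a plain callable that takes a prompt and returns text. The names `complete_with_fallback`, `primary_provider`, and `secondary_provider` are hypothetical stand-ins, not real SDK calls; a production version would add timeouts, retries, and narrower exception handling.

```python
from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when every configured provider has failed."""


def complete_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each provider in priority order; return (provider_name, response)."""
    errors: list[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real client would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stand-ins for real provider SDK calls.
def primary_provider(prompt: str) -> str:
    raise TimeoutError("primary unavailable")


def secondary_provider(prompt: str) -> str:
    return f"echo: {prompt}"


used, reply = complete_with_fallback(
    "ping", [("primary", primary_provider), ("secondary", secondary_provider)]
)
```

Returning the provider name alongside the response makes it easy to log how often the fallback path is taken, which is itself a useful DR signal.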

RTO target: minutes. User-facing serving must restore quickly. Automated failover essential.

RPO. Generally not applicable for stateless serving. In-flight requests and session state may be lost, which is acceptable for most AI use cases.

Vector store DR

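Cross-region replication. Support differs across vector stores such as Pinecone, Weaviate, and Qdrant, and often depends on the deployment model and plan. Check whether yours offers managed cross-region replication or whether you need to dual-write or restore snapshots into a second region. Configure during initial setup, not during disaster.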

Backup snapshots. Daily snapshots of vector indices and metadata. Retain 30-90 days typically.
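The retention window can be enforced with a small pruning job. This is a sketch only: `snapshots_to_delete` is a hypothetical helper that works on snapshot dates, and the actual listing and deletion calls depend on your vector store or storage backend.

```python
from datetime import date, timedelta


def snapshots_to_delete(snapshot_dates, today, retention_days=30):
    """Return snapshot dates older than the retention window, oldest first."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in snapshot_dates if d < cutoff)


# Example: 45 daily snapshots, 30-day retention (illustrative dates).
today = date(2024, 6, 30)
snapshots = [today - timedelta(days=n) for n in range(45)]
expired = snapshots_to_delete(snapshots, today, retention_days=30)
```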

RPO target: minutes to hours. Users can tolerate some loss of recently-indexed content in disaster; total loss not acceptable.

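RTO target: hours for a cold restore. Failover to a warm secondary region should take minutes. Reindexing from source documents can be faster than cold restoration, but that depends on corpus size and embedding throughput, so measure both.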

Test restoration periodically. Backups that don't restore are worse than no backups, because they give false confidence.
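One way to make a restoration test concrete is to compare the restored index against the source. The sketch below uses in-memory dicts of id to vector as stand-ins for real indices; `verify_restore` and `vector_checksum` are hypothetical helpers, and against a real store you would fetch records through its client instead.

```python
import hashlib
import random


def vector_checksum(vector):
    """Stable checksum of a vector, rounded to absorb float formatting noise."""
    payload = ",".join(f"{x:.6f}" for x in vector)
    return hashlib.sha256(payload.encode()).hexdigest()


def verify_restore(source, restored, sample_size=100, seed=0):
    """Compare record counts and a random sample of vectors; return a list of problems."""
    problems = []
    if len(source) != len(restored):
        problems.append(f"count mismatch: {len(source)} vs {len(restored)}")
    rng = random.Random(seed)
    ids = sorted(source)
    for record_id in rng.sample(ids, min(sample_size, len(ids))):
        if record_id not in restored:
            problems.append(f"missing id: {record_id}")
        elif vector_checksum(source[record_id]) != vector_checksum(restored[record_id]):
            problems.append(f"vector mismatch: {record_id}")
    return problems


# Illustrative data: one clean restore and one that lost a record.
source_index = {f"doc-{i}": [i * 0.1, i * 0.2] for i in range(10)}
good_restore = dict(source_index)
bad_restore = {k: v for k, v in source_index.items() if k != "doc-3"}
```

An empty list from `verify_restore` means the sampled records matched. For a stronger check, also run a few known queries against the restored index and compare the top results.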

Training data DR

Object storage with versioning. S3, GCS, Azure Blob all support versioning and replication. Enable both.

Dataset lineage tracking. Which dataset version trained which model? Track it so you can reconstruct a model from its exact inputs.
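Lineage can be as simple as a record that ties a model version to a content hash of its training data. A minimal sketch, assuming the dataset fits the shape of a mapping from file name to bytes; `dataset_fingerprint`, `lineage_record`, and the field names are hypothetical, and a real pipeline would stream files from object storage rather than hold them in memory.

```python
import hashlib
import json


def dataset_fingerprint(files: dict[str, bytes]) -> str:
    """Hash per-file digests in sorted name order, so file order does not matter."""
    entries = sorted(
        (name, hashlib.sha256(data).hexdigest()) for name, data in files.items()
    )
    return hashlib.sha256(json.dumps(entries).encode()).hexdigest()


def lineage_record(model_version: str, dataset_name: str, files: dict[str, bytes]) -> dict:
    """Build a JSON-serializable record linking a model to its training data."""
    return {
        "model_version": model_version,
        "dataset": dataset_name,
        "dataset_sha256": dataset_fingerprint(files),
        "file_count": len(files),
    }


# Illustrative dataset and model version.
files = {"train-000.jsonl": b'{"text": "a"}\n', "train-001.jsonl": b'{"text": "b"}\n'}
record = lineage_record("model-v3", "support-tickets", files)
serialized = json.dumps(record, sort_keys=True)
```

Storing the serialized record next to the model artifact, and in a second location, means the lineage survives even if one store is lost.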

RTO target: hours to days. Training data loss is recoverable if lineage is preserved; models can be retrained.

RPO target: minutes to hours depending on data sensitivity. Replication lag determines actual RPO.

Eval infrastructure DR

Eval sets are intellectual property. Version controlled, backed up, replicated.

Eval harness code. Git-based; standard DR approach.

Historical eval results. Track over time; supports regression detection. Back up to object storage.
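Historical results are what make regression detection possible. A minimal sketch, assuming a single scalar score per eval run where higher is better; the window size and tolerance are illustrative assumptions, not recommended values.

```python
from statistics import mean


def detect_regression(history, latest, window=5, tolerance=0.02):
    """Flag a regression when `latest` falls more than `tolerance` below
    the mean of the last `window` historical scores."""
    if not history:
        return False  # nothing to compare against yet
    baseline = mean(history[-window:])
    return latest < baseline - tolerance


# Illustrative scores from past eval runs.
past_scores = [0.81, 0.82, 0.80, 0.83, 0.82]
regressed = detect_regression(past_scores, latest=0.75)
stable = detect_regression(past_scores, latest=0.81)
```

If the historical scores are lost, this comparison has no baseline, which is the practical reason to back them up.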

Provider risk considerations

Your DR plans depend on provider DR. Review provider SLAs and stated DR capabilities.

Multi-provider for critical workloads. Don't rely on a single provider for mission-critical AI.

Provider switching cost. API differences, prompt differences, quality differences. Minimize lock-in where possible.

Contractual protections. SLA credits are compensation, not DR. Large contracts can negotiate stronger SLAs and financial protection.

DR testing

Quarterly: failover drills in staging plus a subset of production. Validate that automation works; surface issues.

Annual: full regional failover exercise. Real traffic to secondary region for a period. Validates DR readiness comprehensively.

Game days. Team practices incident response for disaster scenarios. Builds muscle memory.

Measure RTO/RPO against targets. Adjust architecture when actuals miss targets.
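Measuring a drill comes down to three timestamps: when the outage started, when service was restored, and when the last write reached the replica. The sketch below derives RTO and RPO from those and compares them to targets; `DrillResult`, `missed_targets`, and the example times and targets are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    component: str
    outage_start: datetime
    service_restored: datetime
    last_replicated_write: datetime

    @property
    def rto(self) -> timedelta:
        """Time from outage start until service was restored."""
        return self.service_restored - self.outage_start

    @property
    def rpo(self) -> timedelta:
        """Window of writes lost: outage start minus last replicated write."""
        return self.outage_start - self.last_replicated_write


def missed_targets(result: DrillResult, rto_target: timedelta, rpo_target: timedelta) -> list[str]:
    """Return which of RTO/RPO exceeded its target during the drill."""
    missed = []
    if result.rto > rto_target:
        missed.append("RTO")
    if result.rpo > rpo_target:
        missed.append("RPO")
    return missed


# Illustrative drill: 45 minutes to restore, 10 minutes of writes lost.
drill = DrillResult(
    component="vector-store",
    outage_start=datetime(2024, 6, 1, 12, 0),
    service_restored=datetime(2024, 6, 1, 12, 45),
    last_replicated_write=datetime(2024, 6, 1, 11, 50),
)
misses = missed_targets(drill, rto_target=timedelta(minutes=30), rpo_target=timedelta(hours=1))
```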

Incident scenarios to plan for

Region-level outage (cloud provider). Failover to secondary region.

Model provider extended outage. Failover to secondary provider.

Data corruption. Restore from backup; validate data integrity.

Catastrophic model quality regression. Roll back to previous model version.

Security breach affecting data. Forensic investigation; affected data restoration; customer communication.

Communication during disaster

Status page. Real-time customer communication. Honest about impact and timeline.

Internal war room. Executive involvement; cross-functional coordination. Incident command structure.

Customer escalations. Proactive outreach to enterprise customers; support team briefed to handle inbound.

Postmortem publication. For major incidents, public postmortem builds trust. Candid about cause and remediation.

Cost considerations

DR infrastructure has ongoing cost. Multi-region roughly doubles some infrastructure costs. Weigh that against the business impact of an outage.

Tiered DR. Critical workloads multi-region; others single-region with backup. Don't over-invest in DR for low-criticality systems.
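A rough way to assign tiers is to compare expected outage loss with the added cost of multi-region. The decision rule and every number below are illustrative assumptions, and the rule is deliberately simplistic: it treats multi-region as removing the outage loss entirely, which it does not.

```python
def recommend_tier(outage_hours_per_year, cost_per_outage_hour, multi_region_annual_cost):
    """Recommend multi-region only when expected annual outage loss exceeds its cost."""
    expected_loss = outage_hours_per_year * cost_per_outage_hour
    if expected_loss > multi_region_annual_cost:
        return "multi-region"
    return "single-region with backup"


# Hypothetical workloads with made-up figures.
customer_assistant = recommend_tier(
    outage_hours_per_year=8, cost_per_outage_hour=20_000, multi_region_annual_cost=100_000
)
internal_summarizer = recommend_tier(
    outage_hours_per_year=8, cost_per_outage_hour=500, multi_region_annual_cost=100_000
)
```

Here the customer-facing workload lands in the multi-region tier and the internal tool does not, which matches the tiering advice above.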

Testing cost. DR testing consumes capacity. Budget for this.

AI disaster recovery: model servers, vector stores, and data
