
Disaster Recovery Interview Questions for Senior Engineers (2026)

Comprehensive disaster recovery interview questions with detailed answer frameworks covering RTO/RPO planning, failover strategies, data replication, chaos engineering, and business continuity patterns used at top technology companies.

20 min read · Updated Apr 20, 2026
Tags: interview-questions, disaster-recovery, senior-engineer, distributed-systems, reliability

Why Disaster Recovery Matters in Senior Engineering Interviews

Disaster recovery is one of the most critical competencies evaluated in senior and staff engineering interviews at top technology companies. When systems serve millions of users and process billions of dollars in transactions, the ability to design for failure and recover gracefully separates senior engineers from those who only build for the happy path. Companies like Google and Amazon have learned through painful experience that disaster recovery cannot be an afterthought — it must be woven into the fabric of every system from day one.

Interviewers evaluating disaster recovery knowledge are looking for candidates who understand the full spectrum of failure modes: from single-node crashes to entire region outages, from data corruption to cascading failures across microservices. They want to see that you can quantify recovery objectives, design systems that meet those objectives under real-world constraints, and communicate trade-offs between cost, complexity, and resilience. A senior engineer must demonstrate that they can lead an organization through both the planning and execution phases of disaster recovery.

The questions in this guide cover the breadth of disaster recovery topics that appear in interviews at FAANG and top-tier companies. Each question includes the interviewer's intent, a structured answer framework, and follow-up questions you should be prepared to address. For broader preparation context, explore our system design interview guide and distributed systems guide. You can also find structured learning paths that build disaster recovery knowledge progressively.

1. How would you design a disaster recovery strategy for a globally distributed e-commerce platform?

Interviewer's Intent: This question evaluates your ability to reason about multi-region architectures, data consistency trade-offs, and prioritization of services during recovery. The interviewer wants to see that you understand business impact analysis and can translate business requirements into technical recovery objectives.

Answer Framework:

Begin by establishing the recovery objectives. For a platform like Amazon's e-commerce system, you need to define RTO (Recovery Time Objective) and RPO (Recovery Point Objective) for each service tier. Critical path services like checkout and payment processing might require an RTO of under 30 seconds and RPO of zero (no data loss), while recommendation engines might tolerate minutes of downtime and hours of data staleness.

The architecture should employ an active-active multi-region deployment where traffic is distributed across at least three geographic regions. Each region maintains a full copy of the application stack. The data layer is the most challenging component. For the product catalog and user sessions, use asynchronous replication across regions, accepting eventual consistency for non-critical reads. For financial transactions, you need synchronous replication or consensus-based writes using a system like Spanner or CockroachDB, which adds write latency in exchange for strong consistency, the trade-off the CAP theorem describes.

Deploy a global traffic manager (like AWS Route 53 or Cloudflare) that performs health checks every 10 seconds and can redirect traffic away from a failing region within 30 seconds. Use a CDN to serve static assets from edge locations, providing resilience even when origin servers are degraded. Implement circuit breakers at every service boundary so that a failure in one service does not cascade to others.
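
As a concrete illustration of the circuit-breaker point, here is a minimal sketch of the pattern in Python. It is deliberately simplified: the failure threshold and cooldown are arbitrary illustrative values, and a production system would normally use a hardened library rather than hand-rolled code.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: fail fast after repeated errors, retry after a cooldown."""

    def __init__(self, failure_threshold=5, cooldown_seconds=30):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failure_count = 0
        self.opened_at = None  # None means the circuit is closed (calls are allowed)

    def call(self, fn, *args, **kwargs):
        half_open = False
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_seconds:
                raise RuntimeError("circuit open: failing fast without calling the dependency")
            half_open = True  # cooldown elapsed: allow a single trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            if half_open:
                self.opened_at = time.monotonic()  # trial call failed: reopen immediately
            else:
                self.failure_count += 1
                if self.failure_count >= self.failure_threshold:
                    self.opened_at = time.monotonic()
            raise
        self.opened_at = None
        self.failure_count = 0
        return result
```

Wrapping every outbound call to a downstream service in an instance of this breaker is what prevents one slow dependency from exhausting threads and cascading the failure upstream.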

For the data tier, implement a tiered backup strategy: continuous replication for hot standby, hourly snapshots for point-in-time recovery, and daily cold backups to a separate cloud provider for catastrophic scenarios. Test failover quarterly through chaos engineering exercises that simulate region-level outages.

Follow-up Questions:

  • How would you handle split-brain scenarios where two regions both believe they are the primary?
  • What is your strategy for replaying in-flight transactions that were lost during failover?
  • How do you validate data integrity after a region recovery?

2. Explain the difference between RTO and RPO and how you would determine appropriate values for different system tiers.

Interviewer's Intent: This tests your ability to connect technical metrics to business requirements and demonstrates that you can facilitate conversations between engineering and business stakeholders about acceptable risk levels.

Answer Framework:

RTO (Recovery Time Objective) defines the maximum acceptable duration of an outage — the time from when a disaster occurs to when the system is fully operational again. RPO (Recovery Point Objective) defines the maximum acceptable amount of data loss measured in time — if your RPO is one hour, you can afford to lose up to one hour of data.

Determining appropriate values requires a structured business impact analysis. Start by categorizing services into tiers based on revenue impact, user impact, and regulatory requirements. Tier 1 services (payment processing, authentication, core API) typically require RTO under 1 minute and RPO of zero. These services justify the cost of synchronous replication and active-active deployment. Tier 2 services (search, recommendations, notifications) might accept RTO of 5-15 minutes and RPO of 1-5 minutes, making asynchronous replication cost-effective. Tier 3 services (analytics, batch processing, internal tools) can tolerate RTO of hours and RPO of hours, where periodic backups and cold standby infrastructure suffice.
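
One practical way to keep these objectives actionable is to codify them as data that deployment tooling, dashboards, and alerting can read. The sketch below uses hypothetical service names and the illustrative tier targets from this answer.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RecoveryObjective:
    rto_seconds: int  # maximum acceptable downtime
    rpo_seconds: int  # maximum acceptable data loss, measured in time

# Illustrative tier targets; real values come from a business impact analysis.
TIERS = {
    "tier1": RecoveryObjective(rto_seconds=60, rpo_seconds=0),
    "tier2": RecoveryObjective(rto_seconds=15 * 60, rpo_seconds=5 * 60),
    "tier3": RecoveryObjective(rto_seconds=4 * 3600, rpo_seconds=4 * 3600),
}

# Hypothetical service catalog mapping each service to its tier.
SERVICE_TIERS = {
    "payments": "tier1",
    "search": "tier2",
    "analytics": "tier3",
}

def objective_for(service: str) -> RecoveryObjective:
    return TIERS[SERVICE_TIERS[service]]

print(objective_for("payments"))  # RecoveryObjective(rto_seconds=60, rpo_seconds=0)
```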

The key insight is that RTO and RPO have a direct relationship with cost. Achieving near-zero RTO and RPO requires active-active deployment with synchronous replication, which can double or triple infrastructure costs and add latency to every write operation. A senior engineer must articulate these trade-offs clearly and help the business make informed decisions. Document these decisions in a service-level agreement (SLA) that is reviewed quarterly as business requirements evolve.

When discussing this in an interview, reference real examples. Netflix's architecture accepts eventual consistency for viewing history (higher RPO) but requires immediate consistency for billing (zero RPO). This demonstrates that RPO is not a system-wide constant but varies per data domain.

Follow-up Questions:

  • How do you measure whether your actual recovery performance meets your stated RTO/RPO?
  • What happens when the cost of achieving a given RTO is prohibitive — how do you negotiate with stakeholders?
  • How do regulatory requirements (GDPR, SOX) influence RTO/RPO decisions?

3. How would you implement a failover strategy for a database system that handles both OLTP and OLAP workloads?

Interviewer's Intent: This evaluates your understanding of database architectures, replication topologies, and the unique challenges of mixed workloads during failover scenarios. The interviewer wants to see that you can reason about SQL vs NoSQL trade-offs in the context of resilience.

Answer Framework:

A system handling both OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing) workloads requires a differentiated failover strategy because these workloads have fundamentally different characteristics and recovery priorities.

For the OLTP component, implement synchronous replication to a hot standby in the same region and asynchronous replication to a warm standby in a secondary region. Use a consensus protocol (Raft or Paxos) for automatic leader election so that failover happens without human intervention. The OLTP failover must complete within seconds because every second of downtime directly impacts user-facing transactions. Configure connection poolers (like PgBouncer or ProxySQL) to detect primary failure and redirect connections to the new primary within the connection timeout window.

For the OLAP component, you have more flexibility. OLAP queries are typically long-running and can be retried, so the RPO can be higher. Implement a read replica topology where OLAP queries are directed to dedicated replicas. If these replicas fail, OLAP workloads can be degraded gracefully — queue incoming analytical queries and process them once the replica is restored. Never allow OLAP failover traffic to overwhelm the OLTP primary, as this would turn a partial outage into a complete one.

The critical architectural decision is separation of concerns. Use database replication with change data capture (CDC) to feed OLAP systems asynchronously. This means OLTP failover does not need to consider OLAP state, and vice versa. Tools like Debezium can stream changes from the OLTP database to the OLAP data warehouse with configurable lag tolerance.

Implement automated runbooks that execute during failover: promote the standby, update DNS or service discovery, verify replication lag is within acceptable bounds, run data integrity checks on critical tables, and notify the on-call team. Practice these runbooks monthly through scheduled failover drills.
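
A minimal sketch of such a runbook in Python might look like the following. Every helper is a placeholder for your own tooling (HA manager, service discovery, paging); the function names and the lag threshold are assumptions for illustration, not any vendor's API.

```python
MAX_ACCEPTABLE_LAG_SECONDS = 5  # illustrative threshold

def replication_lag_seconds(standby: str) -> float:
    return 0.8  # placeholder: query the standby for its current replication lag

def promote_standby(standby: str) -> None:
    print(f"promoting {standby} to primary")  # placeholder: call your HA tooling

def update_service_discovery(cluster: str, new_primary: str) -> None:
    print(f"{cluster}: routing writes to {new_primary}")  # placeholder: DNS / registry update

def verify_integrity(new_primary: str) -> None:
    print(f"running integrity checks on {new_primary}")  # placeholder: checksum critical tables

def notify_on_call(message: str) -> None:
    print(f"PAGE: {message}")  # placeholder: paging integration

def run_failover(cluster: str, standby: str) -> None:
    lag = replication_lag_seconds(standby)
    if lag > MAX_ACCEPTABLE_LAG_SECONDS:
        raise RuntimeError(f"standby lag {lag:.1f}s exceeds threshold; aborting failover")
    promote_standby(standby)
    update_service_discovery(cluster, new_primary=standby)
    verify_integrity(standby)
    notify_on_call(f"failover of {cluster} to {standby} complete")

if __name__ == "__main__":
    run_failover("orders-db", "orders-db-standby-2")
```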

Follow-up Questions:

  • How do you handle long-running OLAP queries that were in progress when failover occurred?
  • What is your strategy for rebuilding a failed primary as a new replica without impacting production traffic?
  • How do you prevent replication lag from causing stale reads during normal operation?

4. Describe how you would design a chaos engineering program for an organization that has never practiced it before.

Interviewer's Intent: This assesses your leadership and organizational skills alongside technical knowledge. The interviewer wants to see that you can introduce cultural change, manage risk progressively, and build confidence in disaster recovery through scientific experimentation.

Answer Framework:

Introducing chaos engineering requires a phased approach that builds organizational confidence progressively. Start with the smallest blast radius possible and expand only as the team demonstrates competence and the systems demonstrate resilience.

Phase 1 (Months 1-2): Game Days and Table-Top Exercises. Before injecting any real failures, conduct tabletop exercises where the team walks through hypothetical failure scenarios on a whiteboard. Identify which systems have no documented recovery procedures, which have untested backups, and which have single points of failure. This creates organizational buy-in by revealing gaps without causing any production impact.

Phase 2 (Months 2-4): Controlled Experiments in Staging. Deploy chaos engineering tools (Chaos Monkey, Litmus, or Gremlin) in staging environments. Start with simple experiments: kill a single pod, introduce 100ms of network latency, fill a disk to 90%. Define steady-state hypotheses before each experiment — "if we kill one pod, the system should continue serving traffic with less than 5% error rate increase." Document results and fix discovered weaknesses before proceeding.

Phase 3 (Months 4-8): Production Experiments with Guard Rails. Begin production chaos experiments during business hours when the full engineering team is available. Start with the most resilient services first. Implement automated abort mechanisms that halt the experiment if error rates exceed defined thresholds. A good starting experiment is terminating a single instance of a stateless service behind a load balancer — this should be completely invisible to users if auto-scaling and health checks are configured correctly.
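
The abort mechanism itself can be very simple. The sketch below assumes a hypothetical current_error_rate() hook into your metrics system; the thresholds and durations mirror the illustrative steady-state hypothesis above.

```python
import time

BASELINE_ERROR_RATE = 0.01                      # measured before the experiment starts
ABORT_THRESHOLD = BASELINE_ERROR_RATE + 0.05    # "less than 5% error rate increase"
EXPERIMENT_DURATION_SECONDS = 300
CHECK_INTERVAL_SECONDS = 10

def current_error_rate() -> float:
    return 0.012  # placeholder: query your metrics system for the live error rate

def run_guarded_experiment(inject_failure, rollback) -> bool:
    """Run the failure injection; abort and roll back if the hypothesis is violated."""
    inject_failure()
    deadline = time.monotonic() + EXPERIMENT_DURATION_SECONDS
    try:
        while time.monotonic() < deadline:
            if current_error_rate() > ABORT_THRESHOLD:
                print("steady-state hypothesis violated; aborting experiment")
                return False
            time.sleep(CHECK_INTERVAL_SECONDS)
        return True
    finally:
        rollback()  # always restore the system, whether the experiment passed or not
```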

Phase 4 (Months 8+): Advanced Scenarios. Progress to multi-node failures, network partitions between services, region-level simulations, and dependency failures. At this stage, the organization should have enough confidence and tooling to run experiments weekly.

Throughout all phases, maintain an experiment log that tracks hypotheses, results, and improvements made. This builds an institutional knowledge base and justifies continued investment in resilience. Companies like Netflix have published extensively on how chaos engineering improved their reliability — reference these case studies when justifying the program to leadership.

Follow-up Questions:

  • How do you convince skeptical leadership that intentionally breaking production is worth the risk?
  • What metrics do you use to measure the ROI of a chaos engineering program?
  • How do you handle a chaos experiment that causes an unexpected customer-facing outage?

5. How would you design a backup and restore system that can recover petabytes of data within your RTO constraints?

Interviewer's Intent: This tests your knowledge of storage systems, data tiering, parallelization strategies, and the practical challenges of recovering large datasets. The interviewer wants to see that you understand the physics of data movement and can design around those constraints.

Answer Framework:

Recovering petabytes of data within aggressive RTO constraints requires a multi-layered approach because the laws of physics impose hard limits on data transfer rates. A single network link, no matter how fast, cannot transfer a petabyte in minutes. The strategy must involve parallelism, data tiering, and progressive recovery.

First, classify data into hot, warm, and cold tiers. Hot data (actively accessed by users in the last hour) might represent only 1-5% of the total dataset but serves 80% of requests. Design the recovery strategy so that hot data is recovered first, allowing the system to resume serving most user traffic while warm and cold data recovers in the background. This is progressive recovery — the system comes online incrementally rather than waiting for a complete restore.

For the hot tier, maintain synchronous replicas using replication strategies that keep a byte-for-byte copy available at all times. This effectively achieves zero-time recovery for the most critical data. For the warm tier, use asynchronous replication with a lag target measured in seconds to minutes. For the cold tier, use periodic snapshots stored in object storage (S3, GCS) across multiple regions.

When a full restore is necessary, parallelize aggressively. Sharding your data across hundreds or thousands of shards means each shard's backup is independently restorable. If you have 1000 shards each containing 1TB, you can restore all 1000 shards simultaneously across 1000 nodes, achieving an effective restore rate of 1000x a single node's throughput. This requires pre-provisioned compute capacity in your recovery region — you cannot wait for cloud provider provisioning during a disaster.
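
A sketch of that fan-out in Python, using a thread pool to drive many concurrent shard restores; restore_shard() is a placeholder for your backup tooling, and the shard count and parallelism are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def restore_shard(shard_id: int) -> int:
    # Placeholder: download the shard's snapshot and load it onto its target node.
    return shard_id

def restore_all(shard_ids, max_parallel_restores=256):
    failed = []
    with ThreadPoolExecutor(max_workers=max_parallel_restores) as pool:
        futures = {pool.submit(restore_shard, s): s for s in shard_ids}
        for future in as_completed(futures):
            shard = futures[future]
            try:
                future.result()
            except Exception as exc:
                failed.append((shard, exc))  # retry failed shards in a second pass
    return failed

if __name__ == "__main__":
    failures = restore_all(range(1000))
    print(f"{len(failures)} shards need retry")
```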

Implement incremental backups with write-ahead log (WAL) archiving so that point-in-time recovery requires restoring a base snapshot plus replaying a small set of incremental changes. This dramatically reduces restore time compared to full backups. Test restore procedures monthly, measuring actual restore times and comparing them against stated RTOs. Track restore time as a service-level indicator (SLI) with alerts when drift occurs.

Follow-up Questions:

  • How do you verify the integrity of backups before you need them in a disaster?
  • What is your strategy for handling schema migrations that make old backups incompatible with current application code?
  • How do you manage the cost of maintaining hot standby capacity that is only used during disasters?

6. What is the role of data replication in disaster recovery, and how do you choose between synchronous and asynchronous replication?

Interviewer's Intent: This evaluates your understanding of consistency-availability trade-offs and your ability to make pragmatic architectural decisions based on business requirements rather than purely technical preferences.

Answer Framework:

Database replication is the foundation of most disaster recovery strategies because it maintains copies of data in multiple locations, enabling rapid failover when the primary copy becomes unavailable. The choice between synchronous and asynchronous replication is fundamentally a trade-off between data safety and system performance, directly connected to the CAP theorem.

Synchronous replication guarantees that every write is confirmed by at least one replica before acknowledging success to the client. This provides RPO of zero — no committed data can be lost during failover. However, it introduces latency on every write operation (equal to the round-trip time to the replica) and creates an availability dependency — if the replica is unreachable, writes must either block or fail. For within-region replication (sub-millisecond latency), synchronous replication is almost always acceptable. For cross-region replication (50-200ms latency), it significantly impacts user experience for write-heavy workloads.

Asynchronous replication acknowledges writes immediately on the primary and ships changes to replicas in the background. This provides better write latency and availability but introduces a replication lag window during which data on replicas is stale. If the primary fails, any data that was committed but not yet replicated is lost. The RPO equals the maximum replication lag at the time of failure.

Semi-synchronous replication offers a middle ground: writes are acknowledged after at least one local replica confirms but before remote replicas confirm. This provides zero RPO for local disasters (server failure, rack failure) while accepting potential data loss for regional disasters.

The decision framework should consider: (1) What is the RPO requirement for this data? If zero, synchronous is mandatory. (2) What is the write latency budget? If sub-10ms, cross-region synchronous replication is infeasible. (3) What is the availability requirement? If 99.999%, synchronous replication to a remote site creates an additional failure mode. (4) What is the data volume? High-throughput systems may overwhelm synchronous replication links.
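
To make the framework concrete, a small helper like the sketch below can encode the first two checks; the thresholds are the illustrative numbers from this answer, not universal rules.

```python
def choose_replication_mode(rpo_seconds: float,
                            write_latency_budget_ms: float,
                            cross_region_rtt_ms: float) -> str:
    if rpo_seconds == 0:
        if cross_region_rtt_ms > write_latency_budget_ms:
            # Zero RPO is demanded but the latency budget rules out cross-region
            # synchronous writes: replicate synchronously within the region and
            # escalate the conflicting requirements to stakeholders.
            return "synchronous within region, asynchronous cross-region (requirements conflict)"
        return "synchronous cross-region"
    return "asynchronous cross-region"

print(choose_replication_mode(rpo_seconds=0, write_latency_budget_ms=10, cross_region_rtt_ms=80))
```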

In practice, most production systems use a combination: synchronous replication within a region for high availability and asynchronous replication across regions for disaster recovery, accepting that a regional disaster may lose seconds of data.

Follow-up Questions:

  • How do you monitor replication lag and alert before it becomes dangerous?
  • What happens to in-flight transactions when you promote an asynchronous replica to primary?
  • How does conflict resolution work in multi-master replication setups?

7. How would you handle a disaster recovery scenario where your primary cloud provider experiences a complete outage?

Interviewer's Intent: This tests your thinking about cloud provider dependency, multi-cloud strategies, and the practical challenges of maintaining portability without over-engineering. The interviewer wants to see realistic assessment of trade-offs.

Answer Framework:

A complete cloud provider outage is a low-probability, high-impact event that requires careful cost-benefit analysis. The fully multi-cloud approach (running active-active across AWS, GCP, and Azure simultaneously) provides maximum resilience but introduces enormous complexity and cost. Most organizations should instead adopt a pragmatic middle ground.

The recommended strategy has three layers. First, design for portability without requiring active multi-cloud deployment. Containerize all workloads with Kubernetes, use infrastructure-as-code (Terraform) with provider-agnostic abstractions where possible, and avoid deep dependencies on proprietary managed services for critical path functionality. This means you could deploy to another cloud provider within hours to days, not weeks to months.

Second, maintain cold or warm standby infrastructure in a secondary cloud provider for your most critical services only. The top 3-5 services that generate revenue should have deployment manifests, pre-configured networking, and recent data snapshots available in the secondary provider. This reduces recovery time from days to hours for the services that matter most.

Third, use provider-independent edge services where possible. A CDN provider like Cloudflare or Fastly that is independent of your primary cloud can continue serving cached content and static assets even during a complete provider outage. DNS should be managed by a provider independent of your primary cloud.

The key insight to communicate in an interview is that multi-cloud disaster recovery is not binary. The question is not "are we multi-cloud or not?" but rather "which specific services justify the cost of multi-cloud redundancy, and what is our recovery time target for a full-provider outage?" For most organizations, the honest answer is: critical services recover in 4-8 hours via pre-prepared infrastructure in a secondary provider, while non-critical services accept 24-48 hours of downtime. This is dramatically cheaper than full active-active multi-cloud and provides adequate protection for an event with extremely low probability.

Follow-up Questions:

  • How do you keep secondary-provider infrastructure current without the ongoing cost of active usage?
  • What data portability challenges arise with provider-specific storage formats?
  • How do you handle authentication and secret management across cloud providers during failover?

8. Describe how you would implement automated failover detection and triggering without causing false positives.

Interviewer's Intent: This evaluates your understanding of distributed systems consensus, health checking strategies, and the dangers of split-brain scenarios. False positive failovers can be more damaging than the original outage.

Answer Framework:

Automated failover detection is one of the most dangerous automation problems in distributed systems. A false positive — triggering failover when the primary is actually healthy — can cause data loss, split-brain scenarios, or cascading failures that are worse than the original problem. The design must prioritize correctness over speed.

Implement multi-layer health checking with consensus-based decision making. Layer 1: Application-level health checks that verify the service can actually process requests (not just TCP connectivity). These should exercise the critical path — can the database execute a write and read it back? Layer 2: Infrastructure-level checks from multiple independent vantage points. A single health checker might have network issues to the target while the target is healthy. Require agreement from at least 3 independent checkers in different network locations before declaring a service unhealthy. Layer 3: Cross-verification with business metrics — if the service is reported unhealthy but business transactions are still succeeding, something is wrong with the health checking, not the service.

The decision algorithm should use a quorum approach inspired by consensus protocols. Define a failover decision committee of N observers (where N >= 5). Require a supermajority (e.g., 4 out of 5) to agree that the primary is unhealthy for a sustained period (e.g., 30 seconds) before triggering failover. This sustained-period requirement prevents transient network blips from triggering unnecessary failovers.
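
A sketch of that decision loop, with poll_observers() as a hypothetical hook that returns how many independent checkers currently consider the primary unhealthy; the observer count, supermajority, and 30-second window mirror the numbers above.

```python
import time

SUPERMAJORITY = 4          # out of 5 observers
SUSTAINED_SECONDS = 30
POLL_INTERVAL_SECONDS = 5

def poll_observers() -> int:
    return 0  # placeholder: ask each independent health checker for its verdict

def wait_for_failover_condition() -> None:
    """Block until a supermajority has agreed, continuously, for the sustained period."""
    unhealthy_since = None
    while True:
        if poll_observers() >= SUPERMAJORITY:
            if unhealthy_since is None:
                unhealthy_since = time.monotonic()
            elif time.monotonic() - unhealthy_since >= SUSTAINED_SECONDS:
                return  # sustained supermajority agreement: trigger failover
        else:
            unhealthy_since = None  # agreement lost: reset the clock
        time.sleep(POLL_INTERVAL_SECONDS)
```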

Implement a fencing mechanism to prevent split-brain. Before the standby promotes itself to primary, it must acquire a distributed lock (via a separate coordination service like ZooKeeper or etcd) and revoke the old primary's ability to accept writes. This can be implemented through STONITH (Shoot The Other Node In The Head) mechanisms or lease-based systems where the primary must continuously renew its lease to remain authoritative.
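
A minimal sketch of the lease-renewal loop on the primary side; acquire_lease() and renew_lease() are hypothetical wrappers around your coordination service (etcd, ZooKeeper), and the TTL and renewal interval are illustrative assumptions.

```python
import time

LEASE_TTL_SECONDS = 10
RENEW_INTERVAL_SECONDS = 3  # renew well before the TTL expires

def acquire_lease(node_id: str) -> bool:
    return True  # placeholder: atomically create the lease key if no other node holds it

def renew_lease(node_id: str) -> bool:
    return True  # placeholder: extend the lease; fails if another node now holds it

def serve_as_primary(node_id: str, start_accepting_writes, stop_accepting_writes) -> None:
    if not acquire_lease(node_id):
        return  # another node holds the lease: remain a standby
    start_accepting_writes()
    while renew_lease(node_id):
        time.sleep(RENEW_INTERVAL_SECONDS)
    # Lease lost: stop accepting writes immediately so a newly promoted primary
    # cannot conflict with this node. This is the fencing guarantee.
    stop_accepting_writes()
```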

Finally, implement a progressive response. Not every health check failure should immediately trigger a full region failover. Design a graduated response: first retry, then circuit-break affected endpoints, then failover individual services, and only as a last resort failover an entire region. Each escalation level requires stronger evidence and more observer agreement.

Follow-up Questions:

  • How do you handle the scenario where the network is partitioned and both sides believe they should be primary?
  • What is your strategy for automatic failback once the original primary recovers?
  • How do you test failover automation without impacting production?

9. How do you design disaster recovery for stateful services like databases differently than stateless services?

Interviewer's Intent: This tests your understanding of the fundamental distinction between stateful and stateless systems and how state management complexity affects recovery strategies.

Answer Framework:

Stateless services and stateful services require fundamentally different disaster recovery approaches because their failure modes and recovery challenges are entirely different.

Stateless services (API gateways, web servers, computation workers) carry no persistent state between requests. Their disaster recovery is relatively straightforward: maintain sufficient capacity across multiple availability zones or regions, use health-check-based load balancing to route around failures, and ensure auto-scaling can provision replacement instances quickly. The recovery time is bounded by instance startup time and health check intervals — typically seconds to low minutes. The key design principles are immutable infrastructure (instances are replaced, never repaired), over-provisioning (maintain enough headroom to absorb an AZ failure without scaling events), and stateless design patterns (externalize all state to dedicated state stores).

Stateful services (databases, message queues, object stores, caches) are dramatically more complex because they hold data that cannot be regenerated from other sources. Their disaster recovery strategy must address: data durability (preventing data loss through replication), consistency (ensuring recovered state is coherent), and recovery ordering (some stateful services depend on other stateful services and must recover in the correct order).

For stateful services, the recovery strategy depends on the consistency requirement. For services requiring strong consistency (financial databases, inventory systems), implement synchronous replication with automatic leader election using Raft or Paxos consensus. The CAP theorem tells us this sacrifices some availability during network partitions. For services tolerating eventual consistency (user preferences, analytics stores, caches), asynchronous replication with conflict resolution provides better availability and performance at the cost of potential temporary inconsistency.

A critical consideration for stateful services is recovery ordering. If Service A's database contains foreign key references to Service B's database, Service B must be recovered first. Document these dependencies in a directed acyclic graph (DAG) and automate recovery orchestration to respect the ordering. This dependency mapping is often the most overlooked aspect of disaster recovery planning.
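
The ordering itself can be computed with a standard topological sort. The sketch below uses Python's graphlib with hypothetical service names; each service maps to the set of services that must be recovered before it.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each service lists what must be recovered first.
RECOVERY_DEPENDENCIES = {
    "order-db": {"user-db"},                       # orders reference user records
    "order-service": {"order-db", "message-queue"},
    "user-db": set(),
    "message-queue": set(),
}

recovery_order = list(TopologicalSorter(RECOVERY_DEPENDENCIES).static_order())
print(recovery_order)  # e.g. ['user-db', 'message-queue', 'order-db', 'order-service']
```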

Follow-up Questions:

  • How do you handle services that are partially stateful (e.g., maintain local caches that affect behavior)?
  • What is your strategy for recovering message queues without losing or duplicating messages?
  • How do you test the recovery ordering of interdependent stateful services?

10. What strategies would you use to ensure data consistency across microservices after a disaster recovery event?

Interviewer's Intent: This evaluates your understanding of distributed data management, saga patterns, and the practical challenges of achieving consistency in a microservices architecture where each service owns its own database.

Answer Framework:

In a microservices architecture, each service typically owns its data store, making cross-service consistency one of the hardest problems during disaster recovery. Unlike a monolithic application with a single database where recovery is atomic, recovering a microservices system requires reconciling potentially inconsistent state across dozens of independent databases.

The first strategy is event-driven reconciliation. If services communicate through an event bus (Kafka, Pulsar), the event log serves as the source of truth. After recovery, each service can replay events from the last known consistent checkpoint to rebuild its state. This requires that the event bus itself has robust disaster recovery (Kafka's multi-datacenter replication via MirrorMaker 2 or Confluent Replicator). The event log provides a natural consistency boundary — if all services replay to the same event offset, they achieve consistency.
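
A sketch of that replay loop, assuming the kafka-python client; the topic, partition, and checkpoint offset are illustrative, and apply_event() stands in for the service's own idempotent state-rebuilding logic.

```python
from kafka import KafkaConsumer, TopicPartition

# Last offset known to be reflected in the restored snapshot (illustrative value).
CHECKPOINT_OFFSET = 1_284_500

def apply_event(event_bytes: bytes) -> None:
    pass  # placeholder: update the service's local store from the event (must be idempotent)

consumer = KafkaConsumer(bootstrap_servers="kafka:9092", enable_auto_commit=False)
partition = TopicPartition("orders-events", 0)
consumer.assign([partition])
consumer.seek(partition, CHECKPOINT_OFFSET)

# Replay everything from the checkpoint forward to rebuild local state.
for message in consumer:
    apply_event(message.value)
```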

The second strategy is saga-based compensation. For business transactions that span multiple services, implement the saga pattern with explicit compensation logic. During disaster recovery, identify in-flight sagas (transactions that started but did not complete) and either complete them forward or compensate them backward. This requires a saga orchestrator that persists saga state independently and can resume incomplete sagas after recovery.

The third strategy is consistency checking and repair. After recovery, run automated consistency checkers that verify cross-service invariants. For example, if the Order Service says an order was placed, the Inventory Service should show a corresponding stock reservation. When inconsistencies are detected, use a deterministic resolution strategy: either the upstream service (source of the event) is authoritative, or a dedicated reconciliation service compares states and emits corrective events.

The fourth strategy is temporal decoupling. Design services so they can operate in a degraded mode with stale data from other services. If the Product Service is recovered but the Pricing Service is still recovering, the Product Service should serve products with cached prices rather than failing entirely. This requires each service to maintain its own materialized view of data it needs from other services, updated asynchronously via events.

Document the consistency model explicitly: which cross-service invariants are guaranteed, which are eventually consistent, and what is the maximum inconsistency window after recovery.

Follow-up Questions:

  • How do you handle idempotency when replaying events during recovery — preventing duplicate side effects like double-charging customers?
  • What is your strategy when the event bus itself is the failed component?
  • How do you prioritize which services to recover first in a large microservices system?

11. How would you design a disaster recovery plan that accounts for data corruption rather than infrastructure failure?

Interviewer's Intent: This tests whether you think beyond infrastructure failures to logical failures. Data corruption (from bugs, admin errors, or security breaches) requires different recovery mechanisms than hardware failures because replicas faithfully replicate corrupted data.

Answer Framework:

Data corruption is uniquely dangerous because it can go undetected for hours or days, and by the time it is discovered, all replicas and recent backups may contain the corrupted data. Traditional replication-based disaster recovery actually makes corruption worse by spreading it to every copy. This requires a fundamentally different approach.

The primary defense is point-in-time recovery (PITR) capability with sufficient retention. Maintain continuous backups (WAL archiving for databases, event log retention for streaming systems) going back at least 30 days. This allows you to identify the exact moment corruption was introduced and restore to the instant before it occurred. The granularity of your PITR determines your RPO for corruption events — WAL-based systems can recover to within a single transaction.

Implement immutable audit logs that record all write operations with full before/after state. Store these logs in append-only storage (write-once-read-many) that cannot be modified even by administrators. This serves dual purposes: forensic analysis to understand what went wrong, and surgical recovery where you can undo specific corrupted writes without rolling back everything.

Deploy continuous data integrity verification. Run background processes that validate data invariants, checksums, referential integrity, and business rule compliance. These checkers should run on a schedule (hourly for critical data) and alert immediately when violations are detected. The faster you detect corruption, the smaller the blast radius and the less data you need to recover.
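
A sketch of what such a checker might look like; the invariants are hypothetical examples, and fetch_scalar() and alert() are placeholders for your database client and paging integration.

```python
def fetch_scalar(sql: str) -> int:
    return 0  # placeholder: run the query and return a single number

def alert(message: str) -> None:
    print(f"ALERT: {message}")  # placeholder: page the on-call team

def run_integrity_checks() -> None:
    # Invariant 1: no order may reference a non-existent customer.
    orphans = fetch_scalar(
        "SELECT count(*) FROM orders o LEFT JOIN customers c "
        "ON o.customer_id = c.id WHERE c.id IS NULL")
    if orphans:
        alert(f"{orphans} orders reference missing customers")

    # Invariant 2: ledger debits and credits must balance.
    imbalance = fetch_scalar("SELECT abs(sum(amount)) FROM ledger_entries")
    if imbalance:
        alert(f"ledger out of balance by {imbalance}")
```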

For defense in depth, maintain at least one backup copy that is intentionally delayed by 24-48 hours. This "time-delayed replica" serves as a guaranteed clean copy for corruption events that are detected within the delay window. It should be read-only and isolated from the production network so that neither replication nor a compromised admin can reach it.

Finally, implement write-path validation that prevents obviously invalid data from being committed in the first place. Schema constraints, application-level validation, and anomaly detection on write patterns (sudden spike in deletes, unexpected bulk updates) can catch many corruption events at the point of introduction.

Follow-up Questions:

  • How do you handle a scenario where corruption is discovered after your PITR retention window has expired?
  • What is your strategy for partial recovery — restoring corrupted data without affecting non-corrupted data that has since been updated?
  • How do you prevent malicious insiders from corrupting data and destroying the audit logs?

12. Explain how you would conduct a disaster recovery drill for a production system without impacting users.

Interviewer's Intent: This evaluates your operational maturity and ability to balance the need for realistic testing with the requirement to protect production stability. The interviewer wants to see that you can design safe experiments that still provide meaningful confidence.

Answer Framework:

Disaster recovery drills must be realistic enough to build genuine confidence but controlled enough to avoid becoming the disaster they are meant to prevent. The key principle is progressive confidence building with strict blast radius controls.

Before any production drill, establish prerequisites: (1) A detailed runbook for the drill scenario. (2) A rollback plan for every step with clear abort criteria. (3) All on-call engineers aware and available. (4) Customer-facing dashboards monitored in real-time. (5) A designated drill conductor with authority to abort.

For database failover drills, the safest approach is to initiate a planned failover during a low-traffic window. Modern databases (RDS Multi-AZ, Cloud SQL HA) support planned failover that briefly pauses writes (typically 20-60 seconds) and promotes the standby. This is minimally impactful and tests the most critical recovery mechanism. Monitor connection errors, retry behavior, and application recovery after the failover. Verify that all applications reconnect successfully and that no data was lost by comparing pre-failover and post-failover checksums.

For region-level drills, use traffic engineering rather than actual infrastructure destruction. Gradually shift traffic away from the target region over 30 minutes (not instantaneously), monitor error rates and latency in the receiving regions, then verify the system operates correctly without the target region. Only after confirming stability should you optionally stop services in the evacuated region to verify detection and alerting. This approach is safe because traffic has already been shifted before anything is stopped.
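
A sketch of that gradual evacuation; set_region_weight() and error_rate() are hypothetical hooks into your traffic manager and metrics system, and the step sizes, interval, and abort threshold are illustrative.

```python
import time

DRAIN_STEPS = [80, 60, 40, 20, 0]  # percentage of traffic left in the target region
STEP_INTERVAL_SECONDS = 6 * 60     # roughly 30 minutes end to end
ABORT_ERROR_RATE = 0.02

def set_region_weight(region: str, percent: int) -> None:
    print(f"{region}: {percent}% of traffic")  # placeholder: weighted DNS / load balancer API

def error_rate(region: str) -> float:
    return 0.001  # placeholder: live error rate in a region absorbing the traffic

def evacuate_region(target: str, receiving_regions: list[str]) -> bool:
    for weight in DRAIN_STEPS:
        set_region_weight(target, weight)
        time.sleep(STEP_INTERVAL_SECONDS)
        if any(error_rate(r) > ABORT_ERROR_RATE for r in receiving_regions):
            set_region_weight(target, 100)  # abort: restore traffic to the target region
            return False
    return True
```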

For complete production chaos exercises, use the pattern pioneered by Netflix: inject failures that the system should handle transparently. If the drill causes user impact, the drill revealed a real gap. Document the gap, fix it, and re-run. Over time, you can inject progressively more severe failures as the system demonstrates resilience.

After every drill, conduct a retrospective: What worked? What was surprising? What would have failed in a real disaster that the drill did not test? Use these learnings to improve both the system and the next drill.

Follow-up Questions:

  • How often should disaster recovery drills be conducted, and how do you prevent drill fatigue?
  • How do you simulate a disaster that takes out your monitoring and alerting infrastructure?
  • What legal or compliance requirements affect how you conduct drills in regulated industries?

13. How do you design disaster recovery for systems that depend on third-party APIs and services you do not control?

Interviewer's Intent: This tests your ability to design resilient systems at organizational boundaries where you have limited control. The interviewer wants to see that you can protect your system from failures in external dependencies.

Answer Framework:

Third-party dependencies represent a unique disaster recovery challenge because you cannot implement failover, replication, or chaos engineering on systems you do not own. The strategy must focus on isolation, graceful degradation, and maintaining local capability when external services are unavailable.

The first principle is isolation: apply the bulkhead pattern and circuit breakers around third-party dependencies so that their failure does not cascade into your system. If a payment provider goes down, the checkout service should queue payments for later processing rather than failing the entire order flow. Implement timeout budgets and fallback behaviors for every external call.

The second principle is local caching of critical external data. If your system depends on a third-party for exchange rates, product data, or identity verification, maintain a local cache that can serve requests when the external service is unavailable. Define staleness tolerances — cached exchange rates might be acceptable for 1 hour, while identity verification results might be cached for 24 hours. For CDN dependencies, maintain origin serving capability so you can bypass the CDN entirely if needed.
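
A sketch of the stale-cache fallback for the exchange-rate example; the provider call is a placeholder that simulates an outage, and the one-hour staleness tolerance mirrors the figure above.

```python
import time

STALENESS_TOLERANCE_SECONDS = 3600
_cache: dict[str, tuple[float, float]] = {}  # currency -> (rate, fetched_at)

def fetch_exchange_rate_from_provider(currency: str) -> float:
    # Placeholder: a real implementation makes an HTTP call; raising here simulates an outage.
    raise TimeoutError("provider unavailable")

def exchange_rate(currency: str) -> float:
    try:
        rate = fetch_exchange_rate_from_provider(currency)
        _cache[currency] = (rate, time.time())
        return rate
    except Exception:
        if currency in _cache:
            rate, fetched_at = _cache[currency]
            if time.time() - fetched_at <= STALENESS_TOLERANCE_SECONDS:
                return rate  # degrade gracefully with recent-enough cached data
        raise  # no acceptable cached value: surface the failure to the caller
```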

The third principle is multi-provider strategies for critical capabilities. For payment processing, integrate with at least two providers (e.g., Stripe and Adyen) and implement automatic failover between them. For SMS delivery, integrate with multiple carriers. The additional integration cost is justified for capabilities where a third-party outage would directly prevent revenue generation.

The fourth principle is contractual and operational preparation. Maintain vendor SLAs that specify their disaster recovery commitments. Understand their planned maintenance windows. Subscribe to their status pages. For the most critical vendors, establish direct communication channels with their engineering teams so you get early warning of issues rather than discovering them through customer complaints.

Document each third-party dependency with: the business capability it provides, the degraded behavior when it is unavailable, the local cache duration, alternative providers if any, and the maximum tolerable outage duration before business impact becomes critical.

Follow-up Questions:

  • How do you handle a scenario where a third-party permanently shuts down their API with short notice?
  • What is your strategy for testing third-party failure modes without actually causing their service to fail?
  • How do you manage secrets and credentials rotation for third-party services during disaster recovery?

14. What is the role of observability in disaster recovery, and how would you design a monitoring system that remains operational during disasters?

Interviewer's Intent: This tests your understanding that monitoring is itself a critical system that must survive the disasters it is meant to help you recover from. The interviewer wants to see that you do not take observability for granted during failure scenarios.

Answer Framework:

Observability is the eyes and ears of disaster recovery — without it, you cannot detect disasters, assess their scope, make recovery decisions, or verify that recovery was successful. Ironically, many organizations deploy their monitoring infrastructure alongside their application infrastructure, meaning that the same disaster that takes down the application also blinds the team trying to recover it.

The monitoring system must be architecturally independent from the systems it monitors. This means: (1) Deploy monitoring in a separate availability zone or region from the primary application infrastructure. (2) Use a separate cloud account or even a separate cloud provider to avoid shared-fate failures. (3) Ensure network connectivity between the monitoring system and the monitored infrastructure does not share physical paths with production traffic.

Design the monitoring system with its own disaster recovery strategy. Use multi-region deployment for your observability stack (Prometheus, Grafana, or your chosen tools). Ensure alerting channels (PagerDuty, Opsgenie) are independent of your primary infrastructure — you should receive alerts even if your entire primary region is down. Implement heartbeat monitoring where the monitoring system is itself monitored by an external service that alerts if it goes silent.
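
A sketch of the heartbeat loop on the monitoring side; the URL is a hypothetical placeholder for whichever external heartbeat service you use, and the interval is illustrative.

```python
import time
import urllib.request

HEARTBEAT_URL = "https://heartbeat.example.com/ping/monitoring-stack"  # hypothetical endpoint
HEARTBEAT_INTERVAL_SECONDS = 60

def send_heartbeats() -> None:
    while True:
        try:
            urllib.request.urlopen(HEARTBEAT_URL, timeout=5)
        except OSError:
            # If this process cannot reach the heartbeat service, the external
            # service notices the silence and pages the on-call team anyway.
            pass
        time.sleep(HEARTBEAT_INTERVAL_SECONDS)
```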

During disaster recovery, observability serves four critical functions: Detection (automated alerts that trigger the disaster recovery process), Assessment (dashboards showing the scope and severity of the failure), Guidance (real-time metrics showing which recovery steps are progressing and which are blocked), and Verification (confirmation that recovered systems are healthy and serving traffic correctly).

Pre-build disaster-specific dashboards that show exactly the metrics needed during recovery: replication lag, data integrity check results, traffic distribution across regions, error rates by service, and recovery progress indicators. The team should not be building dashboards during a disaster — they should already exist and be tested during drills.

Maintain offline runbooks (PDF, printed copies) that document recovery procedures without requiring access to any online system. When everything is down, the team needs to know what to do without depending on Confluence or Notion being available.

Follow-up Questions:

  • How do you handle alert storms during a major outage without desensitizing the on-call team?
  • What metrics are most important to track during active disaster recovery versus normal operations?
  • How do you maintain observability into systems during a network partition?

15. How would you design a disaster recovery architecture that meets compliance requirements (SOC 2, HIPAA, PCI-DSS) while remaining operationally practical?

Interviewer's Intent: This evaluates your ability to navigate the intersection of technical architecture and regulatory requirements. The interviewer wants to see that you can satisfy auditors without creating an operationally unusable system.

Answer Framework:

Compliance frameworks like SOC 2, HIPAA, and PCI-DSS all include requirements around disaster recovery, but they differ significantly in specificity and enforcement. The key is to design a system that satisfies the strictest applicable requirements while remaining operationally practical for the engineering team.

SOC 2 requires that you have documented disaster recovery plans, that you test them regularly, and that you can demonstrate recovery within stated objectives. It does not prescribe specific technical implementations. HIPAA requires that electronic protected health information (ePHI) is recoverable and that you have contingency plans for emergencies. PCI-DSS requires that cardholder data environments have tested disaster recovery procedures with specific documentation requirements.

The architecture should address several cross-cutting compliance concerns. First, encryption: all data at rest and in transit must be encrypted, including backups, replicas, and data in the recovery environment. Key management must be independent of the primary infrastructure — if your KMS is unavailable during disaster, you cannot decrypt your backups. Maintain key escrow in a separate system or HSM.

Second, access control: during disaster recovery, there is a natural temptation to bypass access controls for speed. The architecture must provide emergency access procedures (break-glass) that are fully audited. Pre-provision emergency access credentials stored in a physical safe or separate secret management system. Every action taken during recovery must be logged to an immutable audit trail for post-incident compliance review.

Third, data residency: many regulations restrict where data can be stored geographically. Your disaster recovery region must comply with the same data residency requirements as your primary region. This may restrict which regions you can use for replication and backup storage, potentially eliminating some geographic options.

Fourth, testing documentation: all compliance frameworks require evidence that DR procedures are tested regularly. Design your drill program to produce artifacts that satisfy auditors: dated test reports, success/failure metrics, remediation timelines for discovered gaps, and sign-off from responsible parties. Maintain these records for the audit retention period (typically 7 years for financial regulations).

The practical approach is to implement disaster recovery that satisfies operational needs first, then layer compliance controls on top. A system that is auditable but cannot actually recover from disasters serves no one. Start with the technical architecture, then validate it against each applicable compliance framework and add controls where gaps exist.

When budgeting, expect compliance-grade disaster recovery to add roughly 30-50% to infrastructure costs due to encryption overhead, audit logging volume, dedicated compliance environments, and geographically restricted redundancy requirements.

Follow-up Questions:

  • How do you handle a disaster recovery event that spans regulatory jurisdictions (e.g., failing over European user data to a US region)?
  • What documentation do you produce during a disaster for post-incident compliance review?
  • How do you balance the speed of recovery with the requirement to maintain access controls and audit trails?

Common Mistakes in Disaster Recovery Interviews

  1. Focusing only on infrastructure failures while ignoring data corruption, security breaches, and human error. Real disasters are more often caused by a bad deployment, an accidental deletion, or a compromised credential than by hardware failure. Your answer must address logical failures, not just physical ones.

  2. Proposing solutions without discussing cost trade-offs. Every disaster recovery mechanism has a cost — both in infrastructure spend and operational complexity. Stating that everything should have zero RTO and zero RPO without acknowledging the cost implications signals inexperience. Senior engineers must demonstrate cost awareness.

  3. Ignoring the human element of disaster recovery. The most elegant automated failover system is useless if the team does not know it exists, cannot interpret its alerts, or does not trust it enough to let it operate. Discuss runbooks, training, drills, and communication plans alongside technical architecture.

  4. Treating disaster recovery as a one-time project rather than a continuous practice. Systems change constantly — new services are added, data volumes grow, dependencies shift. A disaster recovery plan that was valid six months ago may have critical gaps today. Emphasize continuous validation and evolution.

  5. Over-engineering for extremely unlikely scenarios while leaving common failure modes unaddressed. Designing multi-cloud active-active for a hypothetical simultaneous outage of all AWS regions while not having tested single-AZ failover demonstrates misaligned priorities. Address common failures first, then layer protection against progressively less likely events.

How to Prepare for Disaster Recovery Interview Questions

Start by building a mental model of the failure spectrum: from single-process crashes (recovered in milliseconds by auto-restart) through node failures (recovered in seconds by load balancer health checks) through AZ failures (recovered in minutes by multi-AZ deployment) through region failures (recovered in minutes to hours by multi-region architecture) through provider failures (recovered in hours to days by multi-cloud preparation). Each level has different mechanisms, costs, and testing approaches.

Study real-world postmortems from companies like Google, Amazon, and Cloudflare. Understand not just what failed but how it was detected, how recovery was executed, and what was learned. These case studies provide concrete examples you can reference in interviews.

Practice articulating RTO/RPO decisions for different data types and services. Be prepared to justify why one service deserves 30-second RTO while another can tolerate 4 hours. The justification should always tie back to business impact — revenue per minute of downtime, user trust implications, regulatory penalties.

Familiarize yourself with the disaster recovery capabilities of major cloud providers (AWS, GCP, Azure) and major databases (PostgreSQL, MySQL, DynamoDB, Cassandra). Understanding the tools available allows you to propose concrete implementations rather than abstract concepts.

Explore our distributed systems guide for foundational concepts and the learning paths for structured preparation. Understanding replication, sharding, and the CAP theorem provides the theoretical foundation for disaster recovery architecture.
