
High Availability Interview Questions for Senior Engineers (2026)

Essential high availability interview questions with structured answer frameworks covering redundancy, failover, SLAs, disaster recovery, and resilience patterns used at Google, Netflix, and AWS.

20 min read · Updated Apr 20, 2026
interview-questions · high-availability · senior-engineer · distributed-systems · reliability

Why High Availability Matters in Senior Engineering Interviews

High availability is the defining challenge that separates hobby projects from production systems serving millions of users. When an interviewer asks about high availability, they are testing whether you can design systems that continue operating correctly despite hardware failures, software bugs, network partitions, and human errors. For senior engineering candidates, this is not theoretical knowledge but a daily operational reality.

At companies like Google, Netflix, and Amazon, systems are expected to maintain 99.99 percent availability or higher, which translates to less than 53 minutes of downtime per year. Achieving this level of reliability requires intentional architectural choices at every layer: redundancy to handle component failures, monitoring to detect issues before users notice, automated failover to recover without human intervention, and graceful degradation to maintain partial functionality during severe incidents.

Interviewers want to see that you can reason quantitatively about availability (calculating compound availability from component reliabilities), make practical trade-offs between availability, consistency, and cost, and apply patterns from real-world systems to new problems. The questions below cover the full spectrum from foundational SLA math through distributed consensus to operational practices. For broader preparation context, see our system design interview guide and explore structured learning paths for senior reliability engineers.

1. Explain the relationship between availability percentages and actual downtime. How do you calculate compound system availability?

What the interviewer is really asking: Can you do the quantitative reasoning that underlies all availability engineering, and do you understand that system availability is worse than its least available component?

Answer framework:

Availability is expressed as a percentage of time the system is operational. The industry uses the concept of nines: 99.9 percent (three nines) means 8.76 hours of downtime per year or 43.8 minutes per month. 99.99 percent (four nines) means 52.6 minutes per year or 4.38 minutes per month. 99.999 percent (five nines) means 5.26 minutes per year or 26.3 seconds per month.

For systems with components in series (all must work for the system to function), compound availability is the product. If your web server has 99.99 percent availability and your database has 99.95 percent availability, the system availability is 0.9999 * 0.9995 = 0.9994 (99.94 percent). Each additional serial component reduces overall availability. A system with 10 components each at 99.99 percent has compound availability of 0.9999^10 = 99.9 percent. This is why microservice architectures with long call chains require extremely high per-service availability.

For systems with components in parallel (any one working means the system works), availability is 1 - (probability all fail). Two servers each at 99 percent availability give 1 - (0.01 * 0.01) = 99.99 percent when running in active-active parallel. This is the fundamental argument for redundancy: two mediocre components in parallel outperform one excellent component.

Real systems combine serial and parallel components. Calculate bottom-up: first compute the availability of each redundant group (parallel), then multiply across the serial chain. For example, a system with a redundant load balancer pair (each 99.9 percent), three application servers (each 99.9 percent, any one sufficient), and a primary-replica database pair (each 99.9 percent) has availability of (1-0.001^2) * (1-0.001^3) * (1-0.001^2) = approximately 99.9998 percent.
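
To make the arithmetic concrete, here is a minimal sketch (Python, with illustrative component values) that computes serial and parallel availability exactly as described above.

```python
def serial(*availabilities):
    """Availability of components in series: all must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities):
    """Availability of redundant components: at least one must be up."""
    p_all_fail = 1.0
    for a in availabilities:
        p_all_fail *= (1.0 - a)
    return 1.0 - p_all_fail


# Redundant load balancer pair, three app servers (any one sufficient),
# and a primary-replica database pair, each component at 99.9 percent.
lb = parallel(0.999, 0.999)
app = parallel(0.999, 0.999, 0.999)
db = parallel(0.999, 0.999)
print(f"System availability: {serial(lb, app, db):.6%}")  # ~99.9998%
```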

Discuss the business context: moving from three nines to four nines is not simply 10x better. It requires fundamentally different architecture, operational practices, and investment. The cost curve is exponential. Align availability targets with business impact: a social media feed at three nines is fine (users refresh), but a payment system at three nines loses transactions and customer trust.

Address the measurement challenge: how do you measure actual availability? Define what constitutes being up (serving requests with latency under X and error rate under Y) and measure continuously with synthetic probes. Distinguish between planned downtime and unplanned outages in SLA calculations.

Follow-up questions:

  • How does the CAP theorem constrain your availability targets during network partitions?
  • What is the difference between availability and durability, and why does it matter?
  • How would you set availability targets for a new service with no historical data?

2. Design a highly available database system that can survive the loss of an entire availability zone.

What the interviewer is really asking: Do you understand multi-AZ replication architectures, the consistency trade-offs they introduce, and the operational complexity of cross-zone failover?

Answer framework:

The requirement is that the database remains fully operational when an entire data center (availability zone) fails. This requires data to be synchronously or asynchronously replicated across multiple AZs and a failover mechanism to redirect traffic.

For the replication topology, deploy a primary in AZ-1 and replicas in AZ-2 and AZ-3. With fully synchronous replication, a write is not acknowledged until it is persisted in every replica AZ; this guarantees zero data loss (RPO = 0) during an AZ failure but adds write latency (cross-AZ latency is typically 1-3ms within a region). Semi-synchronous replication is the usual compromise: wait for acknowledgment from at least one replica, so every write is durable in at least two AZs and commit latency is bounded by the fastest cross-AZ round trip.

For failover detection, implement multiple health check mechanisms. Application-level health checks test query execution (not just TCP connectivity). Use a consensus-based failure detection: multiple monitoring nodes must agree that the primary has failed before triggering failover. This prevents split-brain scenarios where a network glitch causes both the primary and a replica to accept writes simultaneously.

For automatic failover, use a leader election mechanism. When the primary is detected as failed, the remaining replicas participate in an election. The replica with the most up-to-date data wins (check the replication lag). The elected replica promotes itself to primary, and the connection routing layer (DNS, proxy, or service mesh) redirects traffic to the new primary. Target failover time: under 30 seconds for database infrastructure.
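
A minimal sketch of the promotion decision is shown below, assuming each replica reports its last applied log position and that a quorum of observers has already agreed the primary is down; the class and field names are illustrative, not any particular database's API.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    last_applied_lsn: int  # position in the primary's write-ahead log
    reachable: bool


def choose_new_primary(replicas, quorum_agrees_primary_down: bool):
    """Promote the reachable replica with the most up-to-date data."""
    if not quorum_agrees_primary_down:
        return None  # never promote on a single observer's view (split-brain risk)
    candidates = [r for r in replicas if r.reachable]
    return max(candidates, key=lambda r: r.last_applied_lsn) if candidates else None


replicas = [
    Replica("az2-replica", last_applied_lsn=1_042_117, reachable=True),
    Replica("az3-replica", last_applied_lsn=1_042_090, reachable=True),
]
print(choose_new_primary(replicas, quorum_agrees_primary_down=True).name)  # az2-replica
```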

Discuss the CAP theorem trade-off explicitly: during the failover window (primary down, new primary not yet elected), the system is unavailable for writes. You cannot achieve both perfect consistency and perfect availability during a partition. Most production systems choose consistency (refuse writes during failover) over availability (accept writes that might conflict).

For the routing layer, use a database proxy (like PgBouncer, ProxySQL, or a custom proxy) that health-checks the primary and automatically routes to the new primary after failover. Alternatively, use DNS failover with low TTLs (30 seconds), but note that many clients cache DNS beyond the TTL.

Address data consistency after failover: if using asynchronous replication, the new primary might be slightly behind. Transactions committed on the old primary but not yet replicated are lost. This is the RPO (Recovery Point Objective) trade-off. Document this for stakeholders. For financial systems, this data loss is unacceptable, so synchronous replication is mandatory despite the latency cost.

Follow-up questions:

  • How do you handle the old primary coming back online after a new primary has been elected?
  • What is the impact of synchronous replication on write throughput during cross-AZ network degradation?
  • How would you extend this design to survive an entire region failure?

3. How do you design a system that degrades gracefully under load rather than failing catastrophically?

What the interviewer is really asking: Do you understand load shedding, graceful degradation, and the operational patterns that prevent cascading failures?

Answer framework:

Catastrophic failure occurs when a system under stress enters a death spiral: increased load causes increased latency, which causes increased retries, which causes more load. Graceful degradation means the system deliberately reduces functionality to maintain core capabilities.

Implement a degradation hierarchy with predefined levels. Level 0 (normal) is full functionality. Level 1 (elevated load) disables non-essential features like recommendation panels, real-time analytics, and decorative API calls. Level 2 (high load) serves cached or simplified responses, reduces personalization, and limits search results. Level 3 (critical) serves only essential functionality (authentication, core reads and writes) with static fallbacks for everything else.

The triggering mechanism: monitor system saturation signals (CPU, memory, queue depths, latency percentiles, error rates) and automatically transition between degradation levels. Use hysteresis: escalate quickly (one violation triggers escalation) but de-escalate slowly (require sustained health for 5 minutes before reducing the level). This prevents oscillation.

Implement load shedding at the application layer. When the system is above capacity, reject a percentage of requests rather than attempting to serve all of them poorly. Use priority-based shedding: assign each request type a priority (authentication is highest, analytics tracking is lowest). During overload, shed lowest-priority requests first. This is related to how load balancing works in distributing traffic to healthy backends.
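
A sketch of priority-based shedding follows, assuming each request type has a fixed priority and that utilization is measured elsewhere; the priorities and thresholds are illustrative.

```python
# Lower number = higher priority; shed from the bottom up as utilization rises.
PRIORITY = {"auth": 0, "checkout": 1, "search": 2, "recommendations": 3, "analytics": 4}

# Illustrative policy: at 80% utilization shed priority >= 4, at 85% shed >= 3, and so on.
SHED_POLICY = [(0.95, 1), (0.90, 2), (0.85, 3), (0.80, 4)]


def should_admit(request_type: str, utilization: float) -> bool:
    priority = PRIORITY.get(request_type, 3)
    for threshold, shed_at_or_above in SHED_POLICY:
        if utilization >= threshold:
            return priority < shed_at_or_above
    return True  # below all thresholds: admit everything


print(should_admit("analytics", 0.82))  # False: lowest-priority traffic is shed first
print(should_admit("auth", 0.97))       # True: core traffic is protected
```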

For queue-based systems, implement deadline-based processing. Each request carries a deadline (the maximum time the client is willing to wait). If a request has been in the queue longer than its deadline, drop it rather than processing it (the client has already timed out and moved on). Processing stale requests wastes resources that could serve fresh requests.

Discuss the circuit breaker pattern for downstream dependencies. When a downstream service is failing, stop calling it rather than accumulating timeouts. The circuit breaker transitions: closed (normal), open (failing, return fallback immediately), half-open (test with one request). This prevents one failing service from consuming all your connection pool and thread resources.

Address the human element: graceful degradation requires organizational alignment. Product managers must pre-approve what functionality can be disabled. Engineers must implement feature flags for every degradation level. Operations must have runbooks for manual escalation and de-escalation. Practice degradation regularly (like Netflix's Chaos Monkey approach).

Follow-up questions:

  • How do you test graceful degradation without impacting production users?
  • What is the relationship between graceful degradation and the bulkhead pattern?
  • How would you implement graceful degradation in a system with strict SLA commitments to different customer tiers?

4. Explain the differences between active-active and active-passive high availability configurations. When would you use each?

What the interviewer is really asking: Do you understand the fundamental HA topologies and can you make practical decisions about which to apply based on specific requirements?

Answer framework:

Active-passive (also called primary-standby): one instance handles all traffic while one or more standbys remain idle, ready to take over if the primary fails. The standby continuously replicates state from the primary but serves no user traffic. Failover time depends on detection speed and promotion time (typically 10-60 seconds).

Active-active: all instances simultaneously handle traffic. There is no failover in the traditional sense. If one instance fails, the remaining instances absorb its traffic with no interruption (other than a brief spike in load on surviving instances). This provides both higher availability (no failover window) and better resource utilization (all instances serve traffic).

Use active-passive when strong consistency is required and the workload does not easily partition. Examples: relational databases with ACID requirements (only one writer prevents conflicts), systems with global state that cannot be partitioned (license servers, coordination services), and workloads where the cost of running a second active instance exceeds the cost of brief downtime.

Use active-active when you need zero-downtime failover, when the workload can be partitioned or replicated, and when you want to maximize resource efficiency. Examples: stateless web servers behind a load balancer, read-heavy database workloads (multiple replicas serving reads), content delivery with multiple edge nodes, and services where eventual consistency is acceptable.

For active-active with state, the core challenge is conflict resolution. If two instances accept concurrent writes to the same data, how do you resolve conflicts? Approaches: partition the data so each instance owns a non-overlapping subset (no conflicts possible), use last-writer-wins with vector clocks (simple but lossy), use CRDTs for automatic conflict-free merging (complex but correct), or use distributed consensus for strong consistency (correct but adds latency).
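
As one concrete example of conflict-free merging, below is a minimal grow-only counter (G-counter) CRDT: each active instance increments only its own slot, and merges take the per-slot maximum, so concurrent updates never conflict. This is an illustrative sketch, not a production library.

```python
class GCounter:
    """Grow-only counter CRDT: safe for concurrent increments in active-active setups."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts = {}  # node_id -> count contributed by that node

    def increment(self, n: int = 1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def merge(self, other: "GCounter"):
        # Per-node maximum is commutative, associative, and idempotent,
        # so replicas converge regardless of merge order or repetition.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())


a, b = GCounter("region-a"), GCounter("region-b")
a.increment(3); b.increment(5)  # concurrent writes in two regions
a.merge(b); b.merge(a)
assert a.value() == b.value() == 8
```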

Discuss the hybrid approach (common in practice): active-active for stateless services and read traffic, active-passive for the authoritative write path. This gives you fast reads from multiple locations with consistent writes through a single leader.

Address the cost dimension: active-passive wastes resources (the standby is idle, sometimes called warm standby). Mitigate by using the standby for read traffic (active for reads, passive for writes), running batch jobs on the standby, or using smaller standby instances with auto-scaling triggered on promotion.

Follow-up questions:

  • How do you handle the split-brain problem in active-passive systems?
  • What is the operational complexity difference between managing active-active versus active-passive?
  • How does the choice between active-active and active-passive affect your disaster recovery strategy?

5. How would you design a health checking and failure detection system for a microservices architecture?

What the interviewer is really asking: Can you design a failure detection system that is fast enough to minimize impact but careful enough to avoid false positives that cause unnecessary failovers?

Answer framework:

Failure detection is the foundation of high availability: you cannot recover from a failure you have not detected. The challenge is the trade-off between detection speed (minimize time-to-detect) and accuracy (minimize false positives that trigger unnecessary failovers).

Implement multi-layer health checking. Layer 1 is infrastructure health (can the host reach the network, is the process running, is the port listening). These are necessary but insufficient: a process can be listening but returning errors. Layer 2 is application health (can the service handle requests, are its dependencies accessible, is latency within bounds). Implement a /health endpoint that checks database connectivity, cache connectivity, and performs a synthetic transaction. Layer 3 is business health (is the service producing correct results, are business metrics within expected ranges). Check output quality, not just availability.

For the health check protocol, use both pull (external checker periodically probes the service) and push (service periodically reports its own health). Pull is simpler to implement and works for services that are completely crashed. Push allows richer health information and catches issues earlier (the service can report degradation before complete failure).

For false positive mitigation, use progressive failure detection. A single failed health check should not trigger failover. Implement a failure threshold: require N consecutive failures (typically 3-5) before declaring an instance unhealthy. Use different thresholds for different actions: 3 failures to remove from load balancer rotation, 10 failures to trigger a restart, 30 failures to page an engineer.

Discuss the Phi Accrual failure detector algorithm (used by Akka and Cassandra): instead of a binary healthy/unhealthy decision, it outputs a suspicion level (phi value) based on the statistical distribution of heartbeat intervals. If the current interval is many standard deviations above the mean, the suspicion level is high. This adapts automatically to network conditions without manual threshold tuning.
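
The sketch below captures the idea in simplified form, assuming heartbeat intervals are roughly normally distributed (implementations such as Cassandra's use an exponential model); the class and constants are illustrative.

```python
import math
import random
from collections import deque


class PhiAccrualDetector:
    def __init__(self, window: int = 100):
        self.intervals = deque(maxlen=window)  # recent heartbeat intervals (seconds)
        self.last_heartbeat = None

    def heartbeat(self, now: float):
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        """Suspicion level: higher means the node is more likely to be down."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        var = sum((x - mean) ** 2 for x in self.intervals) / len(self.intervals)
        std = max(math.sqrt(var), 1e-3)
        elapsed = now - self.last_heartbeat
        # Probability that a healthy node would still be silent after `elapsed`.
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        return -math.log10(max(p_later, 1e-15))


random.seed(42)
detector, t = PhiAccrualDetector(), 0.0
for _ in range(100):
    detector.heartbeat(t)
    t += random.uniform(0.8, 1.2)  # heartbeats roughly every second, with jitter
last = detector.last_heartbeat
print(detector.phi(last + 1.0))  # small: a heartbeat this overdue is normal
print(detector.phi(last + 5.0))  # large: several deviations late, suspect failure
```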

For distributed failure detection, individual health checks from a single monitor can fail due to network issues between the monitor and the target (not a target failure). Use multi-observer consensus: deploy health checkers in multiple locations, and only declare failure when a majority agree. This is the approach used by global monitoring services and relates to distributed consensus principles.

Address cascading detection: when a database becomes unhealthy, all services depending on it will also fail health checks. Implement dependency-aware health checking that distinguishes between direct service failure (the service itself is broken) and transitive failure (the service is healthy but a dependency is not). Report both but trigger different remediation actions.

Follow-up questions:

  • How do you handle health check endpoints that are healthy but the service is actually serving errors to real traffic?
  • What is the optimal health check interval and how do you determine it?
  • How would you implement health checking for a service that has long-running requests (minutes) and cannot respond to health checks during processing?

6. Design a system that can perform zero-downtime deployments while maintaining high availability.

What the interviewer is really asking: Do you understand deployment strategies that maintain availability during code changes, which is often the highest-risk period for outages?

Answer framework:

Deployments and configuration changes are among the most common causes of outages, even at well-run companies. A zero-downtime deployment strategy must handle: draining active connections, database schema changes, backward compatibility, and rollback.

Rolling deployment: update instances one at a time (or in small batches). At any point during deployment, some instances run the old version and some run the new version. Requirements: the new version must be backward-compatible with the old version (both versions serve traffic simultaneously), the load balancer must drain connections from an instance before updating it (send a signal, wait for active requests to complete, then stop the instance), and health checks must pass on the new version before proceeding to the next batch.

Blue-green deployment: maintain two identical production environments (blue and green). Deploy the new version to the inactive environment, run tests against it, then switch traffic from the active to the newly-deployed environment (typically via DNS or load balancer configuration). Rollback is instant: switch traffic back. The cost is maintaining two full environments (2x infrastructure cost during deployment).

Canary deployment: deploy the new version to a small subset of instances (1-5 percent of traffic). Monitor error rates, latency, and business metrics for a bake period (15-60 minutes). If metrics are healthy, progressively increase traffic to the new version (10 percent, 25 percent, 50 percent, 100 percent). If any stage shows degradation, automatically rollback. This limits the blast radius of a bad deployment.
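
A sketch of a canary controller loop is shown below; the traffic-shifting and metric-reading hooks are hypothetical callables supplied by the deployment platform, and the stages, bake time, and error-rate tolerance are illustrative.

```python
import time

CANARY_STAGES = [1, 5, 10, 25, 50, 100]  # percent of traffic on the new version
BAKE_SECONDS = 15 * 60                   # bake period per stage


def canary_rollout(set_traffic_percent, read_error_rate, baseline_error_rate,
                   max_error_delta=0.001):
    """Progressively shift traffic to the new version, rolling back on regression."""
    for percent in CANARY_STAGES:
        set_traffic_percent(percent)
        time.sleep(BAKE_SECONDS)  # let error-rate and latency metrics accumulate
        if read_error_rate() > baseline_error_rate + max_error_delta:
            set_traffic_percent(0)  # automatic rollback: all traffic to the old version
            return False
    return True  # fully rolled out
```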

For database schema changes, use the expand-contract pattern. Phase 1 (expand): add new columns or tables without removing old ones. The new code writes to both old and new schemas. Phase 2 (migrate): backfill existing data to the new schema. Phase 3 (contract): once all code uses the new schema, remove old columns. Each phase is a separate deployment. Never make a breaking schema change in a single step.

For stateful services, discuss the additional complexity: if a service holds in-memory state (caches, sessions, connection pools), draining means waiting for that state to be naturally evacuated or explicitly migrating it. Use external session stores so services can be restarted without losing user sessions.

Relate to the broader availability story: the best HA architecture is useless if every deployment causes a 30-second disruption. Deployment safety is availability. Automate rollback triggers, implement deployment circuit breakers (if error rate increases by X percent during rollout, halt automatically), and practice deployments frequently to reduce their risk. Companies that deploy daily have fewer outages than companies that deploy monthly because each deployment is smaller and better understood.

Follow-up questions:

  • How do you handle deployments that require incompatible API changes?
  • What is the maximum safe batch size for a rolling deployment and how do you determine it?
  • How would you implement zero-downtime deployment for a stateful service like a database?

7. What is the split-brain problem and how do you prevent it in a highly available system?

What the interviewer is really asking: Do you understand the most dangerous failure mode in replicated systems and the mechanisms (fencing, quorums, consensus) that prevent it?

Answer framework:

Split-brain occurs when a network partition causes two or more groups of nodes to independently believe they are the active primary. Both groups accept writes, creating divergent state that is extremely difficult (sometimes impossible) to reconcile. In a database context, split-brain can cause data corruption. In a distributed lock service, it can cause mutual exclusion violations.

Example scenario: a primary database in AZ-1 and a standby in AZ-2. The network link between AZs fails. The standby's health checks against the primary fail, so it promotes itself to primary. Now both AZs have a primary accepting writes. When the partition heals, the data has diverged.

Prevention mechanism 1 is quorum-based decisions. Require a majority (quorum) of nodes to agree before any action. In a 3-node cluster, a quorum is 2. During a partition, only the side with the majority can continue operating. The minority side must refuse writes. This is the approach used by Raft, Paxos, and ZooKeeper. It guarantees that at most one partition can accept writes (because there is only one majority).

Prevention mechanism 2 is fencing (STONITH: Shoot The Other Node In The Head). When a node is promoted, it must first ensure the old primary cannot accept writes. Mechanisms include revoking the old primary's storage access (SAN-level fencing), revoking its network access (switch-level fencing), or sending it a shutdown command. Only after the old primary is confirmed dead does the new primary start accepting writes.

Prevention mechanism 3 is generation numbers (epochs, terms). Each primary is assigned a monotonically increasing generation number. Writes include the generation number. Storage backends reject writes from a stale generation. When a new primary is elected with generation 5, any lingering write from generation 4 is rejected by the storage layer. This is how Raft prevents stale leaders from corrupting data.
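
A minimal sketch of generation-number fencing at the storage layer is shown below; the class is illustrative and omits persistence and concurrency control.

```python
class FencedStore:
    """Storage layer that rejects writes carrying a stale generation number."""

    def __init__(self):
        self.highest_generation_seen = 0
        self.data = {}

    def write(self, key, value, generation: int) -> bool:
        if generation < self.highest_generation_seen:
            return False  # write from a stale leader (older epoch): reject
        self.highest_generation_seen = generation
        self.data[key] = value
        return True


store = FencedStore()
store.write("balance", 100, generation=4)                  # old primary, epoch 4
store.write("balance", 250, generation=5)                  # newly elected primary, epoch 5
assert store.write("balance", 175, generation=4) is False  # lingering stale write rejected
assert store.data["balance"] == 250
```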

Prevention mechanism 4 is an external arbitrator (witness node). Place a lightweight witness node in a third AZ that does not hold data but participates in leader election votes. During a two-way partition, the side that can communicate with the witness has a majority (2 of 3) and continues. The side without the witness voluntarily steps down.

Discuss the relationship with the CAP theorem: split-brain prevention enforces the CP choice. During a partition, you sacrifice availability (the minority partition is unavailable) to prevent inconsistency. Systems that choose AP (like some eventually consistent databases) allow split-brain by design and handle it through conflict resolution after the partition heals.

Follow-up questions:

  • What happens if your fencing mechanism itself has a bug and fails to stop the old primary?
  • How does split-brain apply to services that are stateless?
  • Can you have split-brain with more than two partitions, and how does quorum handle it?

8. How do you design a globally distributed system that maintains high availability across multiple regions?

What the interviewer is really asking: Can you architect a system that survives entire region outages while managing the latency, consistency, and operational complexity of multi-region deployment?

Answer framework:

Multi-region high availability means the system continues operating normally when an entire cloud region (with all its availability zones) becomes unavailable. This is the highest level of infrastructure resilience and introduces unique challenges around data replication, traffic routing, and consistency.

For architecture, deploy the full application stack in at least two (preferably three) geographic regions. Use an active-active multi-region topology where all regions serve traffic simultaneously. Each region should be self-sufficient: it can serve requests without cross-region calls in the critical path. Cross-region communication happens asynchronously for data replication.

For data replication, the fundamental trade-off is latency versus consistency. Synchronous cross-region replication adds 50-200ms (intercontinental round-trip) to every write, which is unacceptable for most applications. Asynchronous replication adds zero latency but creates a window where regions have different data (the replication lag). Most production systems use asynchronous replication with eventual consistency.

For handling the consistency challenge, partition data by ownership. Each piece of data has a home region where writes are authoritative. Reads can be served from any region (possibly stale by the replication lag). For user-specific data, the home region is the region closest to the user. For global shared data, designate one region as the authoritative source. This is similar to how Netflix distributes streaming data globally.
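
A tiny routing sketch illustrates the home-region idea: writes are forwarded to the record's authoritative region while reads stay local and accept bounded staleness. The ownership map and region names are illustrative.

```python
HOME_REGION = {  # illustrative ownership map: each record has one authoritative home region
    "user:1001": "us-east-1",
    "user:2002": "eu-west-1",
}


def route_write(record_key: str, local_region: str):
    """Writes go to the record's home region; forwarding stays off the read path."""
    home = HOME_REGION.get(record_key, "us-east-1")
    return ("write-local", home) if home == local_region else ("forward-write", home)


def route_read(record_key: str, local_region: str):
    """Reads are served from the local replica, possibly stale by the replication lag."""
    return ("read-local", local_region)


print(route_write("user:2002", local_region="us-east-1"))  # ('forward-write', 'eu-west-1')
print(route_read("user:2002", local_region="us-east-1"))   # ('read-local', 'us-east-1')
```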

For traffic routing, use global load balancing (DNS-based with health checks, or anycast). When a region fails, DNS health checks detect the failure and stop routing traffic to it. Remaining regions absorb the redistributed traffic. Design each region with sufficient spare capacity (at least 50 percent headroom) to absorb a failed region's traffic without overloading.

For failover testing, regularly perform full region failovers in production. Route all traffic away from one region, verify the system operates normally, then restore. This validates that failover actually works (untested failover is unreliable failover). Netflix performs region evacuations weekly.

Address the stateful services challenge: services with local state (caches, queues) need special handling during failover. Caches will be cold in the receiving region (temporary latency increase). Message queues need cross-region replication or at-least-once delivery guarantees. Implement request replay for messages that were in-flight during the failover.

Discuss cost optimization: running full capacity in three regions means 3x infrastructure cost if each region must handle full load independently. Use auto-scaling to keep warm capacity at 50 percent in each region, with auto-scale policies that trigger during failover to rapidly add capacity in the receiving regions.

Follow-up questions:

  • How do you handle a write conflict when the same record is updated in two regions simultaneously?
  • What is the minimum number of regions needed for high availability and why?
  • How do you test multi-region failover without impacting customer experience?

9. Describe how you would implement a circuit breaker pattern and explain its role in maintaining system availability.

What the interviewer is really asking: Do you understand the cascade failure problem and can you implement the primary defense mechanism against it?

Answer framework:

The circuit breaker pattern prevents a failing dependency from consuming all resources of its callers, which would cascade the failure to their callers, eventually bringing down the entire system. It is named after electrical circuit breakers: when current exceeds a threshold, the breaker trips open to prevent damage.

The state machine has three states. Closed (normal operation): all requests pass through to the downstream service. The circuit breaker monitors failures. If the failure count exceeds a threshold within a time window (for example, 10 failures in 30 seconds), the breaker transitions to open. Open (failing fast): all requests immediately return an error or fallback response without calling the downstream service. This eliminates timeout waits and frees threads and connections. After a configurable timeout (for example, 60 seconds), the breaker transitions to half-open. Half-open (testing recovery): a single request (the probe) is allowed through to the downstream service. If it succeeds, the breaker transitions back to closed. If it fails, the breaker returns to open with a reset timer.
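
A minimal single-threaded sketch of this state machine follows; the thresholds are illustrative, and a production implementation would add thread safety, a sliding failure window per error class, and metrics.

```python
import time


class CircuitBreaker:
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

    def __init__(self, failure_threshold=10, window_seconds=30, open_seconds=60):
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.open_seconds = open_seconds
        self.state = self.CLOSED
        self.failures = []  # timestamps of recent failures
        self.opened_at = None

    def call(self, fn, fallback):
        now = time.monotonic()
        if self.state == self.OPEN:
            if now - self.opened_at >= self.open_seconds:
                self.state = self.HALF_OPEN  # allow one probe through
            else:
                return fallback()            # fail fast, no downstream call
        try:
            result = fn()
        except Exception:
            self._record_failure(now)
            return fallback()
        if self.state == self.HALF_OPEN:     # successful probe: close the breaker
            self.state = self.CLOSED
            self.failures.clear()
        return result

    def _record_failure(self, now):
        if self.state == self.HALF_OPEN:
            self.state = self.OPEN           # probe failed: stay open, reset the timer
            self.opened_at = now
            return
        self.failures = [t for t in self.failures if now - t <= self.window_seconds]
        self.failures.append(now)
        if len(self.failures) >= self.failure_threshold:
            self.state = self.OPEN
            self.opened_at = now
```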

For implementation, define what constitutes a failure. Not all errors should trip the breaker: HTTP 4xx errors are client errors, not service failures. Only count 5xx errors, timeouts, and connection-refused errors. Also consider latency: if responses are coming back but taking 10 seconds, the service is effectively down. Include a slow-call threshold that counts responses slower than a fixed latency limit (for example, several times the service's normal P99) as failures.

For the fallback strategy when the circuit is open, options include returning cached data (if available and not too stale), returning a default or degraded response (a recommendation service fallback is to show popular items instead of personalized ones), queuing the request for later retry (for non-time-sensitive operations), and propagating the error clearly to the caller (for operations that cannot degrade).

Discuss configuration tuning: the failure threshold determines sensitivity (too low causes unnecessary trips, too high allows cascade damage before tripping). The open timeout determines recovery speed (too short causes rapid oscillation if the downstream is still unhealthy, too long delays recovery). The half-open probe count determines confidence in recovery.

Address the operational dimension: circuit breaker state changes are critical operational events. Alert when a breaker opens (a dependency is failing), log state transitions for post-incident analysis, and display breaker states on operational dashboards. Track the percentage of time each breaker spends in open state as a reliability metric.

Discuss the relationship with retries and timeouts: the circuit breaker works with (not instead of) retry logic. Set timeouts short enough that they do not consume resources for extended periods. Set retry counts low (1-2 retries maximum). The circuit breaker then catches the case where retries are futile because the downstream is consistently failing.

Follow-up questions:

  • How do you determine the optimal failure threshold for a circuit breaker?
  • What is the difference between a circuit breaker and a bulkhead and when would you use each?
  • How would you implement a circuit breaker for an asynchronous message-based system?

10. How do you design an SLA and SLO framework for a complex distributed system with multiple dependencies?

What the interviewer is really asking: Can you translate business availability requirements into measurable engineering objectives and build a monitoring framework that tracks them?

Answer framework:

Distinguish between SLA, SLO, and SLI. The Service Level Agreement (SLA) is a contractual commitment to customers with financial penalties for violation. The Service Level Objective (SLO) is an internal target, typically more stringent than the SLA (if your SLA is 99.9 percent, your SLO might be 99.95 percent, giving you a safety margin). The Service Level Indicator (SLI) is the actual measurement that determines whether you are meeting your objectives.

For defining SLIs, measure what users actually experience: availability (percentage of requests that succeed), latency (percentage of requests completing within a threshold), correctness (percentage of requests returning the right answer), and throughput (whether the system can handle the required request rate). Each SLI needs a precise measurement definition: what counts as a request, what counts as success, and where in the stack you measure.

For compound systems, derive the system SLO from component SLOs using the serial and parallel calculations from Question 1. If your system has a critical path through 5 services each with a 99.99 percent SLO, your system SLO cannot exceed 99.95 percent without redundancy. Work backward from the desired system SLO to determine required component SLOs.

Implement error budgets: if your SLO is 99.95 percent over 30 days, your error budget is 0.05 percent of total requests (approximately 21.6 minutes of downtime). Track budget consumption in real-time. When the budget is nearly exhausted, freeze deployments and focus on reliability work. When the budget is healthy, invest in feature velocity. This balances reliability with development speed.
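
The budget arithmetic is easy to express in code; the SLO, window, and request counts below are illustrative.

```python
SLO = 0.9995                   # 99.95 percent over a rolling 30-day window
WINDOW_MINUTES = 30 * 24 * 60

error_budget_fraction = 1 - SLO                # 0.0005 of requests may fail
print(WINDOW_MINUTES * error_budget_fraction)  # 21.6 minutes of full downtime


def budget_remaining(total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative means the budget is blown)."""
    allowed_failures = total_requests * error_budget_fraction
    return 1 - failed_requests / allowed_failures if allowed_failures else 1.0


print(budget_remaining(total_requests=500_000_000, failed_requests=100_000))  # 0.6
```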

For dependencies, set SLO requirements on your upstream dependencies and monitor them. If a dependency provides 99.9 percent availability but you need 99.99 percent from your system, you must implement redundancy or fallbacks for that dependency. Do not build a system that requires a dependency to exceed its own SLO.

Discuss the operational workflow: alert on SLO burn rate, not on instantaneous error rates. A spike in errors that consumes 1 percent of your monthly budget in 5 minutes should page an engineer. A low-level elevation that will consume the budget over 3 days should create a ticket. This prevents alert fatigue while catching both acute incidents and slow degradation.

For multi-tenant systems, consider per-customer SLOs. An enterprise customer paying 100x might have a 99.99 percent SLO while a free-tier user has 99.9 percent. Implement priority-based resource allocation to guarantee differential SLOs. This connects to pricing models and tier differentiation.

Follow-up questions:

  • How do you handle SLOs when you depend on a third-party service with no SLA?
  • What do you do when your SLO targets conflict with feature development velocity?
  • How would you implement SLO tracking for an eventually consistent system where correctness is hard to measure?

11. Explain how consensus algorithms contribute to high availability and what are the trade-offs.

What the interviewer is really asking: Do you understand the theoretical foundations of distributed coordination and can you explain how Raft or Paxos work in practice?

Answer framework:

Consensus algorithms enable a group of nodes to agree on a value (or a sequence of values) even when some nodes fail. This is fundamental to high availability because it enables replicated state machines: if all nodes start in the same state and process the same sequence of operations, they maintain identical state. When a node fails, others continue from the agreed-upon state.

Explain Raft as the most approachable consensus algorithm. A Raft cluster has a single leader that accepts writes and replicates them to followers. The leader sends heartbeats to followers. If a follower does not receive a heartbeat within an election timeout, it starts an election. To win, a candidate must receive votes from a majority. The leader commits a log entry only after a majority has acknowledged it. This guarantees that any committed entry exists on at least a majority of nodes.
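
A small sketch of the majority-commit rule follows: the leader may advance its commit index to the highest log entry stored on a majority of nodes. This is a simplification (real Raft additionally restricts commitment to entries from the leader's current term), and the values are illustrative.

```python
def committed_index(leader_match_index: int, follower_match_indexes: list) -> int:
    """Highest log index replicated on a majority of nodes (leader included)."""
    all_match = sorted([leader_match_index] + follower_match_indexes, reverse=True)
    majority = len(all_match) // 2 + 1
    return all_match[majority - 1]  # the majority-th highest index is on a majority of nodes


# 5-node cluster: the leader holds entry 120; followers have replicated up to these indexes.
print(committed_index(120, [118, 120, 90, 87]))  # 118: entries up to 118 are safe to commit
```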

The availability implications: a Raft cluster with N nodes tolerates (N-1)/2 failures. A 3-node cluster tolerates 1 failure, a 5-node cluster tolerates 2. During normal operation, the system is highly available. During a leader failure, the system is unavailable for the duration of the election (typically 150-300ms). This is the unavailability window of consensus-based systems.

Trade-offs of consensus for HA. Availability versus consistency: consensus guarantees consistency (linearizability) but sacrifices availability during leader elections and when a quorum is not reachable. Per the CAP theorem, you cannot have both during a network partition. Latency: every write requires a round-trip to a majority of nodes. For cross-region deployments, this adds significant latency (the write latency equals the round-trip to the closest quorum member). Throughput: the single leader is a throughput bottleneck. All writes flow through one node. Mitigate with batching and pipelining.

Discuss practical systems that use consensus: etcd (Kubernetes coordination), ZooKeeper (distributed locking and configuration), CockroachDB (distributed SQL), and Consul (service discovery). Each uses consensus for a specific purpose: metadata coordination, not bulk data storage (consensus does not scale to high write throughput).

For high availability patterns using consensus: use a 5-node cluster spread across 3 availability zones (2-2-1 distribution). This survives any single AZ failure (losing 1 or 2 nodes still leaves a quorum of 3). A 3-node deployment across 3 AZs (1-1-1) also tolerates 1 AZ failure. A 2-AZ deployment cannot tolerate the loss of whichever AZ holds the majority of nodes, so it cannot guarantee surviving an arbitrary AZ failure.

Address when not to use consensus: for data that can tolerate eventual consistency, consensus adds unnecessary latency and complexity. Use eventual consistency with conflict resolution for user data that is rarely concurrently modified. Reserve consensus for coordination tasks: leader election, distributed locks, and configuration changes.

Follow-up questions:

  • What happens to a Raft cluster when exactly half the nodes are on each side of a partition?
  • How does multi-Raft (sharding the consensus group) improve throughput?
  • What are the operational challenges of running a consensus cluster in production?

12. How do you implement effective chaos engineering to validate high availability claims?

What the interviewer is really asking: Do you go beyond theoretical HA design and actually test that your systems survive failures as expected?

Answer framework:

Chaos engineering is the practice of deliberately injecting failures into production systems to verify that they handle those failures correctly. The core principle is that if you believe your system is resilient to a failure mode, you should prove it by causing that failure rather than waiting for it to happen at 3 AM.

The chaos engineering workflow has five steps. Form a hypothesis (for example, if we kill one of three database replicas, the system will continue serving requests with no user-visible impact). Define steady state (metrics that indicate normal operation: request success rate above 99.9 percent, p99 latency under 200ms). Introduce the failure (kill the replica). Observe the impact (did metrics remain in steady state?). Conclude (confirm or deny the hypothesis, fix gaps).

Start with simple failure injections and progress to complex ones. Level 1: kill a single service instance (validates that load balancing and health checks work). Level 2: kill all instances in one availability zone (validates multi-AZ redundancy). Level 3: inject network latency or packet loss between services (validates timeout and retry behavior). Level 4: inject disk full or CPU saturation (validates resource exhaustion handling). Level 5: simulate entire region failure (validates multi-region failover).

For implementation, use established tools: AWS Fault Injection Simulator, Gremlin, Chaos Mesh (for Kubernetes), or custom scripts that interface with your infrastructure APIs. Implement a chaos controller that schedules experiments, monitors impact, and automatically rolls back if the blast radius exceeds expectations (safety net).

Discuss blast radius control: never inject failures into 100 percent of traffic on the first attempt. Start with 1 percent of traffic or a single canary instance. Define abort conditions: if the error rate exceeds X percent or latency exceeds Y ms, automatically terminate the experiment. Have a human operator monitoring during initial runs of any new experiment.
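
A sketch of an experiment runner with abort conditions is shown below; the fault-injection and metric hooks are hypothetical callables supplied by the chaos tooling, and the limits are illustrative.

```python
import time


def run_chaos_experiment(inject_fault, revert_fault, read_metrics,
                         max_error_rate=0.01, max_p99_ms=500,
                         duration_seconds=600, check_interval=10):
    """Run a fault-injection experiment, aborting if the blast radius exceeds limits."""
    inject_fault()
    try:
        deadline = time.monotonic() + duration_seconds
        while time.monotonic() < deadline:
            metrics = read_metrics()  # e.g. {"error_rate": 0.002, "p99_ms": 180}
            if metrics["error_rate"] > max_error_rate or metrics["p99_ms"] > max_p99_ms:
                return "aborted: steady state violated"
            time.sleep(check_interval)
        return "completed: hypothesis held"
    finally:
        revert_fault()  # always remove the injected fault, even on abort or error
```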

For organizational adoption, start with non-production environments to build confidence. Graduate to production during low-traffic periods. Eventually run chaos experiments continuously during business hours (this is where Netflix operates). Document every experiment and its findings. When experiments reveal weaknesses, fix them and re-run the experiment to confirm the fix.

Relate to HA validation: every redundancy mechanism in your architecture should have a corresponding chaos experiment that validates it. If you claim multi-AZ HA, run AZ failure experiments monthly. If you claim the system handles database failover, actually fail the database regularly. Untested redundancy is not redundancy. See our distributed systems guide for more on resilience testing practices.

Follow-up questions:

  • How do you convince stakeholders to allow deliberate failure injection in production?
  • What is the difference between chaos engineering and traditional failure testing?
  • How would you implement chaos engineering for a system that processes financial transactions?

13. Design a highly available message queue that guarantees at-least-once delivery.

What the interviewer is really asking: Can you design a system where messages are never lost despite producer failures, broker failures, and consumer failures?

Answer framework:

At-least-once delivery means every message that is successfully accepted by the queue will eventually be delivered to a consumer, even if this requires delivering it multiple times. The guarantee is: no message loss. The trade-off is: possible duplicate delivery (consumers must be idempotent).

For the producer side, guarantee that a message is durably stored before acknowledging the producer. Write the message to a replicated log (like Kafka's approach): the message is written to the leader partition and replicated to a configurable number of followers (in-sync replicas). Only acknowledge the producer after the required number of replicas have persisted the message. With a replication factor of 3 and acks=all, the message survives the loss of 2 brokers.

For the broker side, the message log is the source of truth. It is an append-only, immutable sequence of messages. Replicate across multiple brokers using a consensus protocol or synchronous replication. Each partition (a unit of parallelism) has a leader that handles all reads and writes, and followers that replicate. If the leader fails, a follower with all committed messages is elected as the new leader.

For the consumer side, track consumption progress using consumer offsets. A consumer reads a message, processes it, then commits the offset (marking the message as consumed). If the consumer crashes before committing the offset, the message will be re-delivered on restart (this is the at-least-once guarantee). Never commit the offset before processing is complete.
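
The consumer-side rule is easiest to see in code: commit the offset only after processing succeeds. The client interface below is hypothetical, and the processing function is assumed to be idempotent.

```python
def consume_loop(consumer, process):
    """At-least-once consumption: process first, commit the offset second.

    A crash between process() and commit() means the same message is delivered
    again after restart, which is exactly the at-least-once guarantee."""
    while True:
        message = consumer.poll()        # hypothetical client: returns None if no message
        if message is None:
            continue
        process(message)                 # side effects happen first
        consumer.commit(message.offset)  # only now is the message marked consumed
```

Committing before processing inverts the guarantee to at-most-once: no duplicates, but a crash mid-processing loses the message.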

For durability, write the log to persistent storage with fsync. Without fsync, the OS might cache writes in the page cache and a crash would lose data. The trade-off: fsync after every message adds latency (milliseconds). Fsync periodically (every 1000 messages or every second) trades a small durability window for better throughput. Production systems usually fsync at configurable intervals and rely on replication for protection during the fsync window.

Discuss the relationship with exactly-once delivery: true exactly-once requires either idempotent consumers (process the same message twice with no effect) or transactional delivery (atomic consume-and-produce). Kafka implements exactly-once through idempotent producers (deduplication on the broker side) and transactional consumers (commit offset and output atomically). This eliminates duplicates at the cost of throughput.

Address the dead letter queue pattern: if a message fails processing repeatedly (poison message), move it to a dead letter queue after N attempts rather than blocking the entire queue. Alert operators to investigate. This maintains availability by preventing one bad message from stalling all processing.

Follow-up questions:

  • How do you handle message ordering in a replicated queue during a failover?
  • What is the maximum message throughput achievable with strong durability guarantees?
  • How would you implement priority-based message delivery in a highly available queue?

14. How would you design a system to handle a sudden 100x traffic spike without downtime?

What the interviewer is really asking: Do you understand elastic scalability, traffic management, and the practical limits of auto-scaling?

Answer framework:

A 100x spike is beyond what auto-scaling alone can handle (cloud providers need minutes to provision instances, and databases cannot scale 100x in real-time). Surviving such a spike requires multiple defensive layers working together.

Layer 1 is traffic absorption at the edge. Use a CDN to serve cacheable content (static assets, API responses with appropriate cache headers). During a traffic spike caused by a viral event, most requests are reads for the same content. A CDN can absorb thousands-fold traffic increases for cached content with no origin impact. Ensure your load balancing tier can handle the connection count even if most requests are served from cache.

Layer 2 is rate limiting and admission control. Implement per-client rate limits so no single client can consume excessive resources. See how rate limiting works for implementation details. During extreme spikes, apply global rate limits that cap total admitted traffic at the system's maximum capacity. Queue excess requests rather than rejecting them immediately (with a timeout).

Layer 3 is auto-scaling with pre-warming. Configure auto-scaling with aggressive scale-up policies (add 50 percent capacity when CPU exceeds 60 percent) and conservative scale-down policies (remove capacity only after 15 minutes of low utilization). For expected spikes (marketing campaigns, product launches), pre-scale in advance. For unexpected spikes, the auto-scaler provides capacity within 3-5 minutes for stateless services.

Layer 4 is graceful degradation. When the spike exceeds auto-scaling capacity, degrade non-essential features. Disable expensive operations (search, recommendations, analytics), serve cached or simplified responses, and reduce page complexity. This maintains the core user experience while reducing per-request resource consumption.

Layer 5 is database protection. Databases are the hardest layer to scale quickly. Protect with connection pooling (cap the number of database connections regardless of application tier size), read replicas (redirect read traffic to replicas during spikes), and query result caching (cache expensive query results with short TTLs). For the Redis vs Memcached question, both work well as database result caches during traffic spikes.

Address the cold cache problem: a 100x spike on a fresh cache means 100x database load. Pre-warm caches for anticipated events. For unanticipated spikes, implement cache stampede prevention (only one request fetches from the database, others wait for the cache to be populated).
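
A minimal single-flight sketch is shown below using an in-process lock per key; in a distributed cache the same idea is implemented with a distributed lock or probabilistic early expiration. Names are illustrative.

```python
import threading

cache = {}                      # key -> value (illustrative in-process cache)
locks = {}                      # key -> lock guarding regeneration of that key
locks_guard = threading.Lock()


def get_with_single_flight(key, load_from_database):
    value = cache.get(key)
    if value is not None:
        return value
    with locks_guard:
        lock = locks.setdefault(key, threading.Lock())
    with lock:                  # only one request per key hits the database
        value = cache.get(key)  # another request may have filled it while we waited
        if value is None:
            value = load_from_database(key)
            cache[key] = value
        return value
```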

Discuss the economics: maintaining 100x capacity at all times is prohibitively expensive. The strategy is layers of defense that progressively engage as load increases. Most spikes are handled by the CDN and cache layers (near-zero marginal cost). Only sustained spikes that overflow the cache require actual compute scaling.

Follow-up questions:

  • What is the maximum scale factor you can handle purely through auto-scaling, and what are the bottlenecks?
  • How do you handle database connection exhaustion during a traffic spike?
  • How would you design for a predictable spike like a major sporting event versus an unpredictable viral event?

15. How do you design a disaster recovery strategy with defined RPO and RTO targets?

What the interviewer is really asking: Can you translate business continuity requirements into concrete technical architecture and operational procedures?

Answer framework:

Recovery Point Objective (RPO) is the maximum acceptable data loss measured in time. An RPO of 1 hour means you can lose up to 1 hour of data. Recovery Time Objective (RTO) is the maximum acceptable downtime. An RTO of 4 hours means the system must be operational within 4 hours of a disaster.

Map RPO and RTO to technical implementations. RPO equals zero (no data loss): requires synchronous replication to the DR site. Every write is confirmed at both primary and DR before acknowledging. Cost: increased write latency (cross-region round-trip) and complexity. RPO under 1 minute: asynchronous replication with frequent checkpointing. Replication lag is typically seconds. Cost: possible loss of last few seconds of writes. RPO of 1 to 24 hours: periodic backup to the DR site (full backups plus incremental). Cost: potential loss of data since last backup.

Map RTO to architecture. RTO under 1 minute: active-active multi-region with automatic failover. The DR site is already running and serving traffic. Failover is simply routing all traffic to the surviving region. RTO of 1 to 15 minutes: warm standby with automated promotion. The DR site has infrastructure running but not serving traffic. Automated scripts promote it and restore from replicated data. RTO of 1 to 4 hours: cold standby with manual intervention. The DR site has infrastructure defined (Infrastructure as Code) but not running. On disaster declaration, spin up infrastructure, restore from backups, verify, and redirect traffic. RTO of 4 to 24 hours: backup-only DR. Restore from backups to new infrastructure. This is the cheapest option but carries the most downtime risk.
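
The mapping from targets to patterns can be written down directly; the cut-offs below simply mirror the tiers described above and are illustrative, not a standard.

```python
def dr_strategy(rpo_minutes: float, rto_minutes: float) -> dict:
    """Map RPO/RTO targets to the disaster recovery patterns discussed above."""
    if rpo_minutes == 0:
        replication = "synchronous cross-region replication"
    elif rpo_minutes <= 1:
        replication = "asynchronous replication with frequent checkpoints"
    else:
        replication = "periodic full plus incremental backups to the DR site"

    if rto_minutes <= 1:
        topology = "active-active multi-region with automatic failover"
    elif rto_minutes <= 15:
        topology = "warm standby with automated promotion"
    elif rto_minutes <= 240:
        topology = "cold standby provisioned from infrastructure as code"
    else:
        topology = "restore from backups onto newly provisioned infrastructure"
    return {"replication": replication, "topology": topology}


print(dr_strategy(rpo_minutes=0, rto_minutes=10))
# {'replication': 'synchronous cross-region replication',
#  'topology': 'warm standby with automated promotion'}
```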

For the DR plan, document: detection (how do you know a disaster has occurred), declaration (who decides to activate DR and what is the criteria), execution (step-by-step failover procedure), verification (how do you confirm the DR site is operating correctly), and restoration (how do you fail back to the primary after the disaster is resolved).

Discuss testing the DR plan. A DR plan that has never been tested is very likely to fail when it is actually needed. Conduct DR drills quarterly: actually fail over to the DR site, run the system there, then fail back. Track the actual RTO achieved during drills versus the target. Fix gaps between actual and target.

Address the cost optimization dimension: RPO of zero and RTO under 1 minute requires full active-active multi-region deployment (2-3x infrastructure cost). Most systems accept a tiered approach: critical data (user accounts, financial records) gets zero RPO with synchronous replication. Important data (user content, messages) gets RPO under 1 minute with async replication. Replaceable data (caches, derived data, analytics) gets no explicit DR (it can be regenerated from source data).

Relate to compliance: many industries have regulatory requirements for DR (financial services require documented BCP/DR plans, healthcare has data retention requirements). Align your RPO and RTO targets with regulatory minimums, then exceed them based on business value. See our system design interview guide for how DR questions fit into broader architecture discussions.

Follow-up questions:

  • How do you handle a disaster that affects both your primary and DR sites simultaneously?
  • What is the relationship between RPO and the cost of synchronous replication?
  • How would you implement DR for a system with complex cross-service data dependencies?

Common Mistakes in High Availability Interviews

  1. Confusing high availability with disaster recovery. HA handles routine failures (a server crash, a network hiccup) with automatic recovery in seconds. DR handles catastrophic events (entire region loss, data corruption) with planned recovery in minutes to hours. Discuss both but do not conflate them.

  2. Not quantifying availability targets. Saying a system should be highly available is meaningless without numbers. Always anchor your discussion in specific nines, downtime budgets, and SLO targets. Relate these to business impact.

  3. Ignoring the cost dimension. Every additional nine of availability requires approximately 10x more investment. Acknowledge that engineering for five nines when three nines suffice is wasteful. Let business requirements drive availability targets.

  4. Designing redundancy without testing it. Adding a standby database does not improve availability if the failover mechanism has never been tested. For every redundancy mechanism, describe how you would validate it through chaos engineering or DR drills.

  5. Overlooking human factors. The majority of outages are caused by human error: misconfigurations, bad deployments, incorrect operational procedures. Discuss automation, guardrails, and operational practices alongside technical architecture.

How to Prepare for High Availability Interviews

Build practical experience with failure modes by working through real-world scenarios. Set up a multi-node database cluster and practice failover procedures. Deploy a service across multiple availability zones and simulate zone failures. Implement a circuit breaker and observe its behavior under dependency failures.

Study post-mortem reports from major outages at Google, AWS, Cloudflare, and other infrastructure providers. Each report reveals failure patterns that your HA architecture should address: cascading failures, split-brain incidents, clock skew issues, and configuration errors.

Master the theoretical foundations: understand the CAP theorem and its practical implications, learn how consensus algorithms work and their availability limitations, study eventual consistency models and their convergence guarantees, and understand the fundamentals of load balancing for traffic distribution.

Practice articulating trade-offs. Every HA decision involves trading something (cost, latency, complexity, consistency) for availability. Interviewers want to hear you reason through these trade-offs, not recite a single correct answer. For comprehensive preparation, see our system design interview guide and distributed systems guide. Explore the learning paths for structured study plans tailored to reliability engineering roles, and review our pricing page for access to advanced practice scenarios.
