Split-Brain Problem Explained: When Distributed Systems Disagree on Who Is in Charge

How split-brain occurs in distributed systems — causes, consequences, fencing tokens, STONITH, quorum-based prevention, and real-world outage examples.

Tags: split-brain, distributed-systems, fault-tolerance, consensus, high-availability

Split-Brain Problem

Split-brain is a failure mode in distributed systems where a network partition causes two or more subsets of nodes to independently believe they are the active primary, leading to conflicting operations and potential data corruption.

What It Really Means

Consider a database cluster with a primary node and a standby. The standby monitors the primary's health via heartbeats. If the primary goes silent — not because it crashed, but because the network between them is broken — the standby assumes the primary is dead and promotes itself to primary. Now you have two nodes, both accepting writes, both believing they are the legitimate primary. This is split-brain.

The damage from split-brain can be catastrophic. Both primaries accept conflicting writes. An inventory system might sell the same item twice. A banking system might process conflicting transfers. When the partition heals and the two primaries discover each other, reconciling their divergent states may be impossible without data loss.

Split-brain is one of the reasons distributed systems engineers lose sleep: it does not show up in unit tests, rarely occurs during normal operation, and causes maximum damage when it does. Every high-availability system must have a split-brain prevention strategy.

How It Works in Practice

How Split-Brain Occurs
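
The sequence is always some variant of the naive failover loop: the standby waits for heartbeats, a partition (not a crash) silences them, and the standby promotes itself while the old primary keeps serving writes. The sketch below shows that loop in miniature; `NaiveStandby`, `HEARTBEAT_TIMEOUT`, and the promotion step are illustrative, not any particular product's API.

```python
import time

HEARTBEAT_TIMEOUT = 5.0  # seconds of silence before declaring the primary dead

class NaiveStandby:
    """Standby that promotes itself on missed heartbeats alone.

    This is the logic that causes split-brain: a missed heartbeat cannot
    distinguish a crashed primary from a merely unreachable one.
    """

    def __init__(self):
        self.role = "standby"
        self.last_heartbeat = time.monotonic()

    def on_heartbeat(self):
        self.last_heartbeat = time.monotonic()

    def check(self):
        silence = time.monotonic() - self.last_heartbeat
        if self.role == "standby" and silence > HEARTBEAT_TIMEOUT:
            # DANGER: if the network is partitioned rather than the primary
            # crashed, the old primary is still alive and accepting writes
            # on the other side. Promoting here produces two primaries.
            self.role = "primary"
```

A longer timeout only trades detection speed for risk; the real fix is refusing to promote without quorum or fencing, as the strategies below show.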

Real-World Split-Brain Incidents

GitHub (2012): A network partition caused MySQL replicas to disagree on which node was the primary. The failover system promoted a replica that was behind, causing data loss for some repositories.

Elasticsearch clusters: Before version 7, Elasticsearch was notorious for split-brain. With a 3-node cluster and minimum_master_nodes set to 1 (the default), a partition could create two independent clusters, each with its own master. Version 7+ fixed this with a built-in voting-based quorum.

Redis Sentinel: If Redis Sentinel loses connectivity to the primary but the primary is still serving clients, Sentinel promotes a replica. Clients connected to the old primary continue writing to it while new clients write to the new primary.

Prevention Strategy 1: Quorum-Based Fencing

The most reliable prevention: require a majority (quorum) to operate. In a 5-node cluster, a node needs 3 votes to be the primary. During a partition, at most one side has the majority. The minority side cannot elect a primary.

etcd and Raft: Raft requires a majority for leader election. With 5 nodes partitioned into groups of 3 and 2, only the group of 3 can elect a leader. The group of 2 has no leader and refuses writes.
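
The arithmetic is simple enough to state in a few lines. A minimal sketch of the majority rule (the `has_quorum` helper is illustrative, not an etcd or Raft API):

```python
def has_quorum(votes: int, cluster_size: int) -> bool:
    """A node may lead only with a strict majority of the full cluster."""
    return votes > cluster_size // 2

# 5-node cluster partitioned 3/2: only the majority side can elect a leader.
assert has_quorum(3, 5)       # group of 3: can elect, keeps serving writes
assert not has_quorum(2, 5)   # group of 2: no leader, refuses writes

# 10-node cluster split 5/5: neither side has a majority, so neither side
# elects a leader. This is why even-sized voting sets need a tiebreaker.
assert not has_quorum(5, 10)
```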

Prevention Strategy 2: STONITH (Shoot The Other Node In The Head)

Before a standby promotes itself, it forcibly shuts down the old primary — typically by sending a command to the server's management interface (IPMI/iLO) to power off the machine. This guarantees the old primary cannot continue accepting writes.

Pacemaker/Corosync (Linux HA) uses STONITH as its main split-brain defense. If the fencing device is unreachable, the failover is aborted entirely.
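
The ordering is the whole point: fence first, promote second, and treat fencing failure as failover failure. A sketch of that control flow, assuming a BMC reachable via `ipmitool`; the exact invocation and the `promote_standby` hook are illustrative assumptions, not Pacemaker's implementation.

```python
import subprocess

def fence_node(bmc_host: str, user: str, password: str) -> bool:
    """Power off the old primary via its management interface (BMC)."""
    result = subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", bmc_host,
         "-U", user, "-P", password, "chassis", "power", "off"],
        capture_output=True,
    )
    return result.returncode == 0

def promote_standby():
    """Hypothetical promotion hook; stands in for the real promotion step."""
    print("standby promoted to primary")

def failover(old_primary_bmc: str, user: str, password: str):
    # Fence first, promote second. If fencing cannot be confirmed, abort
    # the failover entirely rather than risk two live primaries.
    if not fence_node(old_primary_bmc, user, password):
        raise RuntimeError("fencing failed; aborting failover")
    promote_standby()
```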

Prevention Strategy 3: Fencing Tokens

A coordination service (ZooKeeper, etcd) issues monotonically increasing tokens. When a new primary is elected, it receives token N+1. The storage system rejects any writes with token N or lower. Even if the old primary (holding token N) is still running, its writes are rejected.

Implementation

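A minimal sketch of the storage-side check, using an in-process store for illustration; `FencedStorage` and its `write` signature are hypothetical, not ZooKeeper's or etcd's API.

```python
import threading

class FencedStorage:
    """Store that rejects writes carrying a stale fencing token."""

    def __init__(self):
        self._lock = threading.Lock()
        self._highest_token = 0
        self._data = {}

    def write(self, token: int, key: str, value: str):
        with self._lock:
            if token < self._highest_token:
                # A newer primary holds a higher token; this writer is a
                # zombie from before the failover and must be rejected.
                raise PermissionError(f"stale fencing token {token}")
            self._highest_token = token
            self._data[key] = value

storage = FencedStorage()
storage.write(token=33, key="order:1", value="paid")     # old primary (token 33)
storage.write(token=34, key="order:2", value="shipped")  # new primary (token 34)
try:
    storage.write(token=33, key="order:1", value="refund")  # zombie write
except PermissionError as err:
    print(err)  # "stale fencing token 33" -- the old primary is fenced out
```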

Trade-offs

Prevention Strategies Compared

| Strategy | Pros | Cons |
|---|---|---|
| Quorum | No external dependencies | Requires odd number of nodes; minority is unavailable |
| STONITH | Guarantees old primary is stopped | Requires hardware access; fencing failure blocks failover |
| Fencing tokens | Works with any storage | Requires all storage to check tokens |
| Lease-based | Time-bounded; simple | Depends on clock synchronization |
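
Lease-based fencing appears only in the table, so one clarification: a primary may act only while it holds an unexpired lease, and must stop the instant the lease lapses, even if it has heard nothing from anyone. A minimal sketch, assuming lease durations comfortably larger than clock error; `LeaseHolder` is illustrative.

```python
import time

LEASE_DURATION = 10.0  # seconds; chosen much larger than expected clock error

class LeaseHolder:
    """Primary that may act only while its lease is unexpired."""

    def __init__(self):
        self.lease_expires_at = 0.0  # no lease held yet

    def grant(self):
        # In a real system the lease comes from a coordination service;
        # here it is granted locally for illustration.
        self.lease_expires_at = time.monotonic() + LEASE_DURATION

    def may_write(self) -> bool:
        # The moment the lease lapses, stop writing, even without any
        # signal from other nodes. Correctness depends on clocks not
        # drifting more than the margin built into LEASE_DURATION.
        return time.monotonic() < self.lease_expires_at

holder = LeaseHolder()
holder.grant()
assert holder.may_write()  # within the lease window: writes allowed
```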

Advantages of Split-Brain Prevention

  • Data integrity: Prevents conflicting writes and data corruption
  • Deterministic behavior: Clear rules for which side operates during partitions
  • Automated recovery: Systems can heal without manual intervention

Disadvantages

  • Reduced availability: The minority side of a partition becomes unavailable
  • Complexity: Fencing mechanisms add operational and engineering complexity
  • False positives: Overly aggressive failure detection can trigger unnecessary fencing, causing availability loss

Common Misconceptions

  • "Split-brain only happens with two nodes" — Split-brain can occur with any number of nodes if the system does not use quorum-based consensus. Even a 10-node cluster can split into two groups of 5, and without a tiebreaker, both groups might elect a leader.
  • "Heartbeats prevent split-brain"Heartbeats detect node failures but do not distinguish between a dead node and a partitioned one. Acting on a missed heartbeat without quorum can cause split-brain.
  • "Having a standby database prevents split-brain" — A standby with automatic failover is the most common CAUSE of split-brain. If the standby promotes while the old primary is still alive, you have two primaries.
  • "Split-brain always involves data loss" — Not always. If conflicting writes are to different keys, reconciliation may be straightforward. The damage depends on what was written during the split.
  • "Cloud-managed databases do not have split-brain issues" — Managed databases (RDS, Cloud SQL) have their own split-brain prevention mechanisms, but they can still experience brief availability gaps during partition-triggered failovers.

How This Appears in Interviews

Split-brain is a critical topic in distributed systems interviews:

  • "How do you prevent two primaries in a database cluster?" — Quorum-based leader election (Raft/Paxos), fencing tokens, or STONITH. Explain why heartbeat-based failover alone is insufficient.
  • "Your Redis cluster has two masters. What happened?" — Network partition caused Sentinel to promote a replica while the original primary was still running. Explain the fix: require quorum for promotion, use fencing.
  • "Design a highly available system that never corrupts data" — Choose CP during partitions. Use consensus protocols for leader election. Implement fencing tokens for all write operations.
  • "What is the difference between split-brain and a network partition?" — A network partition is the cause (network failure). Split-brain is the consequence (two nodes both acting as primary).

See our interview questions on distributed systems for more practice.
