Leader Election Explained: Choosing a Coordinator in Distributed Systems

How leader election works — Raft, Bully, and ZAB algorithms, why distributed systems need leaders, failure detection, and split-brain prevention.

leader-election · distributed-systems · consensus · raft · coordination

Leader Election

Leader election is the process by which nodes in a distributed system agree on a single node to act as the coordinator, responsible for making decisions, ordering operations, or managing shared resources.

What It Really Means

Many distributed systems need one node to be "in charge" at any given time. A database cluster needs a primary that accepts writes. A distributed lock service needs a master that grants locks. A message broker needs a controller that manages partition assignments. Without a designated leader, you get chaos — conflicting decisions, duplicate work, and data corruption.

The hard part is not choosing a leader when everything is working. The hard part is choosing a new leader when the current one fails — and knowing that the current one has actually failed rather than just being slow or unreachable due to a network partition. If two nodes both believe they are the leader (a split-brain scenario), the system can accept conflicting writes and corrupt data.

Leader election algorithms must guarantee two properties: safety (at most one leader at any time) and liveness (the system eventually elects a leader). Achieving both in the presence of network partitions is fundamentally difficult, which is why battle-tested algorithms like Raft and ZAB exist.

How It Works in Practice

Raft Leader Election (etcd, CockroachDB, TiKV)

Raft divides time into terms (numbered epochs). Each term has at most one leader. The election process:

  1. A follower does not hear from the leader within the election timeout (randomized, e.g., 150-300ms)
  2. It increments its term, transitions to candidate state, and votes for itself
  3. It sends RequestVote RPCs to all other nodes
  4. Each node votes for the first candidate it hears from in a given term (first-come, first-served)
  5. If a candidate receives votes from a majority (quorum), it becomes the leader
  6. The new leader sends periodic heartbeats to prevent new elections

The randomized election timeout is critical — it makes it unlikely that two nodes start an election simultaneously, preventing split votes.
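
To make the vote-granting rule concrete, here is a minimal Python sketch. The names (`Node`, `handle_request_vote`) are invented for illustration, and real Raft additionally rejects candidates whose log is behind the voter's:

```python
class Node:
    """Minimal sketch of Raft's vote-granting rule (names are illustrative)."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.current_term = 0
        self.voted_for = None  # candidate we granted our vote to this term

    def handle_request_vote(self, candidate_id, candidate_term):
        # A higher term resets our vote: we adopt the new term first.
        if candidate_term > self.current_term:
            self.current_term = candidate_term
            self.voted_for = None
        # At most one vote per term, first-come first-served.
        # (Real Raft also rejects candidates whose log is behind ours.)
        if candidate_term == self.current_term and self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False

# A candidate becomes leader once a majority grants its vote:
nodes = [Node(i) for i in range(5)]
votes = sum(n.handle_request_vote(candidate_id=0, candidate_term=1) for n in nodes)
print("elected" if votes > len(nodes) // 2 else "split vote")  # elected
```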

ZAB Protocol (Apache ZooKeeper)

ZooKeeper uses the ZAB (ZooKeeper Atomic Broadcast) protocol. Leader election in ZAB prioritizes nodes with the most up-to-date transaction log. This ensures the new leader has all committed transactions, preventing data loss during failover.
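
A sketch of that preference, assuming each vote carries an election epoch, the last transaction id seen (zxid), and a server id as the final tie-breaker — roughly the ordering ZooKeeper's election uses:

```python
# Sketch of ZAB-style vote ordering (field names are illustrative).
# Prefer the higher epoch; break ties with the larger zxid (the most
# up-to-date transaction log), then the larger server id.
def better_vote(a, b):
    return (a["epoch"], a["zxid"], a["sid"]) > (b["epoch"], b["zxid"], b["sid"])

current = {"epoch": 4, "zxid": 0x400000A1, "sid": 1}
incoming = {"epoch": 4, "zxid": 0x400000B7, "sid": 3}
if better_vote(incoming, current):
    current = incoming  # switch our vote to the more up-to-date server
print(current["sid"])  # 3
```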

Bully Algorithm

The Bully algorithm is simpler but less robust. The node with the highest ID always wins. When a node detects the leader has failed, it sends election messages to all higher-ID nodes. If none respond, it declares itself the leader. If a higher-ID node responds, that node takes over the election, and the process repeats up the ID order. A toy sketch of the idea follows, with the message-passing collapsed into direct method calls (the `cluster` registry and `alive` flag are invented for illustration).
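
```python
class BullyNode:
    """Toy Bully election: the highest-ID live node wins (illustrative only)."""

    def __init__(self, node_id, cluster):
        self.node_id = node_id
        self.cluster = cluster  # shared registry: id -> BullyNode
        self.alive = True
        self.leader_id = None

    def start_election(self):
        higher = [n for i, n in self.cluster.items() if i > self.node_id and n.alive]
        if not higher:
            # No live node outranks us: announce victory to everyone.
            for node in self.cluster.values():
                node.leader_id = self.node_id
        else:
            # A higher-ID node responds and takes over the election.
            max(higher, key=lambda n: n.node_id).start_election()

cluster = {}
for i in range(1, 6):
    cluster[i] = BullyNode(i, cluster)
cluster[5].alive = False        # current leader (highest ID) dies
cluster[2].start_election()     # node 2 notices and starts an election
print(cluster[1].leader_id)     # 4 -- highest surviving ID wins
```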

Kafka uses a variation where the controller (a broker elected via ZooKeeper or KRaft) manages partition leader assignment. When a broker fails, the controller reassigns its partition leaders to other brokers in the ISR (In-Sync Replicas) list.

Google Chubby

Google's Chubby lock service uses Paxos for leader election. Chubby provides a distributed lock that services use to elect a leader among themselves. The service that acquires the Chubby lock is the leader. If it fails, the lock is released after a timeout, and another service acquires it.
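
This lock-based pattern is available off the shelf today. For instance, with the kazoo ZooKeeper client for Python, a service can block until it holds the election lock (the host address and identifier below are placeholders):

```python
from kazoo.client import KazooClient

def lead():
    # Runs only while this process holds leadership; returning (or
    # crashing) releases the lock so another instance can take over.
    print("I am the leader; doing coordinator work...")

client = KazooClient(hosts="127.0.0.1:2181")  # placeholder address
client.start()
election = client.Election("/myapp/election", identifier="instance-1")
election.run(lead)  # blocks until elected, then invokes lead()
```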

Implementation

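A minimal, self-contained sketch of lease-based election with fencing tokens — the pattern coordination services like etcd and Chubby expose. The in-memory `LeaseStore` stands in for a real coordination service, and every name here is illustrative:

```python
import time

class LeaseStore:
    """In-memory stand-in for a coordination service (etcd, ZooKeeper, ...)."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.holder = None
        self.expires_at = 0.0
        self.fencing_token = 0  # monotonically increasing per grant

    def try_acquire(self, node_id, now=None):
        now = time.monotonic() if now is None else now
        if self.holder is None or now >= self.expires_at:
            # Lease is free or expired: grant it with a fresh fencing token.
            self.holder = node_id
            self.expires_at = now + self.ttl
            self.fencing_token += 1
            return self.fencing_token
        return None  # someone else holds a live lease

    def renew(self, node_id, now=None):
        now = time.monotonic() if now is None else now
        if self.holder == node_id and now < self.expires_at:
            self.expires_at = now + self.ttl
            return True
        return False  # lost the lease; stop acting as leader immediately

store = LeaseStore(ttl_seconds=5)

token_a = store.try_acquire("node-a")  # node-a becomes leader, token 1
token_b = store.try_acquire("node-b")  # None: lease still live
print(token_a, token_b)                # 1 None

# Simulate node-a stalling past its lease (a long GC pause, say):
store.expires_at = time.monotonic() - 1
token_b = store.try_acquire("node-b")  # node-b takes over with token 2
print(token_b)                         # 2

# Downstream systems reject writes carrying a stale token (< 2), which
# fences off node-a if it wakes up still believing it is the leader.
```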

Trade-offs

Advantages

  • Single coordinator simplifies design: One leader makes ordering, conflict resolution, and decision-making straightforward
  • Well-understood algorithms: Raft, Paxos, and ZAB are formally proven correct
  • Automatic failover: Systems detect leader failure and elect a new one within seconds

Disadvantages

  • Leader is a bottleneck: All writes or coordination go through one node, limiting throughput
  • Failover latency: Election takes time (typically 1-10 seconds), during which the system may be unavailable for writes
  • Complexity of correctness: Implementing leader election correctly is notoriously difficult. Subtle bugs lead to split-brain or data loss
  • Network partition sensitivity: Partitioned leaders may not know they have been replaced, leading to stale operations

Algorithm Comparison

| Algorithm | Used By | Complexity | Key Property |
|-----------|---------|------------|--------------|
| Raft | etcd, CockroachDB | Moderate | Understandable by design |
| Paxos | Google Chubby, Spanner | High | Mathematically proven |
| ZAB | ZooKeeper | Moderate | Preserves transaction order |
| Bully | Simple systems | Low | Highest ID wins |

Common Misconceptions

  • "Leader election guarantees exactly one leader at all times" — During elections, there may be a brief period with no leader (availability gap). During network partitions, there may briefly appear to be two leaders, but safety mechanisms (term numbers, fencing tokens) prevent both from making progress.
  • "The leader is always the best node" — In Raft, any node can become leader. The first candidate to get a majority wins. Raft does not consider node capacity, load, or data freshness (though ZAB does prioritize data freshness).
  • "Leader election is only for databases" — Leader election appears in task schedulers (one scheduler assigns jobs), distributed caches (one node handles cache invalidation), and microservices (one instance runs periodic batch jobs).
  • "You should implement your own leader election" — Do not. Use etcd, ZooKeeper, or Consul. Leader election has subtle edge cases that take years to get right. Let battle-tested systems handle it.

How This Appears in Interviews

Leader election is a core building block in system design interviews:

  • "How do you ensure only one instance runs a cron job in a cluster?" — Leader election via distributed lock (ZooKeeper, etcd, or Redis SETNX with expiry).
  • "Your database primary just died. Walk me through failover." — Followers detect missing heartbeats, trigger election, new leader is elected, clients are redirected. Discuss data loss risk with async replication.
  • "How does Kafka handle broker failures?" — The controller (elected leader) reassigns partition leaders to ISR members.
  • "What happens if the leader is partitioned from the cluster?" — Fencing tokens, term numbers, or lease expiry prevent the old leader from making writes after a new leader is elected.

See our interview questions on distributed systems for practice.
