Distributed Systems Fundamentals
Master distributed systems from CAP theorem to consensus algorithms, replication, sharding, and distributed transactions with real-world examples.
Distributed systems power every large-scale application you use daily. When you send a message on WhatsApp, stream a video on Netflix, or search on Google, dozens to thousands of machines coordinate to serve your request in milliseconds. Understanding how these systems work is not optional for senior engineers — it is the foundation of every system design interview and every production architecture decision you will make.
This guide covers the core concepts, algorithms, and patterns that underpin modern distributed systems. Each section includes the theory you need to understand, the real-world systems that implement it, and the trade-offs that determine which approach to use. Whether you are preparing for a system design interview at a top tech company or building production infrastructure, this is the reference you need.
Table of Contents
- CAP Theorem and the Reality of Distributed Trade-offs
- Consistency Models
- Consensus Algorithms: Raft and Paxos
- Replication Strategies
- Partitioning and Sharding
- Distributed Transactions: 2PC and Saga
- Vector Clocks and Logical Time
- Gossip Protocols
- Leader Election
- Failure Detection
- Distributed Caching
- Message Queues and Event Streaming
- How to Study This Material
- Related Resources
CAP Theorem and the Reality of Distributed Trade-offs
The CAP theorem, formulated by Eric Brewer in 2000 and formally proved by Gilbert and Lynch in 2002, states that a distributed data store can provide at most two of three guarantees simultaneously: Consistency (every read returns the most recent write), Availability (every request receives a non-error response), and Partition Tolerance (the system operates despite network partitions between nodes).
What CAP Actually Means
The most common misunderstanding of CAP is "pick any two." In reality, network partitions are inevitable in any distributed system — switches fail, cables get cut, cloud availability zones lose connectivity. Partition tolerance is not optional. The real question is:
When a network partition occurs, does your system prioritize consistency or availability?
- CP systems reject requests that cannot guarantee consistency during a partition. A client connected to the minority side of a partition receives errors. Examples: ZooKeeper, etcd, MongoDB (with majority write concern), Google Spanner.
- AP systems continue serving requests during a partition, potentially returning stale data. Both sides of the partition accept reads and writes independently. Examples: Cassandra (with consistency level ONE), DynamoDB (default), Riak, CouchDB.
When no partition exists (the normal case), systems can provide both consistency and availability. CAP only constrains behavior during failure scenarios.
Beyond CAP: The PACELC Framework
Daniel Abadi extended CAP with the PACELC framework, which captures an additional trade-off: if a Partition occurs (P), the system chooses between Availability and Consistency (A/C); Else (E), during normal operation, it chooses between Latency and Consistency (L/C).
- PA/EL (DynamoDB default): Available during partitions, low latency otherwise. Eventual consistency.
- PC/EC (MongoDB): Consistent during partitions, consistent otherwise. Higher latency due to synchronous replication.
- PA/EC (Cassandra with tunable consistency): Available during partitions, but can be configured for consistency per-query when no partition exists.
PACELC is more useful in practice because the latency-consistency trade-off affects every single request, not just the rare partition scenario.
Real-World Example: Amazon DynamoDB
Amazon designed Dynamo (the internal predecessor of DynamoDB) as an AP system because, for its e-commerce use case, availability is more important than immediate consistency. If a customer adds an item to their cart during a network partition, Amazon would rather show a slightly stale cart than return an error. The famous "shopping cart" example from the original Dynamo paper argues that it is always better to accept a write and reconcile later than to reject a customer action.
DynamoDB offers both eventually consistent reads (default, lower latency) and strongly consistent reads (optional, higher latency, only from the leader). This per-request tunability is a pattern adopted by many modern databases.
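To see this per-request tunability concretely, here is a minimal sketch using boto3. It assumes AWS credentials are configured and uses a hypothetical `Carts` table keyed by `user_id`; the only difference between the two read modes is the `ConsistentRead` flag.

```python
import boto3

client = boto3.client("dynamodb", region_name="us-east-1")

# Eventually consistent read (the default): lower latency, may return stale data.
stale_ok = client.get_item(
    TableName="Carts",
    Key={"user_id": {"S": "user#123"}},
)

# Strongly consistent read: served from the partition's leader replica, higher latency.
fresh = client.get_item(
    TableName="Carts",
    Key={"user_id": {"S": "user#123"}},
    ConsistentRead=True,
)
```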
For a deeper dive, read our CAP theorem concept page which covers common misconceptions and interview discussion strategies.
Real-World Example: Google Spanner
Google Spanner takes the opposite approach. It is a globally distributed CP system that uses tightly synchronized clocks (TrueTime, backed by GPS receivers and atomic clocks) to achieve external consistency — the strongest consistency guarantee. Spanner can tell you the exact real-world time at which a transaction committed, globally.
The cost is latency: a write in Spanner requires a quorum of replicas across data centers to acknowledge, which adds tens of milliseconds. For Google's advertising and financial systems, this consistency is worth the latency penalty.
Trade-offs and When to Choose
| Scenario | Choose CP | Choose AP |
|---|---|---|
| Financial transactions | Yes — incorrect balances cause real harm | No |
| Social media feeds | No | Yes — stale posts are acceptable |
| Inventory management | Depends — overselling is costly, but availability drives revenue | Depends |
| User session storage | No | Yes — showing a logged-out state is worse than a stale session |
| Configuration management | Yes — inconsistent config causes cascading failures | No |
Consistency Models
Consistency models define the contract between a distributed system and its clients about what values a read operation can return. Understanding the spectrum from strong to weak consistency is essential for designing systems that balance correctness, performance, and availability.
Strong Consistency (Linearizability)
Linearizability is the strongest consistency model. Every operation appears to take effect at a single instant in time, and that instant falls between the operation's invocation and response. If client A writes a value and client B reads after A's write returns, B is guaranteed to see A's write.
Linearizability is expensive because it typically requires synchronous communication with a quorum of replicas. Every write must be acknowledged by a majority before it is considered committed.
Systems that provide linearizability: ZooKeeper (reads through the leader), etcd, Google Spanner, CockroachDB.
Sequential Consistency
Sequential consistency guarantees that all operations appear to execute in some sequential order, and the operations from each individual client appear in the order they were issued. However, there is no real-time ordering guarantee — a write that completed before a read started might not be visible to the reader.
Practical implication: If you write a value and then immediately read it from a different replica, you might not see your own write. But within a single client's operations, ordering is preserved.
Causal Consistency
Causal consistency preserves the ordering of causally related operations. If operation A causally precedes operation B (e.g., A writes a value that B reads), then every node sees A before B. Concurrent operations (those with no causal relationship) can be seen in different orders by different nodes.
Causal consistency is attractive because it provides meaningful ordering guarantees without the performance cost of linearizability. MongoDB offers causal consistency through causal sessions.
Example: In a social media system, if user A posts a comment and user B replies to that comment, every user must see A's comment before B's reply. But two unrelated comments by different users can appear in different orders to different readers.
Eventual Consistency
Eventual consistency guarantees that if no new updates are made, all replicas will eventually converge to the same value. It provides no guarantee about how long convergence takes or what intermediate values might be visible.
Systems that use eventual consistency: DynamoDB (default reads), Cassandra (with low consistency levels), DNS, Amazon S3.
Eventual consistency is acceptable when stale reads are tolerable. A product catalog, a social media timeline, or a content caching layer can tolerate reading a slightly outdated value for a few seconds.
Read-Your-Writes Consistency
A practical consistency guarantee that ensures a client always sees the effect of its own writes, even if it connects to a different replica. This does not guarantee that other clients see the write immediately.
Implementation strategies:
- Route reads after a write to the leader replica for a short window
- Track the write timestamp and only serve reads from replicas that are at least that up-to-date
- Use sticky sessions to pin a client to a specific replica
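As a minimal sketch of the second strategy, assume each replica exposes the timestamp it has replicated up to; the `applied_up_to` field and the class names below are hypothetical, and replication itself is not simulated.

```python
import time

class Replica:
    """Stub replica that knows how far it has replicated (hypothetical API)."""
    def __init__(self, name):
        self.name = name
        self.applied_up_to = 0.0   # timestamp of the last write applied here
        self.data = {}

    def read(self, key):
        return self.data.get(key)

class ReadYourWritesClient:
    """Serves a read only from replicas caught up to this client's last write."""
    def __init__(self, leader, followers):
        self.leader = leader
        self.followers = followers
        self.last_write_ts = 0.0

    def write(self, key, value):
        ts = time.time()
        self.leader.data[key] = value
        self.leader.applied_up_to = ts
        self.last_write_ts = ts            # remember our own write position

    def read(self, key):
        # Prefer any follower that is at least as fresh as our last write...
        for replica in self.followers:
            if replica.applied_up_to >= self.last_write_ts:
                return replica.read(key)
        # ...otherwise fall back to the leader, which always has our write.
        return self.leader.read(key)
```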
Monotonic Reads
Monotonic reads guarantee that if a client reads a value at time T1, any subsequent read at time T2 > T1 will return the same value or a more recent one. The client never sees time "go backward."
Without monotonic reads: A client reads from replica A (which has the latest value), then reads from replica B (which is behind). The client sees a newer value followed by an older value, which is confusing.
How Google, Netflix, and Amazon Choose Consistency Models
Google uses strong consistency for Spanner (financial data) but eventual consistency for many internal services where performance matters more than immediate accuracy. Netflix uses eventual consistency pervasively — the recommendation engine, viewing history, and user profiles all tolerate stale reads. Amazon DynamoDB defaults to eventual consistency but offers per-request strong consistency for operations that need it, such as checking inventory before confirming an order.
For interview preparation, you should be able to explain why you would choose a specific consistency model for a given use case and articulate the performance implications.
Consensus Algorithms: Raft and Paxos
Consensus is the problem of getting multiple nodes to agree on a single value, even when some nodes fail. It is the foundation of replicated state machines, distributed databases, leader election, and coordination services.
The Consensus Problem
A correct consensus algorithm must satisfy three properties:
- Agreement: All non-faulty nodes decide on the same value.
- Validity: The decided value was proposed by some node.
- Termination: All non-faulty nodes eventually decide.
The FLP impossibility result (Fischer, Lynch, Paterson, 1985) proves that no deterministic consensus algorithm can guarantee all three properties in an asynchronous system with even one faulty process. Practical algorithms work around this by using timeouts, randomization, or partial synchrony assumptions.
Paxos
Leslie Lamport introduced Paxos in 1989. It is the foundational consensus algorithm, but it is notoriously difficult to understand and implement correctly.
Single-decree Paxos decides on a single value through two phases:
Phase 1: Prepare
- A proposer selects a proposal number `n` (it must be unique and higher than any previous proposal number the proposer has used).
- The proposer sends a `Prepare(n)` message to a majority of acceptors.
- Each acceptor responds with a `Promise(n, v_accepted)` if `n` is higher than any prepare request it has previously responded to. The promise includes the value of the highest-numbered proposal the acceptor has already accepted, if any.
Phase 2: Accept
- If the proposer receives promises from a majority, it sends an `Accept(n, v)` message. The value `v` is either the value from the highest-numbered proposal among the promises, or the proposer's own value if no acceptor had previously accepted a value.
- Each acceptor accepts the proposal if it has not promised to a higher-numbered proposal.
- Once a majority of acceptors accept, the value is chosen.
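The acceptor side of these two phases fits in a few lines. Below is a minimal single-decree sketch (message transport, durable storage, and proposer retries are omitted); the `choose_value` helper implements the proposer rule from Phase 2.

```python
class Acceptor:
    """Acceptor state for single-decree Paxos (transport and persistence omitted)."""
    def __init__(self):
        self.promised_n = -1     # highest proposal number promised
        self.accepted_n = -1     # highest proposal number accepted
        self.accepted_v = None   # value of that accepted proposal

    def on_prepare(self, n):
        # Phase 1: promise to ignore proposals numbered below n.
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", n, self.accepted_n, self.accepted_v)
        return ("reject", n)

    def on_accept(self, n, v):
        # Phase 2: accept unless we promised a higher-numbered proposal.
        if n >= self.promised_n:
            self.promised_n = self.accepted_n = n
            self.accepted_v = v
            return ("accepted", n, v)
        return ("reject", n)

def choose_value(promises, own_value):
    """Proposer rule: adopt the value of the highest-numbered accepted proposal, if any."""
    accepted = [(an, av) for (_, _, an, av) in promises if av is not None]
    return max(accepted, key=lambda t: t[0])[1] if accepted else own_value
```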
Multi-Paxos extends single-decree Paxos to decide on a sequence of values (a replicated log). A stable leader skips Phase 1 for subsequent proposals, reducing the algorithm to a single round trip per log entry.
Where Paxos is used: Google's Chubby lock service, Google Spanner's replication layer, and Apache Mesos.
Raft
Diego Ongaro and John Ousterhout designed Raft in 2014 specifically to be understandable. It achieves the same safety guarantees as Multi-Paxos but decomposes the problem into three sub-problems: leader election, log replication, and safety.
Leader Election
Raft nodes are in one of three states: follower, candidate, or leader. Time is divided into terms (logical clocks). Each term has at most one leader.
- Followers receive heartbeats from the leader. If a follower receives no heartbeat for an election timeout (randomized between 150-300ms), it becomes a candidate.
- The candidate increments its term, votes for itself, and sends `RequestVote` RPCs to all other nodes.
- A node grants its vote if the candidate's term is at least as high as its own and the candidate's log is at least as up-to-date as its own.
- If the candidate receives votes from a majority, it becomes the leader. If it learns of a higher term, it steps down to follower.
Log Replication
- The leader receives client requests and appends them as entries to its log.
- The leader sends `AppendEntries` RPCs to all followers with the new log entries.
- Followers append the entries to their logs and respond.
- Once a majority of followers acknowledge, the leader commits the entry and responds to the client.
- The leader notifies followers of committed entries in subsequent heartbeats.
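The commit rule in step 4 reduces to a median-style calculation over replication progress. A minimal sketch, assuming we know the highest replicated log index on each node (leader's own log included):

```python
def committed_index(replicated_indexes):
    """Highest log index present on a majority of nodes.

    Real Raft adds one more check: the entry at this index must be from the
    leader's current term before it may be committed.
    """
    ordered = sorted(replicated_indexes, reverse=True)
    majority = len(replicated_indexes) // 2 + 1
    return ordered[majority - 1]

# 5-node cluster: the leader is at index 7; followers are at 7, 5, 3, and 2.
print(committed_index([7, 7, 5, 3, 2]))   # 5 -> entries up to index 5 are committed
```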
Safety
Raft guarantees that if a log entry is committed, it will be present in the logs of all future leaders. This is ensured by the election restriction: a candidate cannot win an election unless its log contains all committed entries.
Where Raft is used: etcd (Kubernetes coordination), CockroachDB, TiKV, Consul, HashiCorp Vault.
Paxos vs. Raft: When to Use Each
| Aspect | Paxos | Raft |
|---|---|---|
| Understandability | Difficult | Designed for clarity |
| Leader requirement | No (leaderless variant exists) | Yes (requires a stable leader) |
| Performance | Slightly more flexible | Comparable to Multi-Paxos |
| Implementation complexity | Very high | Moderate |
| Industry adoption | Legacy systems, Google | Modern systems, Kubernetes ecosystem |
For most new systems, Raft is the correct choice. Use Paxos only when you need its specific properties (e.g., leaderless operation) or when integrating with existing Paxos-based infrastructure.
Real-World Example: etcd and Kubernetes
Kubernetes stores all cluster state in etcd, which uses Raft for consensus. When you run `kubectl apply`, the API server writes the desired state to etcd. Raft ensures this write is replicated to a majority of etcd nodes before it is acknowledged. If the etcd leader fails, Raft elects a new leader within a few hundred milliseconds, and the cluster continues operating.
A typical production etcd cluster has 3 or 5 nodes. Three nodes tolerate 1 failure. Five nodes tolerate 2 failures. More nodes increase fault tolerance but also increase write latency (more nodes must acknowledge each write).
To understand how this applies in interview scenarios, see our system design interview questions which frequently test consensus knowledge.
Replication Strategies
Replication copies data across multiple nodes to achieve fault tolerance, improve read throughput, and reduce latency by placing data closer to users. The replication strategy determines the consistency, availability, and durability characteristics of your system.
Single-Leader (Primary-Secondary) Replication
All writes go to a single leader node. The leader replicates changes to one or more follower nodes. Reads can be served by the leader (for strong consistency) or by followers (for higher throughput but potential staleness).
How it works:
- Client sends a write to the leader.
- Leader writes to its local storage.
- Leader sends the change to all followers via a replication log.
- Followers apply the change to their local storage.
- Synchronous replication: leader waits for follower acknowledgment before confirming the write. Asynchronous replication: leader confirms immediately.
Advantages: Simple to understand, no write conflicts, strong consistency when reading from the leader.
Disadvantages: The leader is a single point of failure (mitigated by automatic failover), all writes must go through one node (limits write throughput), followers may serve stale reads.
Used by: PostgreSQL, MySQL, MongoDB, Redis Sentinel.
Failover challenges: When the leader fails, a follower must be promoted. This raises several issues:
- If the failed leader had unreplicated writes, those writes are lost.
- If the old leader comes back online, it might have conflicting data.
- Split-brain: if the network partitions and each side promotes its own leader, both accept writes and the data diverges.
Multi-Leader Replication
Multiple nodes can accept writes independently. Each leader replicates its writes to all other leaders. This is useful for multi-datacenter deployments where you want local write latency.
How it works:
- Clients in each datacenter write to their local leader.
- Each leader asynchronously replicates writes to leaders in other datacenters.
- Conflict resolution handles cases where two leaders modify the same data concurrently.
Conflict resolution strategies:
- Last-write-wins (LWW): Use timestamps to determine which write is most recent. Simple but can lose data.
- Custom merge functions: Application-specific logic to merge conflicting values.
- CRDTs (Conflict-free Replicated Data Types): Data structures that can be merged automatically without conflicts. Examples: counters, sets, registers.
Used by: CouchDB, Galera Cluster for MySQL, Google Docs (operational transformation is conceptually similar).
Leaderless Replication
Any node can accept reads and writes. Clients send writes to multiple replicas simultaneously. Reads query multiple replicas and use version numbers to determine the most recent value.
How it works:
- Client sends a write to all `n` replicas.
- Write succeeds if at least `w` replicas acknowledge (write quorum).
- Client sends a read to all `n` replicas.
- Read succeeds if at least `r` replicas respond (read quorum).
- If `w + r > n`, the read quorum and write quorum overlap, guaranteeing that at least one replica has the latest value.
Example with `n=3, w=2, r=2`:
- A write is acknowledged after 2 of 3 replicas confirm.
- A read queries 2 of 3 replicas. At least one of those 2 participated in the write quorum, so the reader sees the latest value.
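A minimal in-process sketch of quorum reads and writes, using version numbers to identify the latest value. The replicas here are plain dictionaries, so there is no networking or failure handling; the comments note where a real client would behave differently.

```python
class QuorumStore:
    """Leaderless quorum reads/writes over in-process replicas (sketch)."""
    def __init__(self, replicas, w, r):
        assert w + r > len(replicas), "quorums must overlap: w + r > n"
        self.replicas, self.w, self.r = replicas, w, r
        self.version = 0                       # monotonically increasing write version

    def write(self, key, value):
        self.version += 1
        acks = 0
        for replica in self.replicas:          # send to all n replicas
            replica[key] = (self.version, value)
            acks += 1                          # a real replica could be down or slow
        if acks < self.w:
            raise RuntimeError("write quorum not reached")

    def read(self, key):
        # A real client waits for the first r responses; here we just take r replicas.
        responses = [rep.get(key) for rep in self.replicas[: self.r]]
        fresh = [v for v in responses if v is not None]
        if not fresh:
            raise KeyError(key)
        return max(fresh, key=lambda t: t[0])[1]   # highest version wins

store = QuorumStore([{}, {}, {}], w=2, r=2)
store.write("cart", ["book"])
print(store.read("cart"))   # ['book']
```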
Anti-entropy mechanisms:
- Read repair: When a client reads from multiple replicas and detects a stale value, it writes the latest value back to the stale replica.
- Anti-entropy process: A background process continuously compares replicas and copies missing data.
Used by: Amazon DynamoDB, Apache Cassandra, Riak.
Choosing a Replication Strategy
| Factor | Single-Leader | Multi-Leader | Leaderless |
|---|---|---|---|
| Write latency | Higher (single writer) | Low (local leader) | Low (any node) |
| Write throughput | Limited by leader | Higher | Highest |
| Read consistency | Strong (from leader) | Eventual | Tunable (quorum) |
| Conflict handling | None (single writer) | Required | Required |
| Failover complexity | Moderate | Low (other leaders exist) | None (no leader) |
| Best for | Most applications | Multi-region deployments | High availability, high write throughput |
For a detailed comparison of how databases implement these strategies, see our Kafka vs RabbitMQ comparison and the data pipeline architecture design which discusses replication in context.
Partitioning and Sharding
When a dataset grows beyond what a single node can handle, you must split it across multiple nodes. Partitioning (also called sharding) divides data so that each partition can be stored and processed independently.
Hash-Based Partitioning
Apply a hash function to the partition key and assign each hash range to a partition.
How it works:
- Choose a partition key (e.g., `user_id`).
- Compute `hash(user_id) % num_partitions`.
- Route the request to the corresponding partition.
Advantages: Even distribution of data (assuming a good hash function and a partition key with high cardinality). No hotspots from sequential keys.
Disadvantages: Range queries on the partition key are inefficient — you cannot efficiently answer "all users with IDs 1000-2000" because adjacent keys are scattered across partitions.
Used by: DynamoDB, Cassandra (with the Murmur3 partitioner), MongoDB (hashed shard key).
Range-Based Partitioning
Assign contiguous ranges of the partition key to each partition.
How it works:
- Partition 1: keys A-F
- Partition 2: keys G-N
- Partition 3: keys O-Z
Advantages: Efficient range queries. All data for a key range is on the same partition.
Disadvantages: Risk of hotspots if the key distribution is skewed. For example, if most users have names starting with "S", that partition handles disproportionate load.
Used by: HBase, Google Bigtable, CockroachDB, TiKV.
Consistent Hashing
Consistent hashing minimizes data movement when nodes are added or removed. Nodes and keys are mapped to positions on a hash ring. Each key is assigned to the nearest node clockwise on the ring.
Why it matters: With simple modulo hashing, adding one node requires remapping almost all keys. With consistent hashing, adding a node only remaps keys from one adjacent node.
Virtual nodes: Each physical node is assigned multiple positions on the ring (virtual nodes). This improves load distribution and ensures that when a node fails, its load is spread across many other nodes rather than dumped on one.
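A minimal hash-ring sketch with virtual nodes. MD5 is used here only because it is in the standard library; production systems typically use faster hashes such as Murmur3.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Hash ring with virtual nodes (sketch)."""
    def __init__(self, nodes, vnodes=100):
        self.ring = []                           # sorted list of (position, node)
        for node in nodes:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        pos = self._hash(key)
        # First virtual node clockwise from the key's position (with wraparound).
        i = bisect.bisect(self.ring, (pos,))
        return self.ring[i % len(self.ring)][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user_42"))   # deterministic assignment to one node
```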
For an in-depth explanation with diagrams, see our dedicated consistent hashing article, which covers the hash ring, virtual nodes, and how DynamoDB and Cassandra use consistent hashing in production.
Handling Hotspots
Even with good partitioning, some keys are inherently "hotter" than others. A celebrity's profile page, a viral tweet, or a popular product page concentrates traffic on one partition.
Mitigation strategies:
- Key salting: Append a random suffix to hot keys (e.g., `celebrity_123_01`, `celebrity_123_02`, ..., `celebrity_123_10`), spreading the load across 10 partitions. Reads must fan out and merge.
- Read replicas: Replicate hot partitions to multiple nodes for read scaling.
- Application-level caching: Cache hot keys in an in-memory cache to reduce partition load.
Instagram's approach: When a celebrity posts, Instagram writes the post to multiple shards using key salting. Fan-out reads query all shards and merge results. This adds read complexity but prevents a single shard from being overwhelmed.
Rebalancing Partitions
As data grows or nodes are added/removed, partitions must be rebalanced:
- Fixed number of partitions: Create many more partitions than nodes initially (e.g., 1000 partitions on 10 nodes). When a new node joins, it takes ownership of some partitions from existing nodes. No partition splitting required. Used by Elasticsearch, Riak, Couchbase.
- Dynamic partitioning: Split partitions when they exceed a size threshold. Merge partitions when they shrink. Used by HBase, RethinkDB.
- Proportional partitioning: Each node has a fixed number of partitions. When a node joins, it randomly splits existing partitions. Used by Cassandra.
Distributed Transactions: 2PC and Saga
Distributed transactions span multiple services or databases. They are essential when a single business operation must atomically update data in multiple places. The two dominant approaches are two-phase commit (2PC) for strong atomicity and the Saga pattern for eventual consistency.
Two-Phase Commit (2PC)
2PC is a blocking protocol that ensures all participants in a distributed transaction either commit or abort.
Phase 1: Prepare
- The coordinator sends a `Prepare` message to all participants.
- Each participant executes the transaction locally, writes to a write-ahead log, and responds with `Yes` (ready to commit) or `No` (abort).
Phase 2: Commit/Abort
- If all participants voted `Yes`, the coordinator sends `Commit` to all.
- If any participant voted `No`, the coordinator sends `Abort` to all.
- Participants execute the commit or abort and acknowledge.
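A minimal coordinator sketch of the two phases; the `Participant` methods are placeholders, and real implementations need durable logs, locks, and timeouts. Note that the decision is logged before Phase 2 begins: if the coordinator crashes between the phases, participants block until it recovers and replays that log.

```python
class Participant:
    """Placeholder participant: prepare() must lock, execute, and WAL-log locally."""
    def prepare(self, txn):
        return True          # vote Yes; a real participant may vote No

    def commit(self, txn):
        pass                 # apply the transaction and release locks

    def abort(self, txn):
        pass                 # roll back and release locks

def two_phase_commit(coordinator_log, participants, txn):
    # Phase 1: collect votes from every participant.
    votes = [p.prepare(txn) for p in participants]
    decision = "commit" if all(votes) else "abort"
    coordinator_log.append((txn, decision))   # decision is made durable first
    # Phase 2: broadcast the decision; participants hold locks until it arrives.
    for p in participants:
        p.commit(txn) if decision == "commit" else p.abort(txn)
    return decision

log = []
print(two_phase_commit(log, [Participant(), Participant()], txn="txn-42"))  # commit
```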
Problems with 2PC:
- Blocking: If the coordinator crashes after sending `Prepare` but before sending `Commit`/`Abort`, participants are stuck holding locks indefinitely. They cannot safely commit or abort because they do not know the coordinator's decision.
- Latency: Two round trips minimum, and participants hold locks during the entire protocol.
- Single point of failure: The coordinator is a bottleneck and a failure point.
Three-Phase Commit (3PC) adds a "pre-commit" phase to reduce the blocking window, but it is rarely used in practice because it does not handle network partitions correctly.
Where 2PC is used: Database-internal transactions (PostgreSQL, MySQL InnoDB for distributed transactions within a single database cluster), XA transactions, Google Spanner (a heavily modified version with TrueTime).
The Saga Pattern
The Saga pattern decomposes a distributed transaction into a sequence of local transactions. Each local transaction updates a single service and publishes an event to trigger the next step. If a step fails, compensating transactions undo the previous steps.
Example: E-commerce Order
- Order Service: Create order (status: PENDING)
- Payment Service: Reserve payment
- Inventory Service: Reserve inventory
- Shipping Service: Schedule shipment
- Order Service: Update order (status: CONFIRMED)
If step 3 (reserve inventory) fails:
- Compensate step 2: Release payment reservation
- Compensate step 1: Cancel order (status: CANCELLED)
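An orchestrated version of this flow can be sketched as an ordered list of (action, compensation) pairs: the orchestrator runs compensations in reverse order when a step fails. All names below are hypothetical stand-ins for service calls, and real compensations must be idempotent and retried.

```python
class SagaStep:
    def __init__(self, name, action, compensation):
        self.name, self.action, self.compensation = name, action, compensation

def run_saga(steps, ctx):
    """Run local transactions in order; on failure, compensate completed steps in reverse."""
    completed = []
    for step in steps:
        try:
            step.action(ctx)
            completed.append(step)
        except Exception:
            for done in reversed(completed):
                done.compensation(ctx)       # compensations must be idempotent
            return "CANCELLED"
    return "CONFIRMED"

def reserve_inventory(ctx):
    raise RuntimeError("out of stock")       # simulated failure at step 3

steps = [
    SagaStep("create_order",
             lambda c: c.update(order="PENDING"),
             lambda c: c.update(order="CANCELLED")),
    SagaStep("reserve_payment",
             lambda c: c.update(payment="RESERVED"),
             lambda c: c.update(payment="RELEASED")),
    SagaStep("reserve_inventory", reserve_inventory, lambda c: None),
]

ctx = {}
print(run_saga(steps, ctx), ctx)  # CANCELLED {'order': 'CANCELLED', 'payment': 'RELEASED'}
```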
Orchestration vs. Choreography:
- Orchestration: A central saga orchestrator tells each service what to do and handles compensation. Easier to understand and debug. Single point of failure.
- Choreography: Each service listens for events and decides its own next action. Decentralized but harder to trace and debug. No single point of failure.
When to use Sagas over 2PC:
- Microservices architecture where each service owns its database
- Long-running transactions where holding locks is impractical
- When eventual consistency is acceptable
- When services use different database technologies
When to use 2PC over Sagas:
- Within a single database cluster
- When strong atomicity is required (financial transactions)
- When the transaction completes in milliseconds, not seconds
For a practical implementation of the saga pattern in a microservices context, see our article on idempotency in distributed systems, which covers how to implement idempotent compensating transactions.
Vector Clocks and Logical Time
In a distributed system, there is no global clock. Each node has its own clock, and clocks drift apart over time. Logical clocks provide a way to order events without relying on physical time.
Lamport Timestamps
Leslie Lamport's logical clocks assign a counter to each event. The rules are simple:
- Each process maintains a counter `C`, initially 0.
- Before executing an event, increment `C`.
- When sending a message, include the current `C` value.
- When receiving a message with timestamp `T`, set `C = max(C, T) + 1`.
Lamport timestamps provide a partial ordering: if event A happened before event B, then C(A) < C(B). However, the converse is not true — C(A) < C(B) does not mean A happened before B. The events might be concurrent.
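The rules translate directly into a small class; this is a sketch with one instance per process and message delivery left to the caller.

```python
class LamportClock:
    """Lamport timestamp rules from above (one counter per process)."""
    def __init__(self):
        self.c = 0

    def local_event(self):
        self.c += 1
        return self.c

    def send(self):
        self.c += 1
        return self.c                 # attach this timestamp to the outgoing message

    def receive(self, t_msg):
        self.c = max(self.c, t_msg) + 1
        return self.c
```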
Vector Clocks
Vector clocks extend Lamport timestamps to detect concurrent events. Each process maintains a vector of counters, one per process.
Rules:
- Each process `i` maintains a vector `V` of size `n` (number of processes), initially all zeros.
- Before executing an event, process `i` increments `V[i]`.
- When sending a message, include the entire vector.
- When receiving a message with vector `V_msg`, set `V[j] = max(V[j], V_msg[j])` for all `j`, then increment `V[i]`.
Comparing vector clocks:
- `V1 < V2` (V1 happened before V2) if every element of V1 is less than or equal to the corresponding element of V2, and at least one is strictly less.
- `V1 || V2` (concurrent) if neither V1 < V2 nor V2 < V1.
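A sketch of the receive rule and the comparison logic; the final example shows two concurrent clocks that a system like Dynamo would have to reconcile.

```python
def vc_merge(own, msg, i):
    """Receive rule: element-wise max with the incoming vector, then increment own entry."""
    merged = [max(a, b) for a, b in zip(own, msg)]
    merged[i] += 1
    return merged

def vc_compare(v1, v2):
    """Return '<' (happened-before), '>', '=', or '||' (concurrent)."""
    le = all(a <= b for a, b in zip(v1, v2))
    ge = all(a >= b for a, b in zip(v1, v2))
    if le and ge:
        return "="
    if le:
        return "<"
    if ge:
        return ">"
    return "||"

print(vc_compare([2, 1, 0], [3, 1, 0]))     # '<'  : first event happened before second
print(vc_compare([2, 0, 1], [1, 1, 1]))     # '||' : concurrent, needs conflict resolution
print(vc_merge([2, 0, 1], [1, 1, 1], i=0))  # [3, 1, 1]
```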
Where vector clocks are used: Amazon Dynamo (original version) used vector clocks for conflict detection. When two writes conflict, the vector clocks tell you whether one is strictly newer or whether they are concurrent (requiring conflict resolution).
Limitations: Vector clocks grow with the number of processes. In a system with thousands of nodes, the vectors become unwieldy. This is why DynamoDB replaced vector clocks with simpler last-writer-wins based on server-side timestamps.
Hybrid Logical Clocks (HLC)
Hybrid logical clocks combine physical timestamps with a logical component. They provide the ordering guarantees of logical clocks while staying close to real time. CockroachDB and MongoDB use HLCs.
The HLC tracks (physical_time, logical_counter). The physical component is the node's wall clock, and the logical component breaks ties when multiple events have the same physical timestamp.
Gossip Protocols
Gossip protocols (also called epidemic protocols) disseminate information through a cluster by having nodes periodically exchange state with random peers. They are decentralized, fault-tolerant, and eventually consistent.
How Gossip Works
- Each node maintains some local state (e.g., cluster membership, heartbeat counters, application data).
- Periodically (every 1-2 seconds), each node selects a random peer and sends its state.
- The receiving node merges the incoming state with its own (using timestamps or version numbers to keep the latest values).
- Over multiple rounds, information spreads to all nodes exponentially fast.
The mathematical property that makes gossip efficient: in a cluster of n nodes, information reaches all nodes in O(log n) rounds of gossip. For a 1000-node cluster, it takes about 10 rounds (10-20 seconds) for all nodes to learn about a change.
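You can check the O(log n) behavior with a toy push-gossip simulation: each informed node tells one random peer per round. Real systems add pull exchanges and higher fanout, which converge even faster.

```python
import math
import random

def gossip_rounds(n, fanout=1):
    """Simulate push gossip: rounds until all n nodes learn a new piece of state."""
    informed = {0}                     # node 0 starts with the update
    rounds = 0
    while len(informed) < n:
        rounds += 1
        for _ in list(informed):
            for _ in range(fanout):
                informed.add(random.randrange(n))   # tell a random peer
    return rounds

trials = [gossip_rounds(1000) for _ in range(20)]
print(sum(trials) / len(trials), "rounds; log2(1000) =", round(math.log2(1000), 1))
```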
Types of Gossip
- Anti-entropy: Nodes periodically compare their entire state and reconcile differences. Used for data repair.
- Rumor mongering: Nodes spread a specific piece of new information until it is stale. More efficient than anti-entropy for propagating updates.
- Aggregation: Nodes compute cluster-wide aggregates (average load, total storage used) through gossip exchanges.
Real-World Uses
Cassandra uses gossip for cluster membership and failure detection. Each node gossips its heartbeat counter and schema version to random peers every second. If a node's heartbeat counter stops incrementing, other nodes mark it as potentially down.
Consul uses the SWIM (Scalable Weakly-consistent Infection-style Membership) protocol, a gossip-based membership protocol. SWIM combines gossip with direct probing for faster failure detection.
Amazon DynamoDB uses gossip to propagate partition assignment and membership information across storage nodes.
Trade-offs
Advantages: No central coordinator, tolerates node failures, scales to large clusters, simple to implement.
Disadvantages: Eventual consistency only (not immediate), bandwidth overhead from periodic exchanges, convergence time depends on gossip interval and cluster size.
Leader Election
Many distributed systems require a single leader to coordinate activities — writing to a database, scheduling jobs, or managing locks. Leader election algorithms ensure that exactly one node serves as leader at any given time, and a new leader is chosen when the current leader fails.
Bully Algorithm
The bully algorithm elects the node with the highest ID.
- When a node detects the leader has failed, it sends an `Election` message to all nodes with higher IDs.
- If no higher-ID node responds, it declares itself leader.
- If a higher-ID node responds, that node takes over the election process.
- The highest-ID node that responds becomes the leader.
Problem: If the highest-ID node is slow or flaky, it may repeatedly win elections but fail to serve as an effective leader.
Raft Leader Election
Raft uses a term-based leader election (described in the Consensus Algorithms section above). The key insight is that Raft combines leader election with log consistency — a candidate cannot win an election unless its log is at least as up-to-date as a majority's.
Leader Election with Distributed Locks
A practical approach used in many production systems: use a distributed lock (Redis, ZooKeeper, etcd) with a TTL.
- All nodes attempt to acquire a lock with a TTL (e.g., 30 seconds).
- The node that acquires the lock is the leader.
- The leader periodically renews the lock before the TTL expires.
- If the leader crashes, the lock expires, and another node acquires it.
Fencing tokens: A critical concept for preventing split-brain scenarios. When a leader acquires the lock, it receives a monotonically increasing fencing token. All requests to the downstream resource include this token. The resource rejects requests with a token lower than the highest token it has seen, preventing a stale leader from making changes after a new leader has been elected.
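A minimal sketch of this pattern with redis-py, assuming a local Redis instance; `NODE_ID` and the key names are hypothetical. Note that the renewal below is a non-atomic check-then-expire; production code would use a Lua script to make the check and renewal atomic.

```python
import time
import redis

r = redis.Redis()                 # assumes a local Redis instance
NODE_ID = "node-7"                # hypothetical identifier for this node
LOCK_TTL = 30                     # seconds

def try_become_leader():
    """Acquire the leader lock with a TTL; on success, obtain a fencing token."""
    acquired = r.set("leader_lock", NODE_ID, nx=True, ex=LOCK_TTL)
    if not acquired:
        return None
    # Monotonically increasing token; downstream resources reject stale tokens.
    return r.incr("leader_fencing_token")

def renew_leadership():
    """Refresh the TTL, but only if this node still holds the lock."""
    if r.get("leader_lock") == NODE_ID.encode():
        r.expire("leader_lock", LOCK_TTL)
        return True
    return False

token = try_become_leader()
if token is not None:
    print(f"leader with fencing token {token}")
    time.sleep(LOCK_TTL / 3)      # renew well before the TTL expires
    renew_leadership()
```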
Our article on building a distributed job scheduler demonstrates leader election with fencing tokens in a production context.
Failure Detection
Detecting whether a remote node has failed is fundamentally impossible to do perfectly — you cannot distinguish between a crashed node and a very slow network. Failure detectors make probabilistic assessments with tunable accuracy.
Heartbeat-Based Detection
The simplest approach: nodes send periodic heartbeats. If a node misses k consecutive heartbeats, it is declared dead.
Parameters:
- Heartbeat interval: How often heartbeats are sent (e.g., every 1 second).
- Timeout: How many missed heartbeats trigger failure detection (e.g., 3 missed = 3 seconds).
Trade-off: A short timeout detects failures quickly but produces more false positives (declaring a slow node as dead). A long timeout reduces false positives but delays failure detection.
Phi Accrual Failure Detector
Used by Cassandra and Akka. Instead of a binary alive/dead decision, the phi accrual detector computes a "suspicion level" (phi) based on the statistical distribution of heartbeat arrival times.
- Track the inter-arrival times of heartbeats from each node.
- Compute the probability that the current delay is due to a crash rather than normal variation.
- Express this as a phi value: `phi = -log10(P_later)`, where `P_later` is the probability that a heartbeat would arrive this late if the node were still alive.
- If phi exceeds a threshold (typically 8), declare the node dead.
Why it is better: The phi detector adapts to network conditions. If heartbeats normally arrive with high variance (e.g., across continents), the detector adjusts its threshold automatically instead of producing false positives.
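A sketch of the idea, modeling heartbeat inter-arrival times with a normal distribution as in the original Hayashibara et al. paper; the window size and threshold are tunable, and the tail-probability calculation is a simplification of production implementations.

```python
import math
import statistics

class PhiAccrualDetector:
    """Phi accrual failure detector (sketch)."""
    def __init__(self, threshold=8.0):
        self.intervals = []              # recent heartbeat inter-arrival times
        self.last_heartbeat = None
        self.threshold = threshold

    def heartbeat(self, now):
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
            self.intervals = self.intervals[-100:]   # sliding window
        self.last_heartbeat = now

    def phi(self, now):
        if len(self.intervals) < 2:
            return 0.0
        mean = statistics.mean(self.intervals)
        std = max(statistics.stdev(self.intervals), 1e-3)
        elapsed = now - self.last_heartbeat
        # P(a heartbeat arrives later than 'elapsed'), normal-tail approximation.
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        return -math.log10(max(p_later, 1e-12))

    def is_dead(self, now):
        return self.phi(now) > self.threshold
```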
SWIM Protocol
SWIM (Scalable Weakly-consistent Infection-style Membership) combines failure detection with gossip-based membership.
- Each node periodically selects a random peer and sends a `Ping`.
- If the peer does not respond within a timeout, the node selects `k` other peers and asks them to ping the unresponsive node (`Ping-req`).
- If none of the `k` peers get a response, the node is declared suspect.
- Suspect status is disseminated through gossip. If the suspect node does not refute the suspicion within a timeout, it is declared dead.
Advantages: O(1) message overhead per node per period (constant, not proportional to cluster size). False positive rate is configurable via the `k` parameter.
Used by: HashiCorp Consul (Serf library), Memberlist.
Distributed Caching
Caching is the most common performance optimization in distributed systems. A well-designed cache can reduce database load by 90%+ and cut response times from hundreds of milliseconds to single-digit milliseconds.
Cache Patterns
Cache-Aside (Lazy Loading)
- Application checks the cache first.
- On a cache miss, the application reads from the database.
- The application writes the value to the cache.
- Subsequent reads are served from the cache.
This is the most commonly used pattern. The application controls what gets cached and when. Stale data is possible if the database is updated without invalidating the cache.
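A minimal cache-aside sketch with redis-py, assuming a local Redis instance; the `Database` class is a hypothetical stand-in for the system of record. The write path deletes the cache entry rather than updating it, so the next read repopulates with fresh data.

```python
import redis

r = redis.Redis()                            # assumes a local Redis instance

class Database:
    """Stand-in for the system of record (hypothetical)."""
    def __init__(self):
        self.rows = {"42": "Ada Lovelace"}
    def load_user(self, user_id):
        return self.rows[user_id]
    def save_user(self, user_id, value):
        self.rows[user_id] = value

CACHE_TTL = 300                              # seconds; bounds how stale an entry can get

def get_user(db, user_id):
    """Cache-aside read: check the cache, fall back to the database, populate the cache."""
    key = f"user:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return cached.decode()               # cache hit
    value = db.load_user(user_id)            # cache miss: read from the database
    r.set(key, value, ex=CACHE_TTL)          # lazy-load into the cache with a TTL
    return value

def update_user(db, user_id, value):
    """Write path: update the database, then invalidate the cache entry."""
    db.save_user(user_id, value)
    r.delete(f"user:{user_id}")              # next read repopulates the cache
```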
Write-Through
- Application writes to the cache.
- The cache synchronously writes to the database.
- Reads always come from the cache.
Data is always consistent between cache and database, but writes are slower (two writes per operation) and cold start requires pre-warming the cache.
Write-Behind (Write-Back)
- Application writes to the cache.
- The cache asynchronously writes to the database (batched, delayed).
- Reads always come from the cache.
Writes are fast but data can be lost if the cache crashes before flushing to the database.
Cache Invalidation
Cache invalidation is famously one of the two hard problems in computer science (along with naming things and off-by-one errors). Common strategies:
- TTL (Time-to-Live): Entries expire after a fixed duration. Simple but allows stale reads until expiration.
- Event-based invalidation: When the database is updated, publish an event that invalidates the corresponding cache entry. Consistent but requires infrastructure (message queue, CDC).
- Version-based invalidation: Include a version number with each cache entry. When the database row is updated, increment the version. Cache entries with old versions are treated as misses.
Distributed Cache Architecture
Partitioned Cache (Sharded)
Cache entries are distributed across nodes using consistent hashing. Each key is stored on one node. This scales horizontally — more nodes means more cache capacity.
Replicated Cache
Every cache node has a copy of every entry. Reads are fast (any node can serve), but writes must be propagated to all nodes. Does not scale for large datasets.
Real-World: How Facebook/Meta Caches at Scale
Meta operates one of the largest Memcached deployments in the world (trillions of requests per day). Key architectural decisions:
- Regional pools: Cache is divided into regional pools. A cache miss in one region does not go to another region — it goes directly to the database. This prevents cross-region cache stampedes.
- Lease mechanism: When a cache miss occurs, the client receives a "lease" (a token) from the cache server. The lease prevents thundering herd: if 1000 requests miss the cache simultaneously, only the first one gets a lease to fill the cache. The other 999 wait.
- McRouter: A proxy layer that handles routing, connection pooling, and failover for Memcached. Applications talk to McRouter, not directly to Memcached nodes.
For how message-based cache invalidation works in practice, see our Kafka vs RabbitMQ comparison.
Message Queues and Event Streaming
Message queues decouple producers from consumers, enabling asynchronous processing, load leveling, and fault tolerance. The choice between a traditional message broker and an event streaming platform has significant architectural implications.
Message Queues vs. Event Streams
Message Queue (RabbitMQ, Amazon SQS):
- Messages are delivered to one consumer (competing consumers pattern).
- Messages are deleted after acknowledgment.
- No replay capability.
- Best for: task distribution, work queues, request-reply patterns.
Event Stream (Apache Kafka, Amazon Kinesis):
- Events are appended to a durable, ordered log.
- Multiple consumers can read the same events independently (consumer groups).
- Events are retained for a configurable duration (days, weeks, forever).
- Full replay capability — consumers can seek to any offset.
- Best for: event sourcing, CDC, stream processing, audit logs.
Delivery Guarantees
At-most-once: Messages may be lost but are never delivered twice. The producer sends and forgets. Used when occasional data loss is acceptable (metrics, logging).
At-least-once: Messages are never lost but may be delivered multiple times. The producer retries until acknowledgment. Consumers must be idempotent. This is the default for most systems.
Exactly-once: Messages are delivered exactly once. Extremely difficult to achieve in practice. Kafka supports exactly-once semantics within a Kafka-to-Kafka pipeline using transactional producers and idempotent consumers. For cross-system exactly-once, you typically combine at-least-once delivery with idempotent processing.
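The standard recipe is therefore at-least-once delivery plus consumer-side deduplication. A minimal sketch: the in-memory set stands in for a durable dedup store (e.g., a database table with a unique key), and `apply_side_effects` is a hypothetical business-logic function.

```python
processed_ids = set()   # in production: a persistent store checked transactionally

def apply_side_effects(message):
    print("processing", message["id"])       # hypothetical business logic

def handle(message):
    """Idempotent consumer: at-least-once delivery + dedup = effectively exactly-once."""
    if message["id"] in processed_ids:
        return                                # duplicate redelivery; safe to skip
    apply_side_effects(message)
    processed_ids.add(message["id"])
    # Acknowledge only after the side effects and the dedup record are durable.

handle({"id": "evt-1"})
handle({"id": "evt-1"})   # redelivered duplicate is a no-op
```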
Ordering Guarantees
- Kafka: Ordered within a partition. Use a partition key to ensure related events go to the same partition.
- RabbitMQ: FIFO within a single queue. No ordering across queues.
- SQS Standard: No ordering guarantee. SQS FIFO: ordered within a message group.
Backpressure
When consumers are slower than producers, the system needs a mechanism to prevent unbounded queue growth:
- Kafka: Consumer-controlled pull model. Consumers read at their own pace. The log accumulates until the retention policy kicks in.
- RabbitMQ: Prefetch count limits the number of unacknowledged messages per consumer. If the prefetch limit is reached, the broker stops delivering.
- Application-level: Circuit breakers, rate limiting, and load shedding at the producer level.
Real-World: How LinkedIn Uses Kafka
LinkedIn (where Kafka was created) uses it as the central nervous system for data flow:
- Over 7 trillion messages per day.
- Connects hundreds of microservices.
- Feeds real-time analytics, ad targeting, abuse detection, and search indexing.
- Uses Kafka as the source of truth for event-sourced data, with derived views materialized from the event log.
Our dedicated article on choosing between Kafka, RabbitMQ, and NATS provides a detailed decision framework with benchmarks.
How to Study This Material
Distributed systems is a vast field. Here is a structured approach to learning it effectively:
Phase 1: Foundations (2-3 weeks)
Start with the concepts that everything else builds on:
- CAP theorem — understand the real trade-off (not "pick two").
- Consistency models — know the spectrum from linearizability to eventual consistency.
- Replication — understand single-leader, multi-leader, and leaderless replication.
- Partitioning — understand hash-based vs. range-based and consistent hashing.
For each topic, be able to explain it in your own words, give a real-world example, and discuss trade-offs.
Phase 2: Algorithms and Protocols (2-3 weeks)
Dive into the algorithms that make distributed systems work:
- Raft consensus — understand leader election, log replication, and safety. Try the interactive Raft visualization at raft.github.io.
- Vector clocks — work through examples by hand.
- Gossip protocols — understand why they scale and where they are used.
- Failure detection — understand the phi accrual detector and SWIM.
Phase 3: Patterns and Practice (2-3 weeks)
Apply the concepts to real system design problems:
- 2PC and Saga — understand when to use each.
- Distributed caching — design a caching layer for a read-heavy system.
- Message queues — design an event-driven architecture.
- Practice system design interview questions that test distributed systems knowledge.
Phase 4: Real Systems (ongoing)
Read engineering blog posts from companies that operate at scale:
- Google (Spanner, Bigtable, MapReduce papers)
- Amazon (Dynamo, Aurora papers)
- Meta (TAO, Memcache, ZippyDB)
- Netflix (microservices, chaos engineering)
Study how Netflix scales to understand how these concepts are applied in practice.
Recommended Reading Order
- "Designing Data-Intensive Applications" by Martin Kleppmann — the single best resource.
- The Raft paper ("In Search of an Understandable Consensus Algorithm").
- The Amazon Dynamo paper ("Dynamo: Amazon's Highly Available Key-value Store").
- The Google Spanner paper ("Spanner: Google's Globally-Distributed Database").
Related Resources
Algoroq Concept Deep Dives
- CAP Theorem Explained — common misconceptions and interview discussion strategies
- Consistent Hashing in Practice — hash ring, virtual nodes, and real-world implementations
System Design Practice
- System Design Interview Questions — practice problems that test distributed systems knowledge
- How to Design a URL Shortener — apply partitioning and caching concepts
- System Design Interview Guide — framework and strategy for system design interviews
Technology Comparisons
- Kafka vs RabbitMQ — detailed comparison for messaging system selection
Company Engineering
- Google System Design Interview Guide — how Google evaluates distributed systems knowledge
- How Netflix Scales — real-world distributed systems architecture
Career Growth
- Senior to Staff Engineer — how distributed systems expertise supports career advancement
Learning Paths
- Algoroq Live Cohort — 12-week program covering distributed systems with live instruction
- Self-Paced Learning — study distributed systems at your own pace with structured content