Achieving Exactly-Once Delivery in Kafka: Theory vs Practice

The distributed systems textbook says exactly-once delivery is impossible. Kafka says they've done it. Both are right — the nuance lies in what "exactly-once" actually means and the boundaries within which Kafka's guarantees hold.

The Theory: Why Exactly-Once Is "Impossible"

The Two Generals' Problem shows that over an unreliable channel, a sender can never be certain its message was received, no matter how many acknowledgments are exchanged. Applied to Kafka: a producer sends a message and the broker stores it, but the acknowledgment is lost. The producer cannot tell whether the message was stored, so it retries, and now the message may be stored twice.
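
The failure mode is easy to reproduce in a toy model. The sketch below is not Kafka code: NaiveBroker and send_with_retry are hypothetical names, and the lost acknowledgment is scripted rather than random so the outcome is deterministic.

```python
class NaiveBroker:
    """Toy broker that appends every message it receives (no deduplication)."""

    def __init__(self, drop_acks_for_attempts):
        self.log = []
        # Attempt numbers (1-based) whose acknowledgment is "lost in transit".
        self.drop_acks_for_attempts = set(drop_acks_for_attempts)
        self.attempts = 0

    def append(self, message):
        """Store the message, then return True only if the ACK gets through."""
        self.attempts += 1
        self.log.append(message)
        return self.attempts not in self.drop_acks_for_attempts


def send_with_retry(broker, message, max_attempts=3):
    """Resend until an ACK arrives; 'not stored' and 'ACK lost' look identical."""
    for attempt in range(1, max_attempts + 1):
        if broker.append(message):
            return attempt
    raise RuntimeError("no acknowledgment received")


broker = NaiveBroker(drop_acks_for_attempts=[1])  # the first ACK is lost
attempts_used = send_with_retry(broker, "order-42")
print(attempts_used, broker.log)
```

The producer behaves correctly from its own point of view, yet the log ends up holding the same message twice.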

[Figure: Two Generals problem: ACK lost, producer retries, broker stores duplicate]

Kafka doesn't solve the Two Generals' Problem. Instead, it narrows the scope: exactly-once within the Kafka ecosystem, between producers and consumers connected to the same Kafka cluster. Once a message leaves Kafka and hits an external system (database, API), you're back to at-least-once.

Idempotent Producers: Deduplication at the Broker

Kafka's first exactly-once building block is the idempotent producer. When enabled, each producer gets a unique Producer ID (PID), and the records it sends to each partition carry monotonically increasing sequence numbers. The broker deduplicates by tracking the last sequence number it has stored for each (PID, partition) pair.

[Figure: Idempotent producer: broker deduplicates using PID and sequence numbers]

With idempotence enabled, the producer can safely retry without creating duplicates. The broker sees "I already have sequence 42 from PID 7 on partition 3" and silently drops the duplicate.
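
As a rough illustration of that check, here is a minimal in-memory model. It is a sketch under simplifying assumptions, not the broker's actual implementation: DedupBroker and its return strings are invented for this example, and real brokers track sequences per batch and also carry a producer epoch.

```python
class DedupBroker:
    """Toy model of broker-side deduplication keyed on (PID, partition)."""

    def __init__(self):
        self.logs = {}      # partition -> list of stored payloads
        self.last_seq = {}  # (pid, partition) -> highest sequence number stored

    def append(self, pid, partition, seq, payload):
        """Return 'stored', 'duplicate', or 'out_of_order' for one record."""
        key = (pid, partition)
        last = self.last_seq.get(key, -1)
        if seq <= last:
            return "duplicate"      # already stored: acknowledge, do not append
        if seq != last + 1:
            return "out_of_order"   # a gap means an earlier record is missing
        self.logs.setdefault(partition, []).append(payload)
        self.last_seq[key] = seq
        return "stored"


broker = DedupBroker()
first = broker.append(pid=7, partition=3, seq=0, payload="a")
retry = broker.append(pid=7, partition=3, seq=0, payload="a")        # retry after a lost ACK
new_session = broker.append(pid=8, partition=3, seq=0, payload="a")  # restarted producer, new PID
print(first, retry, new_session, broker.logs[3])
```

The last call models a restarted producer that was assigned a new PID. The broker has no history for that PID, so the same payload is stored a second time.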

What idempotent producers don't do: They only prevent duplicates within a single producer session. If the producer crashes and restarts, it gets a new PID, and the broker's deduplication state no longer applies to it. For cross-session guarantees, you need transactions.

Transactions: Atomic Multi-Partition Writes

Kafka transactions let a producer atomically write to multiple partitions and commit consumer offsets. Either all of the writes become visible to read_committed consumers, or none of them do.

[Figure: Kafka transaction protocol: InitProducerId through CommitTransaction]

The transactional.id is persistent across restarts. When a producer starts with a transactional ID that was previously in use, Kafka bumps the producer epoch for that ID, which fences the old producer: its in-flight transaction is aborted and its later writes are rejected. This prevents zombie producers from creating duplicates after failover.
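
Fencing can be pictured as an epoch counter per transactional ID. The following is a simplified sketch, assuming a single in-memory coordinator; ToyCoordinator, FencedError, and the ID "orders-processor-0" are hypothetical names for illustration only.

```python
class FencedError(Exception):
    """Raised when a producer with a stale epoch tries to write."""


class ToyCoordinator:
    """Tracks the current epoch for each transactional.id (simplified)."""

    def __init__(self):
        self.epochs = {}  # transactional_id -> current epoch

    def init_producer(self, transactional_id):
        """Register a producer instance; bumping the epoch fences older ones."""
        epoch = self.epochs.get(transactional_id, -1) + 1
        self.epochs[transactional_id] = epoch
        return epoch

    def check(self, transactional_id, epoch):
        """Reject any request that carries an epoch older than the current one."""
        if epoch < self.epochs[transactional_id]:
            raise FencedError(f"{transactional_id}: epoch {epoch} is fenced")


coordinator = ToyCoordinator()
old_epoch = coordinator.init_producer("orders-processor-0")  # original instance
new_epoch = coordinator.init_producer("orders-processor-0")  # replacement after failover

coordinator.check("orders-processor-0", new_epoch)  # accepted
try:
    coordinator.check("orders-processor-0", old_epoch)  # the zombie
    zombie_fenced = False
except FencedError:
    zombie_fenced = True
print(old_epoch, new_epoch, zombie_fenced)
```

Because the decision depends only on the epoch, the zombie is rejected even if it never learns that a replacement was started.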

How Transactions Work Internally

Kafka uses a transaction coordinator (the broker assigned to each transactional.id) and a transaction log (the internal __transaction_state topic) in which the coordinator records each transaction's state.

Records written during a transaction are appended to the partition right away, but they count as uncommitted until the coordinator writes a commit marker to that partition. Consumers with isolation.level=read_committed are not handed records from open transactions, and they skip records from aborted ones.
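
To make the role of the markers concrete, the sketch below filters a toy partition log. The tuple layout and the visible_records helper are assumptions made for this example, and it assumes every transaction in the log has already been ended by a marker.

```python
def visible_records(log):
    """Return the values a read_committed reader would see in a resolved log.

    Each entry is ("data", pid, value) or a control marker ("commit", pid) or
    ("abort", pid). Assumes at most one open transaction per pid at a time.
    """
    # First pass: the next marker from the same producer decides the outcome
    # of every data record that producer wrote before it.
    outcome = {}   # index of data record -> "commit" or "abort"
    pending = {}   # pid -> indexes of data records still awaiting a marker
    for index, entry in enumerate(log):
        kind, pid = entry[0], entry[1]
        if kind == "data":
            pending.setdefault(pid, []).append(index)
        else:
            for data_index in pending.pop(pid, []):
                outcome[data_index] = kind
    # Second pass: keep committed data in log order; markers are never returned.
    return [entry[2] for i, entry in enumerate(log) if outcome.get(i) == "commit"]


log = [
    ("data", 7, "a1"),
    ("data", 9, "b1"),
    ("data", 7, "a2"),
    ("abort", 9),
    ("commit", 7),
]
print(visible_records(log))
```

All three data records are physically in the log, but only the two from the committed transaction reach the reader.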

Consumer-Side: read_committed

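On the consumer side, you need to opt into transactional guarantees by setting isolation.level=read_committed. The default, read_uncommitted, returns records from open and aborted transactions as well.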

[Figure: Read committed and LSO: in-flight transactions block consumers]

With read_committed, the consumer can read only up to the Last Stable Offset (LSO), which is the offset of the first record of the earliest in-flight (uncommitted) transaction. This means a long-running transaction blocks downstream consumers from reading anything past its start offset, even committed records that come after it.

Implication: Keep transactions short. A transaction that runs for 30 seconds blocks all read_committed consumers on that partition for 30 seconds.
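
The blocking effect follows directly from how the LSO is derived. Below is a minimal sketch, assuming the partition is summarized by its high watermark and a map of open transactions; both function names and the parameter shapes are hypothetical, and control markers are ignored for simplicity.

```python
def last_stable_offset(high_watermark, open_transactions):
    """LSO = first offset of the earliest open transaction, else the high watermark.

    `open_transactions` maps a producer id to the offset of the first record
    of its still-open transaction.
    """
    if not open_transactions:
        return high_watermark
    return min(open_transactions.values())


def readable_offsets(high_watermark, open_transactions):
    """Offsets a read_committed consumer may be handed: everything below the LSO."""
    return range(0, last_stable_offset(high_watermark, open_transactions))


# Partition holds offsets 0..9. Producer 7 opened a transaction at offset 4 and
# has not finished; other producers committed records at later offsets.
blocked = readable_offsets(high_watermark=10, open_transactions={7: 4})
unblocked = readable_offsets(high_watermark=10, open_transactions={})
print(list(blocked), list(unblocked))
```

With the transaction open, the consumer stops at offset 3. Once it commits or aborts, all ten offsets become readable.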

The Consume-Transform-Produce Pattern

The canonical exactly-once pattern in Kafka: read from input topic, process, write to output topic, and commit consumer offsets — all in one atomic transaction.

[Figure: Consume-transform-produce: the canonical exactly-once processing loop]

This guarantees that the output of each input record is committed exactly once, even if the processing code itself runs more than once. If the transaction aborts, both the output records and the offset commit are rolled back, so the consumer re-reads and re-processes.
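
The rollback behavior can be demonstrated without a cluster. The sketch below uses an in-memory stand-in whose methods loosely mirror the Java producer's beginTransaction, send, sendOffsetsToTransaction, commitTransaction, and abortTransaction; ToyTransactionalSink and process_batch are hypothetical, and the crash is simulated with a flag.

```python
class ToyTransactionalSink:
    """In-memory stand-in for an output topic plus committed consumer offsets."""

    def __init__(self):
        self.output = []           # committed output records
        self.committed_offset = 0  # next input offset to read after a restart
        self._staged = None

    def begin(self):
        self._staged = {"records": [], "offset": None}

    def send(self, record):
        self._staged["records"].append(record)

    def send_offsets(self, next_offset):
        self._staged["offset"] = next_offset

    def commit(self):
        # Output and offset become visible together, or not at all.
        self.output.extend(self._staged["records"])
        self.committed_offset = self._staged["offset"]
        self._staged = None

    def abort(self):
        self._staged = None


def process_batch(sink, input_log, batch_size, fail_before_commit=False):
    """Run one consume-transform-produce cycle starting at the committed offset."""
    start = sink.committed_offset
    batch = input_log[start:start + batch_size]
    sink.begin()
    try:
        for value in batch:
            sink.send(value.upper())  # the "transform" step
        sink.send_offsets(start + len(batch))
        if fail_before_commit:
            raise RuntimeError("simulated crash before commit")
        sink.commit()
    except RuntimeError:
        sink.abort()


input_log = ["a", "b", "c", "d"]
sink = ToyTransactionalSink()
process_batch(sink, input_log, batch_size=2)
process_batch(sink, input_log, batch_size=2, fail_before_commit=True)  # rolled back
process_batch(sink, input_log, batch_size=2)                           # retried
print(sink.output, sink.committed_offset)
```

The second batch is processed twice, but because the failed attempt left neither output nor an offset behind, the committed output contains each result once.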

Performance Overhead

Transactions aren't free. Here's what they cost:

Metric	Without Transactions	With Transactions	Overhead
Producer throughput	850K msgs/sec	600K msgs/sec	~30%
Producer latency (p50)	2ms	5ms	+3ms
Producer latency (p99)	15ms	45ms	+30ms
Consumer throughput	1.2M msgs/sec	900K msgs/sec	~25%

Benchmarked on a 3-broker cluster with replication factor 3, 6 partitions, and 1 KB messages.
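
The Overhead column mixes relative and absolute figures: throughput rows show the percentage lost against the non-transactional baseline, and latency rows show added milliseconds. A short calculation, using only the numbers from the table and a hypothetical helper name, makes the derivation explicit.

```python
def throughput_drop_percent(baseline, with_transactions):
    """Relative throughput loss as a percentage of the baseline."""
    return (baseline - with_transactions) / baseline * 100


producer_drop = throughput_drop_percent(850_000, 600_000)
consumer_drop = throughput_drop_percent(1_200_000, 900_000)
p50_added_ms = 5 - 2
p99_added_ms = 45 - 15
print(round(producer_drop, 1), round(consumer_drop, 1), p50_added_ms, p99_added_ms)
```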

The overhead comes from:

Extra RPCs to the transaction coordinator
Transaction markers written to each partition
Consumer buffering until commit markers arrive

When At-Least-Once Is Better

Exactly-once in Kafka only covers Kafka-to-Kafka processing. The moment you write to an external system (database, API, cache), the guarantee breaks:

[Figure: EOS boundary: exactly-once within Kafka breaks at external systems]

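For most real-world applications, the answer is: use at-least-once delivery with idempotent consumers. Design your consumer's side effects to be idempotent, for example by keying each write on a unique message ID so that a redelivered message is ignored rather than applied twice.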

This is simpler, faster (no transaction overhead), and covers the external-system case that Kafka transactions can't. Kafka transactions are the right choice for Kafka Streams applications and consume-transform-produce pipelines that stay within Kafka. For everything else, invest in idempotent consumers.
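
One way to build such a consumer is to record each processed message ID in the same database transaction as the side effect. The sketch below uses SQLite from the standard library; the table names, the apply_payment function, and the "topic-partition-offset" style message ID are assumptions for illustration, not a prescribed schema.

```python
import sqlite3


def apply_payment(conn, message_id, account, amount):
    """Apply a payment at most once per message_id; return True if applied."""
    with conn:  # one database transaction: both statements commit or neither does
        inserted = conn.execute(
            "INSERT OR IGNORE INTO processed_messages (message_id) VALUES (?)",
            (message_id,),
        ).rowcount
        if inserted == 0:
            return False  # redelivery: the side effect already happened
        conn.execute(
            "UPDATE balances SET amount = amount + ? WHERE account = ?",
            (amount, account),
        )
        return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_messages (message_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE balances (account TEXT PRIMARY KEY, amount INTEGER)")
conn.execute("INSERT INTO balances VALUES ('alice', 0)")
conn.commit()

first = apply_payment(conn, "orders-3-42", "alice", 100)
redelivered = apply_payment(conn, "orders-3-42", "alice", 100)  # same message again
balance = conn.execute("SELECT amount FROM balances WHERE account = 'alice'").fetchone()[0]
print(first, redelivered, balance)
```

The primary key on message_id does the deduplication, so the consumer can commit Kafka offsets after the database write and tolerate any number of redeliveries.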

Configuration Pitfalls

Common misconfigurations that silently break exactly-once:

acks not set to all: Idempotent producers require acks=all. Without it, a leader failure before replication can lose the deduplication state.
transaction.timeout.ms too low: Default is 60 seconds. If your processing takes longer, the transaction aborts. Set it to match your worst-case processing time plus margin, and keep it at or below the broker's transaction.max.timeout.ms (15 minutes by default).
Reusing transactional.id across unrelated producers: Each transactional ID should map to exactly one logical producer. Sharing IDs causes fencing storms.
Not handling ProducerFencedException: This means another producer instance took over your transactional ID. Don't retry. The instance that receives this exception is the stale one, so close its producer and shut it down.
Consumer enable.auto.commit left on with transactions: Auto-commit bypasses the transactional offset commit. Always set enable.auto.commit=false when using consume-transform-produce.
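
The pitfalls that are visible in configuration can be caught with a small pre-flight check. This is a sketch, not an official tool: find_eos_pitfalls and worst_case_processing_ms are hypothetical, the config keys are the standard Kafka property names, and only explicitly set keys are inspected (except enable.auto.commit, which is treated as enabled when absent) because other defaults vary by client version.

```python
def find_eos_pitfalls(producer_config, consumer_config, worst_case_processing_ms):
    """Return warnings for the config-level pitfalls listed above."""
    warnings = []

    acks = producer_config.get("acks")
    if acks is not None and str(acks) not in ("all", "-1"):
        warnings.append("acks must be 'all' for idempotent or transactional producers")

    timeout_ms = producer_config.get("transaction.timeout.ms")
    if timeout_ms is not None and int(timeout_ms) <= worst_case_processing_ms:
        warnings.append("transaction.timeout.ms does not exceed worst-case processing time")

    transactional = "transactional.id" in producer_config
    auto_commit = str(consumer_config.get("enable.auto.commit", "true")).lower()
    if transactional and auto_commit != "false":
        warnings.append("enable.auto.commit must be false with transactional offset commits")

    return warnings


risky = find_eos_pitfalls(
    producer_config={"transactional.id": "orders-processor-0", "acks": "1",
                     "transaction.timeout.ms": 10_000},
    consumer_config={"isolation.level": "read_committed"},
    worst_case_processing_ms=20_000,
)
safe = find_eos_pitfalls(
    producer_config={"transactional.id": "orders-processor-0", "acks": "all",
                     "transaction.timeout.ms": 60_000},
    consumer_config={"isolation.level": "read_committed", "enable.auto.commit": "false"},
    worst_case_processing_ms=20_000,
)
print(len(risky), safe)
```

Shared transactional IDs and unhandled ProducerFencedException cannot be detected from configuration alone and still need review in code and deployment.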

Kafka's exactly-once is real, but it's exactly-once within Kafka. Know the boundaries, measure the overhead, and consider whether idempotent at-least-once covers your actual requirements. Most of the time, it does.
