
System Design: Event Streaming Platform

System design of a Kafka-scale event streaming platform covering distributed log architecture, consumer group coordination, exactly-once semantics, and multi-datacenter replication for millions of events per second.

19 min read · Updated Jan 15, 2025

Tags: system-design · event-streaming · kafka · distributed-systems

Requirements

Functional Requirements:

  • Producers publish events to named topics with configurable partitioning (key-based, round-robin, or custom)
  • Consumers subscribe to topics and read events in order within each partition using consumer groups
  • Events are durably stored for a configurable retention period (time-based or size-based)
  • Exactly-once delivery semantics for producers and consumers with transactional support
  • Topic management: create, delete, configure retention, partition count, replication factor
  • Schema registry for event schema evolution with backward/forward compatibility checks

Non-Functional Requirements:

  • Handle 5 million events/sec write throughput with P99 latency under 10ms
  • Support 100K topics with up to 1,000 partitions each
  • 99.999% availability for the event log (five-nines — core infrastructure)
  • Zero data loss: every acknowledged event must survive any single node failure
  • Horizontal scalability: adding brokers increases throughput linearly

Scale Estimation

5M events/sec at an average of 1KB per event = 5GB/sec ingest throughput. With replication factor 3: 15GB/sec total write throughput across the cluster. Daily data volume: 5GB/sec × 86,400 sec = 432TB/day. At 7-day retention: ~3PB of unique data, roughly 9PB on disk once replication is included. Read throughput: consumers typically read at 2-3x write throughput because multiple consumer groups read the same data, i.e. 10-15GB/sec. Disk IOPS: sequential writes dominate; 100 brokers with NVMe SSDs provide ample throughput. Network: 5GB/sec of producer ingest = 40 Gbps, replication traffic 10GB/sec = 80 Gbps, consumer reads 10-15GB/sec = 80-120 Gbps, for roughly 200-240 Gbps of aggregate cluster bandwidth. Metadata: 100K topics × 100 partitions on average = 10M partition metadata entries.
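These back-of-envelope numbers can be reproduced with a few lines of arithmetic; the 1KB average event size (taken as 1,000 bytes here) and 7-day retention are the stated assumptions:

```python
# Back-of-envelope capacity math for the stated workload.
EVENTS_PER_SEC = 5_000_000
EVENT_SIZE = 1_000           # bytes, assumed average (1KB decimal)
REPLICATION = 3
RETENTION_DAYS = 7

ingest = EVENTS_PER_SEC * EVENT_SIZE        # bytes/sec from producers
cluster_write = ingest * REPLICATION        # bytes/sec incl. replica copies
daily = ingest * 86_400                     # bytes/day (unreplicated)
stored = daily * RETENTION_DAYS             # unique bytes retained
stored_on_disk = stored * REPLICATION       # bytes on disk incl. replicas

print(f"ingest:        {ingest / 1e9:.1f} GB/sec")
print(f"cluster write: {cluster_write / 1e9:.1f} GB/sec")
print(f"daily volume:  {daily / 1e12:.0f} TB/day")
print(f"retained:      {stored / 1e15:.1f} PB unique, "
      f"{stored_on_disk / 1e15:.1f} PB on disk")
```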

High-Level Architecture

The platform follows a distributed commit log architecture (similar to Apache Kafka). The Broker Layer consists of 100+ broker nodes, each managing a subset of topic-partitions. Each partition is an append-only log stored on disk as a sequence of segment files (1GB each). Writes are sequential (append only), which is critical for performance — sequential disk I/O on modern NVMe SSDs delivers 2-3 GB/sec per disk. Each partition has a configurable replication factor (typically 3): one leader and two followers. Producers write to the leader; followers replicate by fetching from the leader.

The Controller handles cluster metadata and partition assignment. A single controller node (elected via Raft consensus among a controller quorum of 3-5 nodes) maintains the metadata log: topic configurations, partition assignments (which broker is leader for which partition), consumer group offsets, and ACLs. When a broker fails, the controller reassigns its leader partitions to in-sync replicas (ISR) within seconds. The metadata log itself is replicated across the controller quorum for fault tolerance.

The Client Layer includes producer and consumer libraries. The producer client partitions events (using a hash of the event key modulo partition count), batches events per partition for efficiency (configurable batch.size and linger.ms), optionally compresses batches (LZ4 or Zstandard), and sends them to the partition leader. The consumer client joins a consumer group, is assigned a subset of partitions by the group coordinator (a broker designated for that group), and reads events sequentially from each assigned partition, committing offsets to mark progress.

Core Components

Distributed Log Storage

Each partition is stored as a sequence of segment files on disk. The active segment receives new writes (append only); when it reaches 1GB or a time threshold, it is closed and a new segment begins. Each segment has an associated sparse index file mapping event offsets to file positions, enabling fast lookups: a binary search on the index (O(log n) in the number of index entries) finds the nearest indexed position, and a short forward scan locates the exact event. Data is written to the OS page cache and flushed to disk periodically (or on fsync for durability guarantees). The OS page cache effectively serves as a read cache — recent events (the hot tail) are served from memory without hitting disk. Compacted topics (for changelog/state store use cases) run a background compaction process that retains only the latest value per key, removing obsolete entries.
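The offset index can be sketched as a sorted list of (offset, byte position) entries searched with bisection. This is a simplification: real indexes are sparse, memory-mapped files, and the class name here is hypothetical.

```python
import bisect

class SegmentIndex:
    """Sparse offset index for one log segment: maps event offsets to
    byte positions in the segment file. Entries are appended in offset
    order, so lookup is a binary search."""

    def __init__(self, base_offset: int):
        self.base_offset = base_offset
        self.offsets = []     # sorted event offsets with index entries
        self.positions = []   # matching byte positions in the segment

    def append(self, offset: int, position: int) -> None:
        self.offsets.append(offset)
        self.positions.append(position)

    def lookup(self, target: int) -> int:
        """Return the byte position of the closest indexed entry at or
        before `target`; scanning forward from there finds the event."""
        i = bisect.bisect_right(self.offsets, target) - 1
        return self.positions[i] if i >= 0 else 0

idx = SegmentIndex(base_offset=1000)
idx.append(1000, 0)
idx.append(1100, 524_288)    # sparse: one entry per chunk of log data
idx.append(1200, 1_048_576)
print(idx.lookup(1150))  # 524288: start scanning forward from there
```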

Replication & ISR Protocol

Replication uses a leader-follower model with in-sync replica (ISR) tracking. Followers continuously fetch from the leader. A follower is in the ISR if it has replicated up to the leader's log end offset within a configurable lag threshold (replica.lag.time.max.ms = 10 seconds). The leader tracks the high watermark (HW) — the latest offset replicated to all ISR members. Consumers only see events up to the HW, ensuring they never read data that could be lost if the leader fails. When a leader fails, the controller elects a new leader from the ISR, guaranteeing zero data loss (assuming ISR has >= 1 member). Producers using acks=all wait for all ISR members to replicate before receiving an acknowledgment.
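The high-watermark rule above reduces to: HW = the minimum log-end offset across the current ISR. A minimal sketch (class and method names are hypothetical; real brokers also track leader epochs and fetch sessions):

```python
class PartitionLeader:
    """Sketch of ISR tracking and high-watermark advancement."""

    def __init__(self, replicas):
        self.log_end = {r: 0 for r in replicas}  # per-replica log end offset
        self.isr = set(replicas)
        self.high_watermark = 0

    def append(self, leader: str, n: int) -> None:
        self.log_end[leader] += n

    def follower_fetched(self, follower: str, offset: int) -> None:
        self.log_end[follower] = offset
        self._advance_hw()

    def shrink_isr(self, follower: str) -> None:
        """A lagging follower leaves the ISR; the HW may jump forward."""
        self.isr.discard(follower)
        self._advance_hw()

    def _advance_hw(self) -> None:
        # HW = min log-end offset over the ISR: everything below it is
        # on every in-sync replica, so consumers may read up to here.
        self.high_watermark = min(self.log_end[r] for r in self.isr)

p = PartitionLeader(["b1", "b2", "b3"])
p.append("b1", 100)           # leader has 100 events
p.follower_fetched("b2", 100)
p.follower_fetched("b3", 40)  # b3 lags behind
print(p.high_watermark)       # 40: gated by the slowest ISR member
p.shrink_isr("b3")
print(p.high_watermark)       # 100: b3 no longer holds back the HW
```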

Exactly-Once Semantics

Exactly-once delivery is achieved through two mechanisms. Producer idempotency: each producer is assigned a producer ID (PID) by the broker. Each event sent by the producer includes the PID and a monotonically increasing sequence number. The broker deduplicates events with the same PID + sequence number, preventing duplicates from retries. Transactional producing: a producer can begin a transaction, write events to multiple partitions atomically, and commit. Internally, this uses a two-phase commit protocol coordinated by a Transaction Coordinator (a broker role). Consumer exactly-once: consumers commit offsets within the same transaction as their output writes, ensuring processing and offset commit are atomic. This requires the consumer output to also be the event streaming platform (read-process-write pattern).
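The producer-idempotency mechanism can be sketched as broker-side bookkeeping: the broker accepts only the next expected sequence number per producer ID and silently drops retried duplicates. A minimal sketch with hypothetical names:

```python
class IdempotentLog:
    """Sketch of broker-side dedup for idempotent producers: each
    producer ID has a next-expected sequence number; retries that
    resend an old sequence are dropped instead of appended twice."""

    def __init__(self):
        self.events = []
        self.next_seq = {}  # producer_id -> next expected sequence

    def append(self, pid: int, seq: int, value: bytes) -> bool:
        expected = self.next_seq.get(pid, 0)
        if seq < expected:
            return False          # duplicate from a retry: ignore
        if seq > expected:
            raise ValueError("sequence gap: out-of-order produce")
        self.events.append(value)
        self.next_seq[pid] = seq + 1
        return True

log = IdempotentLog()
log.append(pid=7, seq=0, value=b"a")
log.append(pid=7, seq=1, value=b"b")
log.append(pid=7, seq=1, value=b"b")   # network retry: deduplicated
print(len(log.events))                 # 2: the retry was dropped
```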

Database Design

The platform does not use a traditional database — the event log itself is the primary data store. Metadata is stored in the controller's replicated log (Raft-based). Key metadata structures:

  • TopicMetadata (topic_name, partition_count, replication_factor, retention_ms, retention_bytes, cleanup_policy, configs)
  • PartitionMetadata (topic, partition_id, leader_broker, isr_brokers[], replicas[], leader_epoch)
  • ConsumerGroupMetadata (group_id, state, members[], partition_assignments[], committed_offsets{})

Committed consumer offsets are stored in a special internal topic (__consumer_offsets) partitioned by group_id hash. This self-hosting approach means offset storage scales with the platform itself. The schema registry (a separate stateful service) stores schemas in a PostgreSQL database or an internal Kafka topic: schemas (schema_id, subject, version, schema_type ENUM(avro, protobuf, json_schema), schema_text, compatibility_mode, created_at). Compatibility checking (backward, forward, full) validates new schema versions against existing ones before registration.
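The routing into __consumer_offsets is a plain hash of the group id over the internal topic's partition count. A minimal sketch; the 50-partition count is illustrative, not the platform's stated default:

```python
import zlib

OFFSETS_TOPIC_PARTITIONS = 50  # illustrative partition count

def offsets_partition(group_id: str) -> int:
    """All offset commits for one consumer group land in one partition
    of __consumer_offsets, so the broker leading that partition serves
    as the group's coordinator."""
    return zlib.crc32(group_id.encode()) % OFFSETS_TOPIC_PARTITIONS

# The same group always maps to the same partition, and therefore the
# same coordinator, for as long as the partition count is unchanged.
print(offsets_partition("billing-pipeline"))
```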

API Design

  • POST /topics/{topic}/produce — Produce events to a topic; body contains key, value, headers, partition (optional); returns offset and partition
  • GET /topics/{topic}/partitions/{partition}/events?offset={offset}&max_bytes=1048576 — Fetch events from a partition starting at an offset
  • POST /consumer-groups/{group}/subscribe — Subscribe a consumer to topics; body contains topics list and consumer config
  • PUT /consumer-groups/{group}/offsets — Commit consumer offsets; body contains topic-partition-offset mappings

Scaling & Bottlenecks

Network bandwidth is the primary bottleneck at scale. With replication factor 3, every byte written is transmitted 3 times (producer → leader, leader → follower1, leader → follower2) plus read by consumers. A single broker with a 25 Gbps NIC can handle ~800 MB/sec of producer writes (accounting for replication overhead). 100 brokers provide ~80 GB/sec aggregate producer throughput, well above the 5 GB/sec requirement. Adding brokers increases throughput linearly because partitions are distributed across brokers. Partition reassignment during broker addition uses a throttled transfer process to avoid saturating network links.
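The per-broker budget can be checked with the same arithmetic, assuming a 25 Gbps full-duplex NIC, replication factor 3, and consumer read traffic at twice the write rate (the read-amplification figure is an assumption for this sketch):

```python
NIC_GBPS = 25          # full-duplex NIC: 25 Gbps each direction
REPLICATION = 3
READ_AMP = 2           # consumer reads relative to writes (assumed)

nic_bytes = NIC_GBPS / 8 * 1e9   # bytes/sec in one direction

# For W bytes/sec of producer writes to partitions this broker leads,
# egress carries (REPLICATION - 1) copies to followers plus consumer
# reads, so egress = (REPLICATION - 1 + READ_AMP) * W <= NIC capacity.
max_writes = nic_bytes / (REPLICATION - 1 + READ_AMP)
print(f"max producer writes per broker: {max_writes / 1e6:.0f} MB/sec")

brokers = 100
print(f"cluster producer capacity: {brokers * max_writes / 1e9:.0f} GB/sec")
```

Egress is the binding constraint here; it lands near the ~800 MB/sec per-broker figure quoted above.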

Partition count is the scalability knob for parallelism: more partitions enable more concurrent producers and consumers. However, partition count has costs: each partition requires memory for index buffers (~10MB), and leader election during broker failure takes O(partitions) time. At 10M partitions across the cluster, a broker failure (hosting 100K partitions) requires 100K leader elections — optimized via batch election (all partitions on a failed broker are reassigned in a single metadata update). The controller's metadata log grows linearly with partition count; the Raft-based controller handles 10M partition entries with sub-second failover.
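Batch election can be sketched as a single pass over the failed broker's partitions that emits one consolidated metadata update, rather than 100K individual elections (function and field names are hypothetical):

```python
def batch_elect(partitions: dict, failed_broker: str) -> dict:
    """For every partition led by the failed broker, promote the first
    surviving ISR member; return all changes as one metadata update."""
    update = {}
    for pid, meta in partitions.items():
        if meta["leader"] != failed_broker:
            continue
        survivors = [b for b in meta["isr"] if b != failed_broker]
        if not survivors:
            continue  # no in-sync replica left: partition goes offline
        update[pid] = {"leader": survivors[0],
                       "isr": survivors,
                       "leader_epoch": meta["leader_epoch"] + 1}
    return update

partitions = {
    ("orders", 0): {"leader": "b1", "isr": ["b1", "b2", "b3"], "leader_epoch": 4},
    ("orders", 1): {"leader": "b2", "isr": ["b2", "b3"], "leader_epoch": 1},
}
update = batch_elect(partitions, failed_broker="b1")
print(update)  # only ("orders", 0) changes leader, in one update
```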

Key Trade-offs

  • Append-only log vs mutable storage: The append-only architecture enables sequential I/O (critical for throughput) and simple replication, but means deleted events still occupy disk until segment compaction — retention policies manage this trade-off
  • ISR-based replication vs quorum-based: ISR provides zero data loss with acks=all while adapting to slow replicas (they fall out of the ISR rather than blocking writes), but ISR shrinkage to 1 member means a single node failure can cause data loss — min.insync.replicas=2 prevents this at the cost of reduced write availability
  • Exactly-once semantics vs at-least-once: Exactly-once adds per-event sequence numbers and transactional overhead (~10% throughput reduction), but eliminates the need for downstream deduplication — worthwhile for financial and critical data pipelines
  • Single controller vs distributed metadata: A single controller (elected via Raft) simplifies metadata management but is a potential bottleneck for large clusters — the Raft quorum ensures sub-second failover, and metadata operations are infrequent compared to data operations
