System Design: End-to-End Encrypted Chat
System design of an end-to-end encrypted chat application covering Signal Protocol, Double Ratchet algorithm, key distribution, multi-device support, and encrypted group messaging.
Requirements
Functional Requirements:
- End-to-end encrypted messaging where the server cannot read message content
- One-on-one and group conversations with forward secrecy
- Multi-device support (messages decrypt on all user devices)
- Key verification mechanism for users to validate each other's identity
- Encrypted media transfer (images, videos, files)
- Message delivery when recipient is offline
Non-Functional Requirements:
- Encryption adds no more than 50ms to message send/receive latency
- Key exchange must complete within a single round trip
- 99.99% availability of the key distribution service
- Compromise of a single message key must not reveal past or future messages (forward secrecy + future secrecy)
- Support for 100M+ users with millions of concurrent sessions
Scale Estimation
With 100M users and 20 billion messages per day, the system processes approximately 230,000 messages per second on average. Each message includes an encrypted payload (~256 bytes for text) plus about 80 bytes of protocol overhead (message header fields such as the ratchet public key and message counters). The Key Distribution Service stores three key types per device: 1 Identity Key (32 bytes), 1 Signed Pre-Key (32 bytes + 64-byte signature), and 100 One-Time Pre-Keys (32 bytes each), totaling about 3.3KB of raw key material per device. With 100M users averaging 2 devices each, key storage is approximately 670GB before replication and per-row overhead. Key fetches average 500K per second for new session establishment.
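The estimates above can be reproduced, to within rounding, with a few lines of arithmetic. The sketch below is only a back-of-the-envelope check; the function names are illustrative and the byte sizes are the ones quoted in the text (raw key material only, ignoring replication and storage-engine overhead).

```python
SECONDS_PER_DAY = 24 * 60 * 60


def messages_per_second(messages_per_day: int) -> float:
    """Average message rate, assuming traffic is spread evenly over the day."""
    return messages_per_day / SECONDS_PER_DAY


def key_bytes_per_device(one_time_prekeys: int = 100) -> int:
    """Raw public key material stored per device, using the sizes from the text."""
    identity_key = 32
    signed_prekey = 32 + 64  # public key plus signature
    return identity_key + signed_prekey + one_time_prekeys * 32


def total_key_storage_bytes(users: int, devices_per_user: int) -> int:
    """Total raw key storage across all devices."""
    return users * devices_per_user * key_bytes_per_device()


if __name__ == "__main__":
    print(f"{messages_per_second(20_000_000_000):,.0f} messages/second")
    print(f"{key_bytes_per_device():,} bytes per device")
    print(f"{total_key_storage_bytes(100_000_000, 2) / 1e9:.1f} GB of key material")
```

Peak traffic is typically well above the daily average, so capacity planning would start from a multiple of the first figure.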
High-Level Architecture
The system architecture has three security-critical components: the Key Distribution Service, the Message Relay, and the client-side cryptographic engine. The Key Distribution Service (KDS) stores public key material for all users — it never sees private keys. When Alice wants to message Bob for the first time, she fetches Bob's pre-key bundle from the KDS and performs the X3DH (Extended Triple Diffie-Hellman) key agreement entirely on her device to establish a shared secret. From this shared secret, the Double Ratchet algorithm derives a chain of message keys, each used for exactly one message.
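The X3DH step can be sketched as follows. This is a structural illustration only, not a usable implementation: it substitutes a small, insecure modular Diffie-Hellman group for Curve25519 so that it runs with the standard library alone, it omits verification of the Signed Pre-Key signature, and the info string passed to the KDF is an arbitrary placeholder. What it does show is which key pairs are combined and that both sides arrive at the same 32-byte secret.

```python
import hashlib
import hmac
import secrets

# Toy Diffie-Hellman group, for illustration only. A real client would use
# X25519; this small modular group is NOT secure.
P = 2**127 - 1
G = 3


def keypair():
    """Return (private, public) in the toy group."""
    private = secrets.randbelow(P - 3) + 2
    return private, pow(G, private, P)


def dh(private: int, public: int) -> bytes:
    """Shared secret between our private key and the peer's public key."""
    return pow(public, private, P).to_bytes(16, "big")


def kdf(material: bytes) -> bytes:
    """HKDF-SHA256 (zero salt, one output block) over the concatenated DH outputs."""
    prk = hmac.new(b"\x00" * 32, material, hashlib.sha256).digest()
    return hmac.new(prk, b"x3dh-sketch" + b"\x01", hashlib.sha256).digest()


def x3dh_initiator(ik_a_priv, ek_a_priv, ik_b_pub, spk_b_pub, opk_b_pub=None):
    """Alice's side: her identity and ephemeral private keys, Bob's public bundle."""
    material = (
        dh(ik_a_priv, spk_b_pub)    # DH1: Alice identity x Bob signed pre-key
        + dh(ek_a_priv, ik_b_pub)   # DH2: Alice ephemeral x Bob identity
        + dh(ek_a_priv, spk_b_pub)  # DH3: Alice ephemeral x Bob signed pre-key
    )
    if opk_b_pub is not None:
        material += dh(ek_a_priv, opk_b_pub)  # DH4: optional one-time pre-key
    return kdf(material)


def x3dh_responder(ik_b_priv, spk_b_priv, ik_a_pub, ek_a_pub, opk_b_priv=None):
    """Bob's side: mirrors the initiator using his private keys."""
    material = (
        dh(spk_b_priv, ik_a_pub)
        + dh(ik_b_priv, ek_a_pub)
        + dh(spk_b_priv, ek_a_pub)
    )
    if opk_b_priv is not None:
        material += dh(opk_b_priv, ek_a_pub)
    return kdf(material)
```

Bob can run his side later, when he comes online, because Alice's first message carries her identity and ephemeral public keys. That is what lets session setup work while the recipient is offline.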
The Message Relay is intentionally simple: it receives encrypted message blobs from senders and routes them to recipients without any ability to decrypt the content. The relay stores messages for offline recipients in an encrypted message queue (it stores ciphertext, not plaintext). When the recipient comes online, the relay delivers queued messages. The relay sees metadata (who is messaging whom, timestamps, message sizes) but not content — metadata protection requires additional layers like sealed sender (used by Signal).
The client-side cryptographic engine implements the Signal Protocol. It manages the local key store (identity key pair, session state for each contact, pre-key pairs), performs X3DH for session establishment, runs the Double Ratchet for ongoing message encryption, and handles group messaging via the Sender Keys protocol. All cryptographic operations use Curve25519 for key agreement, AES-256-CBC for symmetric encryption, and HMAC-SHA256 for message authentication.
Core Components
Key Distribution Service (KDS)
The KDS stores and serves public key bundles. Each user registers: (1) a long-term Identity Key (Curve25519 public key), (2) a Signed Pre-Key (rotated every 7 days, signed by the Identity Key to prove ownership), and (3) a batch of 100 One-Time Pre-Keys (each used exactly once, consumed on first contact). The KDS is backed by a Cassandra cluster with partition key user_id and clustering key device_id. When a pre-key bundle is fetched, one One-Time Pre-Key is atomically consumed (deleted) using a Cassandra lightweight transaction to prevent reuse. When the supply runs low, the client uploads fresh One-Time Pre-Keys.
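The consume-on-fetch behaviour can be illustrated with a small in-memory stand-in for the KDS. This is a sketch under stated assumptions: the class and field names are hypothetical, a process-local lock plays the role that the Cassandra lightweight transaction plays in the design above, and key values are opaque bytes.

```python
import threading
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class PreKeyBundle:
    identity_key: bytes
    signed_prekey: bytes
    signed_prekey_signature: bytes
    one_time_prekey_id: Optional[int]  # None when the supply is exhausted
    one_time_prekey: Optional[bytes]


class InMemoryKeyStore:
    """Hypothetical single-process stand-in for the Key Distribution Service."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._devices: Dict[Tuple[str, int], dict] = {}

    def register(self, user_id: str, device_id: int, identity_key: bytes,
                 signed_prekey: bytes, signature: bytes,
                 one_time_prekeys: Dict[int, bytes]) -> None:
        with self._lock:
            self._devices[(user_id, device_id)] = {
                "identity_key": identity_key,
                "signed_prekey": signed_prekey,
                "signature": signature,
                "otpks": dict(one_time_prekeys),
            }

    def remaining(self, user_id: str, device_id: int) -> int:
        with self._lock:
            return len(self._devices[(user_id, device_id)]["otpks"])

    def fetch_bundle(self, user_id: str, device_id: int) -> PreKeyBundle:
        # The lock makes "pick a pre-key and delete it" one atomic step, so two
        # concurrent fetches can never receive the same one-time pre-key.
        with self._lock:
            record = self._devices[(user_id, device_id)]
            otpks = record["otpks"]
            if otpks:
                prekey_id = next(iter(otpks))
                prekey = otpks.pop(prekey_id)
            else:
                prekey_id, prekey = None, None
            return PreKeyBundle(record["identity_key"], record["signed_prekey"],
                                record["signature"], prekey_id, prekey)
```

When the returned bundle has no one-time pre-key, the caller still gets the Identity Key and Signed Pre-Key and can establish a session without it.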
Double Ratchet Engine
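The Double Ratchet algorithm provides forward secrecy and future secrecy (also called break-in recovery). It combines two ratchets: a Diffie-Hellman ratchet (a new DH key pair is generated once per turn of the conversation, that is, when a party sends its first message after receiving a new ratchet public key from the peer) and a symmetric-key ratchet (a KDF chain that derives a new message key for every message). When Alice starts a new sending turn, she: (1) advances her DH ratchet by generating a new ephemeral Curve25519 key pair, (2) derives a new root key and sending chain key from the DH shared secret, (3) derives a message key from the chain key, (4) encrypts the message with AES-256-CBC using the message key and authenticates it with HMAC-SHA256, and (5) attaches her current ratchet public key to the message header. Further messages in the same turn repeat only steps (3) to (5). On receipt, Bob performs the matching DH ratchet step when he sees a new ratchet public key, then derives the same message key from his receiving chain to decrypt.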
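The symmetric-key ratchet (the KDF chain) is the simpler half and can be sketched with the standard library. The example assumes HMAC-SHA256 as the chain KDF with the single-byte constants 0x01 and 0x02 separating the message key from the next chain key, which is one commonly recommended instantiation; the class name and counter handling are illustrative, and the DH ratchet that would periodically replace the chain key is not shown.

```python
import hashlib
import hmac
from typing import Tuple


def kdf_chain_step(chain_key: bytes) -> Tuple[bytes, bytes]:
    """Derive (next_chain_key, message_key) from the current chain key."""
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return next_chain_key, message_key


class SymmetricRatchet:
    """One sending or receiving chain; each message key is used exactly once."""

    def __init__(self, chain_key: bytes) -> None:
        self.chain_key = chain_key
        self.counter = 0

    def next_message_key(self) -> Tuple[int, bytes]:
        # The old chain key is overwritten, so earlier message keys cannot be
        # re-derived from the state that remains on the device.
        self.chain_key, message_key = kdf_chain_step(self.chain_key)
        number = self.counter
        self.counter += 1
        return number, message_key
```

Because HMAC is one-way, an attacker who captures the current chain key can derive later keys in that chain but not earlier ones. The DH ratchet is what cuts off the forward direction by mixing fresh DH output into a new chain key.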
Sealed Sender (Metadata Protection)
To protect sender identity from the server, the system implements sealed sender delivery. The sender encrypts not just the message content but also the sender identity inside the encrypted envelope. The outer envelope contains only the recipient identifier and the ciphertext. The server routes based on the recipient but cannot determine who sent the message. This is achieved by having the sender encrypt their identity along with the message payload using the recipient's identity key, creating a two-layer encryption scheme.
Database Design
The Key Distribution Service uses Cassandra with two tables. The identity_keys table: partition key user_id, clustering key device_id, columns identity_public_key (32 bytes), signed_prekey (32 bytes), signed_prekey_signature (64 bytes), signed_prekey_id, registered_at. The one_time_prekeys table: partition key (user_id, device_id), clustering key prekey_id, column prekey_public (32 bytes). One-time pre-keys are deleted upon consumption with a lightweight transaction, DELETE FROM one_time_prekeys WHERE user_id = ? AND device_id = ? AND prekey_id = ? IF EXISTS, so that two concurrent fetches cannot both claim the same key: only the request whose delete is applied may hand out that pre-key.
The encrypted message store uses Cassandra with partition key recipient_user_id, clustering key (device_id, timestamp). Columns: sender_encrypted_header (sealed sender blob), ciphertext, message_type (prekey_message or standard_message). Messages are deleted after successful delivery acknowledgment. A TTL of 30 days ensures undelivered messages don't accumulate indefinitely. No indexes exist on message content because the server literally cannot read it — all queries are by recipient only.
API Design
- PUT /api/v1/keys/register: Upload key bundle {identity_key, signed_prekey, signed_prekey_signature, one_time_prekeys[]}; called on first registration and to replenish pre-keys
- GET /api/v1/keys/{user_id}/{device_id}: Fetch pre-key bundle for session establishment; atomically consumes one one-time pre-key
- PUT /api/v1/messages/{recipient_id}: Send encrypted message {device_messages: [{device_id, type, ciphertext, sealed_sender_header}]}; server stores and routes without decryption
- GET /api/v1/messages: Fetch pending encrypted messages for the authenticated user; returns array of ciphertext blobs
Scaling & Bottlenecks
The Key Distribution Service is the critical bottleneck. Every new conversation requires a pre-key fetch, and popular users (celebrities, support accounts) may have thousands of new conversations per hour, rapidly depleting their one-time pre-key supply. If the supply is exhausted, sessions fall back to using only the Signed Pre-Key (without a one-time pre-key). This weakens forward secrecy for the initial messages: until the recipient replies and the first DH ratchet step occurs, those messages depend on the Signed Pre-Key, which persists until its next rotation, rather than on a key that is deleted after a single use. The mitigation is aggressive pre-key replenishment: clients upload 100 new pre-keys whenever the server count drops below 20.
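The replenishment rule is simple enough to state as code. In this sketch the threshold of 20 and the batch size of 100 come from the text; the function names are hypothetical, and random 32-byte strings stand in for real Curve25519 public keys (a real client would generate key pairs and keep the private halves in its local key store).

```python
import secrets
from typing import Dict

LOW_WATERMARK = 20  # replenish when the server holds fewer than this many
BATCH_SIZE = 100    # number of one-time pre-keys uploaded per replenishment


def needs_replenishment(server_count: int) -> bool:
    return server_count < LOW_WATERMARK


def generate_prekey_batch(next_id: int, size: int = BATCH_SIZE) -> Dict[int, bytes]:
    """Placeholder batch: pre-key id -> 32 random bytes standing in for a public key."""
    return {next_id + i: secrets.token_bytes(32) for i in range(size)}


def replenish_if_low(server_count: int, next_id: int) -> Dict[int, bytes]:
    """Return the batch to upload, or an empty dict if the supply is sufficient."""
    if not needs_replenishment(server_count):
        return {}
    return generate_prekey_batch(next_id)
```

A client would typically run this check whenever the server reports the remaining count, for example on connect. For accounts that receive very many new sessions, a larger batch or a higher watermark may be needed to keep the supply from reaching zero between checks.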
Multi-device support creates message multiplication: a message sent to a user with 3 devices must be encrypted 3 times (once per device, since each device has its own identity key and session state). For a group of 50 members with 2 devices each, a single message requires 100 encryption operations on the sender's device and 100 ciphertext blobs transmitted and stored. The Sender Keys optimization for groups reduces this to O(1) encryption (one symmetric key for the whole group) plus O(N) Sender Key distribution messages sent via pairwise sessions only when the group membership changes.
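The difference between the two group strategies can be made concrete with a small cost model. This is a rough sketch: it counts only sender-side encryption operations, treats every Sender Key distribution as one pairwise-encrypted message per recipient device, and uses illustrative function names.

```python
def pairwise_encryptions(members: int, devices_per_member: int, messages: int = 1) -> int:
    """Every message is encrypted once per recipient device."""
    return members * devices_per_member * messages


def sender_key_encryptions(members: int, devices_per_member: int,
                           messages: int, key_distributions: int = 1) -> int:
    """One symmetric encryption per message, plus one pairwise-encrypted
    Sender Key distribution per device each time the key is (re)distributed."""
    return messages + key_distributions * members * devices_per_member


if __name__ == "__main__":
    # The example from the text: 50 members with 2 devices each.
    print(pairwise_encryptions(50, 2))  # one message
    print(pairwise_encryptions(50, 2, messages=1000))
    print(sender_key_encryptions(50, 2, messages=1000, key_distributions=1))
```

The model also shows where Sender Keys stop paying off: if membership changes about as often as messages are sent, the number of key distributions approaches the number of messages and the cost converges back toward the pairwise case.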
Key Trade-offs
- Signal Protocol over custom encryption: Signal Protocol is publicly specified, peer-reviewed, and used by WhatsApp, Signal, and Facebook Messenger, which makes it a safer choice than a home-grown design; the trade-off is implementation complexity (managing ratchet state, pre-key lifecycle) vs. a simpler symmetric encryption scheme
- One-Time Pre-Keys for forward secrecy: Consuming a unique pre-key per new session ensures that compromising a key bundle doesn't allow decrypting past sessions, but requires ongoing pre-key replenishment and atomic consumption to prevent reuse
- Sender Keys for group encryption over pairwise encryption: Sender Keys reduce group encryption from O(N) to O(1) per message, but require re-keying when any member leaves (to ensure the departed member can't decrypt future messages), which is expensive for large groups
- Metadata protection (sealed sender) vs. simplicity: Hiding sender identity from the server adds a layer of privacy but doubles the encryption overhead (two-layer envelope) and prevents the server from enforcing sender-based policies like spam filtering