
System Design: Email System (Gmail-scale)

System design of a Gmail-scale email system covering SMTP ingestion, spam filtering, full-text search over petabytes of mail, and serving 1.8 billion active mailboxes.

18 min read · Updated Jan 15, 2025
system-design · email · gmail · smtp

Requirements

Functional Requirements:

  • Send and receive emails via SMTP with support for attachments up to 25MB
  • Organize emails with labels, folders, stars, and filters
  • Full-text search across all email content, attachments, and metadata
  • Spam and phishing detection with near-zero false positive rate
  • Conversation threading that groups related emails
  • Multi-device access via IMAP, POP3, and web/mobile clients

Non-Functional Requirements:

  • 1.8 billion active accounts; 300 billion emails sent/received per day globally
  • Email delivery latency under 10 seconds at the 99th percentile
  • 99.99% availability; zero email loss (regulatory and legal implications)
  • Search latency under 500ms even for accounts with millions of emails
  • 15GB free storage per account with efficient deduplication

Scale Estimation

300 billion emails per day translates to roughly 3.47 million emails per second. At an average email size of approximately 75KB (headers, body, and small attachments), that produces ~22.5PB of new email data per day. With 1.8 billion accounts at 15GB each, the total provisioned quota is 27 exabytes, requiring massive distributed storage. The spam filter processes all inbound email: roughly 150 billion inbound emails per day, of which about 45% is spam and is filtered before reaching the inbox. Search indexes grow by approximately 5PB per day. Attachment deduplication (common in enterprise, where many recipients receive the same attachment) reduces storage by an estimated 30%.
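
The estimates above can be reproduced with a quick back-of-envelope calculation (a sketch; the constants mirror this article's assumptions, using decimal units, not measured data):

```python
EMAILS_PER_DAY = 300e9
AVG_EMAIL_BYTES = 75e3          # ~75KB per email, per the estimate above
ACCOUNTS = 1.8e9
QUOTA_BYTES = 15e9              # 15GB free quota per account

emails_per_sec = EMAILS_PER_DAY / 86_400                   # ~3.47M emails/sec
new_pb_per_day = EMAILS_PER_DAY * AVG_EMAIL_BYTES / 1e15   # 22.5PB of new data/day
total_quota_eb = ACCOUNTS * QUOTA_BYTES / 1e18             # 27EB provisioned quota
```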

High-Level Architecture

The email system has four major subsystems: ingestion, storage, delivery, and search. The Ingestion Pipeline handles incoming email via a fleet of SMTP servers that accept connections from external mail servers (MTA-to-MTA communication). Each incoming email passes through: (1) connection-level checks (SPF, DKIM, DMARC authentication), (2) spam/phishing classification (ML model scoring), (3) virus scanning of attachments, (4) content processing (thread detection, label auto-classification). Clean emails are written to the user's mailbox in the Storage Layer.
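
The four-stage check sequence can be sketched as a short-circuiting pipeline. All stage functions below are hypothetical stand-ins; the real checks (SPF/DKIM/DMARC verification, ML scoring, antivirus) live in dedicated services:

```python
from dataclasses import dataclass, field

@dataclass
class Email:
    sender: str
    recipient: str
    body: str
    labels: list = field(default_factory=list)

def auth_checks(email):       # (1) SPF/DKIM/DMARC alignment (toy stand-in)
    return not email.sender.endswith("@spoofed.invalid")

def spam_score(email):        # (2) ML classifier spam probability (toy stand-in)
    return 0.95 if "WIN A PRIZE" in email.body else 0.02

def scan_attachments(email):  # (3) virus scanning (assume clean in this sketch)
    return True

def process_content(email):   # (4) thread detection + auto-labels (stub)
    email.labels.append("inbox")
    return email

def ingest(email, spam_threshold=0.7):
    """Returns the processed email, or None if rejected at connection level."""
    if not auth_checks(email):
        return None                      # reject before accepting the message
    if spam_score(email) >= spam_threshold:
        email.labels.append("spam")      # quarantine instead of inbox delivery
        return email
    if not scan_attachments(email):
        return None
    return process_content(email)
```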

The Storage Layer is built on a distributed blob store (Google uses Bigtable + Colossus internally; an open-source equivalent would be HBase + HDFS). Each email is stored as a blob with metadata in a structured store. The storage model uses a per-user partition: all emails for a user are stored together, enabling efficient mailbox operations (list, search, delete). Attachments over 1MB are stored separately in a blob store with content-addressed keys for deduplication — if 100 employees receive the same 10MB PDF, only one copy is stored.
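
The content-addressed deduplication described above can be illustrated in a few lines: because the blob key is the SHA-256 of the bytes, identical attachments collapse to a single stored copy (a minimal in-memory sketch, not a production store):

```python
import hashlib

class BlobStore:
    """Content-addressed store: identical bytes always map to one blob."""
    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(key, data)   # second put of identical bytes is a no-op
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

store = BlobStore()
k1 = store.put(b"the same 10MB PDF sent to 100 recipients")
k2 = store.put(b"the same 10MB PDF sent to 100 recipients")  # deduplicated: k1 == k2
```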

The Delivery Layer handles outbound email via SMTP and real-time inbox push. When a user sends an email, the Outbound SMTP Service resolves the recipient's MX record via DNS, establishes a TLS connection, and delivers the email. For intra-system delivery (sender and recipient are both Gmail users), the email bypasses SMTP entirely and is written directly to the recipient's mailbox. Real-time inbox updates are pushed to connected clients via a persistent connection (Server-Sent Events or WebSocket).
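
The local-vs-external delivery split can be sketched as a simple routing decision; `resolve_mx` and `smtp_deliver` are hypothetical placeholders for a real DNS query and a TLS SMTP session:

```python
LOCAL_DOMAIN = "gmail.example"   # assumed local domain for this sketch
mailboxes = {}                   # recipient -> list of delivered messages

def resolve_mx(domain):
    return f"mx1.{domain}"       # placeholder for a real DNS MX lookup

def smtp_deliver(host, sender, recipient, message):
    pass                         # placeholder for an outbound TLS SMTP session

def deliver(sender, recipient, message):
    domain = recipient.rsplit("@", 1)[1]
    if domain == LOCAL_DOMAIN:
        # intra-system delivery: bypass SMTP, write straight to the mailbox store
        mailboxes.setdefault(recipient, []).append(message)
        return "local"
    mx_host = resolve_mx(domain)
    smtp_deliver(mx_host, sender, recipient, message)
    return "smtp"
```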

Core Components

Spam Filter

The spam filter is a multi-stage ML pipeline. Stage 1 (connection-level): IP reputation scoring using a trained model on sender IP history, checking against real-time blacklists (RBLs), and verifying SPF/DKIM/DMARC alignment. Stage 2 (content-level): a deep learning text classifier (transformer-based) trained on billions of labeled emails, scoring the probability of spam, phishing, malware, and promotional content. Stage 3 (user-level): personalized models that learn from individual user actions (marking as spam, moving to inbox). The system targets <0.1% false positive rate (legitimate email classified as spam) while blocking >99.9% of spam.
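
One way to combine the three stages is a weighted score with a high bar for the "spam" verdict. The weights and thresholds below are purely illustrative assumptions, not Gmail's actual values:

```python
def score_email(ip_reputation, content_score, user_prior):
    """
    ip_reputation: connection-level score in [0, 1], 1 = known-bad sender
    content_score: content classifier spam probability in [0, 1]
    user_prior:    personalized adjustment in [-0.2, 0.2]
                   (negative if this user tends to rescue this sender's mail)
    """
    score = 0.3 * ip_reputation + 0.7 * content_score + user_prior
    score = min(max(score, 0.0), 1.0)
    # A conservative threshold keeps false positives near the <0.1% target.
    verdict = "spam" if score >= 0.8 else "inbox"
    return verdict, score
```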

Search Index

Email search is powered by a distributed inverted index built on a system similar to Google's Percolator. Every email is tokenized, stemmed, and indexed by: sender, recipient, subject words, body text, attachment filenames, labels, and date ranges. The index is partitioned per-user (not globally) to enable account-scoped search. Each user's index shard is stored alongside their mailbox data. For users with millions of emails, the index uses a tiered structure: recent emails (last 30 days) in a hot in-memory index, older emails in an on-disk compressed index. Query execution uses BM25 ranking with recency and engagement (starred, replied) boosts.
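
The per-user inverted index can be sketched as a token-to-postings map with AND semantics; stemming, BM25 ranking, and the hot/cold tier split are omitted for brevity:

```python
import re
from collections import defaultdict

class UserIndex:
    """Minimal per-user inverted index: token -> set of email ids."""
    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, email_id, text):
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            self.postings[token].add(email_id)

    def search(self, query):
        """AND semantics: an email must contain every query token."""
        token_sets = [self.postings[t]
                      for t in re.findall(r"[a-z0-9]+", query.lower())]
        if not token_sets:
            return set()
        return set.intersection(*token_sets)
```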

Conversation Threading

Threading groups related emails into conversations. The algorithm uses the In-Reply-To and References email headers (RFC 2822) as the primary signal. When headers are missing or broken (common with mailing lists), a secondary heuristic matches on subject line (stripped of Re:/Fwd: prefixes) combined with participant overlap. Each conversation is assigned a thread_id; new emails matching an existing thread are appended. The thread view shows emails in chronological order with quoted text collapsed. Thread metadata (last_message_date, participant_list, snippet) is denormalized into the mailbox index for efficient inbox rendering.
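
The header-first, subject-fallback logic can be sketched as follows (a simplification: the participant-overlap check is omitted, and thread state is held in plain dicts):

```python
import re

threads = {}        # normalized subject -> thread_id
msg_thread = {}     # message_id -> thread_id
_next_id = [1]

def normalize_subject(subject):
    # strip any number of leading "Re:" / "Fwd:" prefixes, case-insensitively
    return re.sub(r"^(\s*(re|fwd?):\s*)+", "", subject, flags=re.I).strip().lower()

def assign_thread(message_id, subject, in_reply_to=None, references=()):
    # primary signal: In-Reply-To / References pointing at a threaded message
    for ref in ([in_reply_to] if in_reply_to else []) + list(references):
        if ref in msg_thread:
            msg_thread[message_id] = msg_thread[ref]
            return msg_thread[ref]
    # fallback: normalized subject match (real systems also check participants)
    key = normalize_subject(subject)
    if key not in threads:
        threads[key] = _next_id[0]
        _next_id[0] += 1
    msg_thread[message_id] = threads[key]
    return threads[key]
```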

Database Design

The mailbox store uses a wide-column database (Bigtable/HBase) with row key {user_id}#{email_id}. Columns include: subject, from, to, cc, bcc, date, thread_id, labels (set), snippet (first 200 chars), body_ref (pointer to blob store), attachment_refs (list of blob keys), is_read, is_starred, spam_score. The row key design enables efficient scans: listing all emails for a user sorted by date is a simple range scan. Column families separate frequently-read metadata (subject, from, date) from rarely-read body content.
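
The row-key design can be illustrated over a sorted list standing in for Bigtable/HBase. One common trick (an assumption here, not stated above) is to use a zero-padded inverted timestamp as the email_id, so a plain range scan returns newest-first:

```python
import bisect

rows = []            # sorted list of (row_key, metadata) pairs
MAX_TS = 10**10

def row_key(user_id, ts):
    inverted = MAX_TS - ts                    # newer emails sort first
    return f"{user_id}#{inverted:010d}"

def put(user_id, ts, metadata):
    bisect.insort(rows, (row_key(user_id, ts), metadata))

def list_mailbox(user_id, limit=50):
    """Range scan: all keys with prefix '{user_id}#', newest first."""
    prefix = f"{user_id}#"
    lo = bisect.bisect_left(rows, (prefix, ""))
    out = []
    for key, meta in rows[lo:]:
        if not key.startswith(prefix) or len(out) >= limit:
            break
        out.append(meta)
    return out
```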

Attachment blobs are stored in a content-addressed blob store (SHA-256 hash as the key). A reference counting system tracks how many emails reference each blob; when the count reaches zero (all referencing emails deleted and trash emptied), the blob is garbage collected. Email search indexes use a custom inverted index format stored per-user, with the index itself residing in the same distributed storage system as the mailbox data.
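
The reference-counting lifecycle can be sketched as follows: each referencing email bumps the count, and the blob is collected only when the last reference is dropped (a minimal in-memory sketch; real GC would be asynchronous and crash-safe):

```python
class RefCountedBlobStore:
    def __init__(self):
        self.blobs = {}   # key -> data
        self.refs = {}    # key -> number of emails referencing the blob

    def add_reference(self, key, data=None):
        if key not in self.blobs:
            self.blobs[key] = data
            self.refs[key] = 0
        self.refs[key] += 1

    def drop_reference(self, key):
        self.refs[key] -= 1
        if self.refs[key] == 0:   # last email deleted and trash emptied
            del self.blobs[key]
            del self.refs[key]
```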

API Design

  • POST /api/v1/messages/send — Send email: {to, cc?, bcc?, subject, body_html, body_text, attachment_ids?}; returns message_id and thread_id
  • GET /api/v1/messages?label={label}&page_token={token}&max_results=50 — List messages in a label (inbox, sent, custom) with cursor pagination
  • GET /api/v1/messages/search?q={query}&max_results=20 — Full-text search: supports operators like from:, to:, subject:, has:attachment, before:, after:
  • PATCH /api/v1/messages/{message_id} — Modify message metadata: {add_labels?, remove_labels?, is_read?, is_starred?}
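
The search operators listed above can be split from free-text terms with a small parser (a sketch: real parsing would also handle quoted phrases and date formats):

```python
OPERATORS = {"from", "to", "subject", "has", "before", "after"}

def parse_query(q):
    """Split a query into operator filters and plain search terms."""
    filters, terms = {}, []
    for token in q.split():
        op, _, value = token.partition(":")
        if value and op in OPERATORS:
            filters.setdefault(op, []).append(value)
        else:
            terms.append(token)   # not an operator: treat as free text
    return filters, terms
```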

Scaling & Bottlenecks

The spam filter is the highest-throughput component, processing every inbound email in real time. Running a transformer model on 3.47M emails/sec requires a massive GPU inference fleet. Gmail uses a distilled model for initial scoring (runs on CPU, handles 90% of decisions) and a full model only for borderline cases (spam score between 0.3 and 0.7). This two-tier approach reduces GPU costs by 80%. Model updates are deployed continuously using a canary rollout — new models are tested on 1% of traffic for 24 hours before full deployment.
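
The two-tier routing reduces to a few lines of control flow. The model functions are stand-ins (both read a precomputed field here; in reality the cheap and full models would score independently), and the thresholds mirror the 0.3–0.7 borderline band above:

```python
def cheap_score(email):   # distilled model, CPU (toy stand-in)
    return email["spamminess"]

def full_score(email):    # full transformer, GPU (toy stand-in)
    return email["spamminess"]

def classify_tiered(email, lo=0.3, hi=0.7, threshold=0.5):
    s = cheap_score(email)
    tier = "cheap"
    if lo <= s <= hi:     # borderline: escalate to the expensive model
        s = full_score(email)
        tier = "full"
    return ("spam" if s >= threshold else "ham"), tier
```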

Search index maintenance at this scale is challenging. Every email write triggers an index update; 3.47M index writes/sec requires a high-throughput indexing pipeline. The solution is batched asynchronous indexing: emails are written to the mailbox immediately, and index updates are batched and applied every 5 seconds per user shard. This means a newly received email may not appear in search results for up to 5 seconds — an acceptable trade-off. Index compaction (merging small segments into larger ones) runs as a background process during off-peak hours.
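
The batched asynchronous indexing pattern can be sketched with a write buffer and a bulk flush; here the flush is manual, standing in for the ~5-second per-shard timer described above:

```python
from collections import defaultdict

index = defaultdict(set)   # token -> email ids (the searchable state)
pending = []               # buffered (email_id, text) index writes

def write_email(email_id, text):
    # The mailbox write is immediate; the index update is only buffered.
    pending.append((email_id, text))

def flush():
    """Batch-apply pending index updates (runs every ~5s per user shard)."""
    while pending:
        email_id, text = pending.pop()
        for token in text.lower().split():
            index[token].add(email_id)

def search(token):
    return index[token.lower()]
```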

Key Trade-offs

  • Per-user index partitioning over global index: Per-user indexes eliminate the need for access control in search queries and enable per-account storage quotas, but make cross-account search (admin/compliance tools) more complex
  • Content-addressed attachment storage: Deduplication saves ~30% storage for enterprise accounts where the same files are emailed to many recipients, but requires reference counting and garbage collection, adding operational complexity
  • Asynchronous search indexing over synchronous: Batching index updates reduces write amplification and I/O, but introduces a 5-second window where new emails are not searchable
  • Multi-tier spam filtering (CPU + GPU): Running the full ML model only on borderline cases reduces compute costs dramatically, but introduces a small risk of misclassification for emails that the lightweight model confidently miscategorizes
