System Design: Google Drive

Design a cloud file storage and collaboration platform like Google Drive supporting real-time collaborative editing, file sharing, and cross-device sync at billion-user scale. Covers OT/CRDT for collaboration and tiered storage.

15 min read · Updated Jan 15, 2025
Tags: system-design, google-drive, cloud-storage, ot, crdt, collaboration, file-sync

Requirements

Functional Requirements:

  • Store, organize, and retrieve any file type up to 5 TB
  • Real-time collaborative editing for Google Docs, Sheets, and Slides
  • File sharing with granular permissions (viewer, commenter, editor)
  • Version history with restore capability (30 days for free, unlimited for Workspace)
  • Cross-device synchronization (web, desktop, mobile)
  • Full-text search within documents

Non-Functional Requirements:

  • 99.99% availability
  • Support 3 billion users with 15 GB free storage each
  • Collaborative editing: <100ms latency for operation propagation
  • File upload/download throughput: sustain 10 TB/s aggregate
  • GDPR compliance — user data stored in user's region

Scale Estimation

Three billion users at 15 GB average storage comes to 45 exabytes of raw data. After compression and deduplication (documents deduplicate heavily), physical storage is roughly 15–20 exabytes. Google Docs alone has around 1 billion daily active users. For collaboration, assume 100 million concurrent editing sessions at peak, each generating about 10 operations per second on average, for a peak of 1 billion operations per second. Operations are small (typically 10–100 bytes for an OT delta); at 100 bytes average, peak operation bandwidth is roughly 100 GB/s through the collaboration pipeline.
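
These figures check out with a quick back-of-the-envelope calculation (decimal units; the inputs are the assumptions stated above):

  # Back-of-the-envelope check of the scale estimates above.
  USERS = 3_000_000_000
  AVG_STORAGE_BYTES = 15 * 10**9                          # 15 GB per user
  print(USERS * AVG_STORAGE_BYTES / 10**18, "EB raw")     # 45.0 EB

  SESSIONS = 100_000_000                                  # concurrent editing sessions at peak
  OPS_PER_SEC = 10                                        # per session
  OP_BYTES = 100                                          # average OT delta size
  peak_ops = SESSIONS * OPS_PER_SEC                       # 1e9 ops/sec
  print(peak_ops * OP_BYTES / 10**9, "GB/s op bandwidth") # 100.0 GB/s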

High-Level Architecture

Google Drive combines a file storage service (for binary files: images, PDFs, videos) with a collaboration service (for native Google documents). The file storage service is similar to a simplified S3 — chunked, deduplicated, erasure-coded. The collaboration service handles real-time Operational Transformation (OT) for Google Docs/Sheets/Slides, maintaining document state as a sequence of operations rather than complete file snapshots. A metadata service manages the file tree (folder hierarchy, sharing permissions, ownership). A search service indexes document content for Drive search.

For binary file storage, the architecture mirrors S3: files are chunked (block size: 256 KB to 2 MB), SHA-256 hashed, deduplicated, and stored with Reed-Solomon erasure coding across multiple Colossus storage clusters. File metadata (chunk list, permissions, version history) is stored in Spanner for global consistency with strong read guarantees. File sync uses a change journal: each modification is recorded as a journal entry (file_id, version, change_type, chunk_diff). Clients poll or subscribe (via WebSocket) to the change journal to detect remote modifications.
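
A minimal sketch of that chunk-and-journal flow; chunk_exists, upload_chunk, and append_journal_entry are hypothetical stand-ins for the block store and journal services:

  # Sketch: fixed-size chunking with SHA-256 content addressing, plus a change-journal entry.
  import hashlib

  CHUNK_SIZE = 2 * 1024 * 1024   # 2 MB upper bound from the text

  def chunk_file(path):
      """Split a file into chunks and return the ordered list of chunk hashes (the manifest)."""
      manifest = []
      with open(path, "rb") as f:
          while chunk := f.read(CHUNK_SIZE):
              digest = hashlib.sha256(chunk).hexdigest()
              manifest.append(digest)
              if not chunk_exists(digest):        # dedup: only upload unknown blocks
                  upload_chunk(digest, chunk)
      return manifest

  def record_change(file_id, version, manifest, prev_manifest):
      """Append a change-journal entry that sync clients poll or subscribe to."""
      diff = [h for h in manifest if h not in set(prev_manifest)]
      append_journal_entry({
          "file_id": file_id,
          "version": version,
          "change_type": "content",
          "chunk_diff": diff,
      })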

For collaborative documents, the system uses Operational Transformation (OT). Each editing operation (insert character at position, delete range, apply formatting) is represented as an OT operation. When two users edit simultaneously, the OT algorithm transforms the concurrent operations against each other so that both replicas converge on the same state and both edits are preserved without conflicts. A central OT server serializes all concurrent operations from all collaborators, assigns each a global sequence number, transforms it, and broadcasts it to all participants. The document state at any point is the result of applying all operations in sequence order from the beginning.
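
To make the transformation concrete, here is the classic insert-versus-insert case as a small sketch (character operations only; real Docs operations also cover deletes, formatting, and structural edits):

  # Sketch: two clients insert at the same position; transforming one op against
  # the other preserves both edits and both replicas converge. A tie at equal
  # positions is broken by whichever op the server sequenced first.
  from dataclasses import dataclass

  @dataclass
  class Insert:
      pos: int
      text: str

  def apply(doc, op):
      return doc[:op.pos] + op.text + doc[op.pos:]

  def transform(op, against, against_first):
      """Rewrite `op` so it applies correctly after `against`; `against_first` breaks ties."""
      if against.pos < op.pos or (against.pos == op.pos and against_first):
          return Insert(op.pos + len(against.text), op.text)
      return op

  doc = "shared doc"
  a, b = Insert(0, "A:"), Insert(0, "B:")                  # concurrent edits on the same state
  server_view = apply(apply(doc, a), transform(b, a, against_first=True))
  client_b_view = apply(apply(doc, b), transform(a, b, against_first=False))
  assert server_view == client_b_view == "A:B:shared doc"  # both replicas converge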

Core Components

Operational Transformation Engine

OT maintains a total order of all operations on a document, with the OT server assigning sequence numbers. When client A sends op A (based on state S) and client B sends op B (also based on state S), the server serializes them: say A comes first (seq N); then B must be transformed against A before being applied, producing B'. Applying A followed by B' on the server yields the same document that client B reaches after applying its own B followed by the transformed A' it receives back, which is the convergence property that makes OT correct. The transformation function is defined per operation type. Google's Wave OT protocol and the Jupiter algorithm are foundational references. State is checkpointed periodically (every 1,000 ops) to avoid replaying the full operation history on reconnect.
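
A sketch of that serialization path under the assumptions above, reusing the transform/apply helpers from the earlier sketch; persist_op, broadcast, and save_snapshot are hypothetical stand-ins for the storage and fan-out layers:

  # Sketch: the OT server's per-document serialization path.
  CHECKPOINT_EVERY = 1_000        # checkpoint cadence from the text

  class DocSession:
      def __init__(self, doc_id, snapshot, seq):
          self.doc_id = doc_id
          self.state = snapshot   # document state at sequence number `seq`
          self.seq = seq          # last globally assigned sequence number
          self.log = []           # ops applied since the last checkpoint

      def submit(self, client_op, client_seq):
          """Serialize an op from a client that had seen ops up to `client_seq`."""
          base = self.seq - len(self.log)               # seq at the start of self.log
          op = client_op
          for earlier in self.log[client_seq - base:]:  # transform against every unseen op
              op = transform(op, earlier, against_first=True)
          self.seq += 1
          self.state = apply(self.state, op)
          self.log.append(op)
          persist_op(self.doc_id, self.seq, op)         # durable op log
          broadcast(self.doc_id, self.seq, op)          # fan out to all collaborators
          if self.seq % CHECKPOINT_EVERY == 0:          # avoid replaying full history on reconnect
              save_snapshot(self.doc_id, self.seq, self.state)
              self.log.clear()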

Tiered Storage

Drive uses a three-tier storage architecture: a hot tier (recently accessed files, stored on SSDs in Colossus) for low-latency access; a warm tier (files accessed in the last 90 days, on HDDs) for cost-efficient bulk storage; and a cold tier (Nearline/Coldline storage, tape-backed) for archived files not accessed in 90+ days. Tier transitions are automatic, driven by access-frequency heuristics: hot → warm when a file has not been accessed in 30 days, warm → cold at 90 days. Cold-file retrieval takes 2–10 seconds because the object must be staged back from slower media, which is acceptable for infrequently accessed files.
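
A sketch of the tiering heuristic with the 30- and 90-day thresholds above; move_to_tier is a hypothetical helper:

  # Sketch: periodic tiering sweep driven by last-access age (timestamps are UTC-aware).
  from datetime import datetime, timedelta, timezone

  HOT_TO_WARM = timedelta(days=30)
  WARM_TO_COLD = timedelta(days=90)

  def target_tier(last_accessed, now=None):
      age = (now or datetime.now(timezone.utc)) - last_accessed
      if age < HOT_TO_WARM:
          return "hot"        # SSD in Colossus
      if age < WARM_TO_COLD:
          return "warm"       # HDD
      return "cold"           # Nearline/Coldline

  def tiering_sweep(files):
      for file_id, current_tier, last_accessed in files:
          desired = target_tier(last_accessed)
          if desired != current_tier:
              move_to_tier(file_id, desired)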

Sharing & Permission Service

Permissions are stored per (file_id, entity_id), where the entity is a user, group, or domain. Permission levels: Owner > Editor > Commenter > Viewer. Inheritance: files in a shared folder inherit the folder's permissions. Checking access to a deeply nested file requires walking the entire folder ancestry for inherited permissions, a potentially expensive operation that is cached in a permissions cache (Redis, TTL 30 seconds, invalidated on permission change). Public links generate a short, unguessable token that maps to a (file_id, permission_level) record. The Drive API enforces permissions on every file operation through a permission middleware that checks both explicit and inherited permissions.
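
A sketch of the inherited-permission check with the short-TTL cache; the role ranking, get_explicit_role, get_parent_folder, and the cache client are illustrative assumptions standing in for the Spanner reads and Redis:

  # Sketch: effective-permission lookup with folder-ancestry inheritance and a 30 s cache.
  ROLE_RANK = {"viewer": 1, "commenter": 2, "editor": 3, "owner": 4}

  def effective_role(file_id, principal_id):
      cache_key = f"perm:{file_id}:{principal_id}"
      if (cached := cache.get(cache_key)) is not None:
          return cached
      best, node = None, file_id
      while node is not None:                    # walk up the folder ancestry
          role = get_explicit_role(node, principal_id)
          if role and (best is None or ROLE_RANK[role] > ROLE_RANK[best]):
              best = role
          node = get_parent_folder(node)
      cache.set(cache_key, best, ex=30)          # TTL 30 s; also invalidated on permission change
      return best

  def authorize(file_id, principal_id, required):
      role = effective_role(file_id, principal_id)
      if role is None or ROLE_RANK[role] < ROLE_RANK[required]:
          raise PermissionError("access denied")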

Database Design

File metadata is stored in Spanner: (file_id, owner_id, parent_folder_id, name, mime_type, size, created_at, modified_at, trashed, version, storage_class, chunk_manifest_ref). The chunk manifest (list of chunk hashes) is stored separately in Bigtable (row key: file_id + version, value: serialized chunk list) to avoid bloating Spanner rows. Sharing records are in Spanner: (file_id, principal_id, principal_type, role, inherited_from_folder_id). OT operation log is in Bigtable (row key: doc_id, column qualifier: sequence_number, value: serialized OT op). Periodic snapshots (column family: "snapshots", qualifier: sequence_number) checkpoint document state.
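
A sketch of how those Bigtable keys might be laid out so that versions and sequence numbers scan in order; the padding width, separators, and the "ops" column-family name are assumptions:

  # Sketch: row-key construction for the Bigtable-backed tables above.
  def chunk_manifest_key(file_id: str, version: int) -> str:
      return f"{file_id}#{version:012d}"          # value: serialized chunk-hash list

  def ot_op_cell(doc_id: str, seq: int) -> dict:
      # One row per document; each op lands in its own zero-padded column qualifier,
      # and periodic checkpoints go to the separate "snapshots" column family.
      return {"row_key": doc_id, "family": "ops", "qualifier": f"{seq:012d}"}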

API Design
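
A representative API surface for the flows described above. These paths and fields are illustrative for this design sketch, not the actual Google Drive API:

  POST   /files?uploadType=resumable        start a resumable chunked upload; returns an upload URL
  PUT    {upload_url}                       upload file chunks (Content-Range per request)
  GET    /files/{file_id}                   file metadata; ?alt=media streams content
  PATCH  /files/{file_id}                   rename, move (parent_folder_id), trash/restore
  GET    /changes?page_token=...            change-journal feed consumed by sync clients
  POST   /files/{file_id}/permissions       grant viewer/commenter/editor to a user, group, or domain
  GET    /files/{file_id}/revisions         version history; restore a prior revision
  WS     /docs/{doc_id}/session             collaborative session: submit OT ops, receive broadcasts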

Scaling & Bottlenecks

The OT server is the single point of serialization for collaborative documents. A single OT server can handle roughly 50,000 operations/sec, so for documents with many simultaneous collaborators (e.g., 500 people editing a shared spreadsheet), operations pile up. Google shards the OT layer by document_id: each document is assigned to one OT server replica group. Within a replica group (typically 3 nodes with a Raft-elected leader), the leader serializes operations. Leader re-election on failure completes in 2–5 seconds, and during the election new operations are queued client-side and replayed on reconnect.
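
A sketch of how a frontend might route a document to its replica group; rendezvous hashing is one deterministic option and is an assumption here, not a documented Google mechanism:

  # Sketch: deterministic document -> OT replica-group assignment via rendezvous
  # (highest-random-weight) hashing. Group membership and Raft leadership are
  # managed elsewhere; this only picks the group.
  import hashlib

  REPLICA_GROUPS = [f"ot-group-{i}" for i in range(1024)]   # assumed group count

  def score(doc_id: str, group: str) -> int:
      digest = hashlib.sha256(f"{doc_id}|{group}".encode()).digest()
      return int.from_bytes(digest[:8], "big")

  def assign_group(doc_id: str) -> str:
      return max(REPLICA_GROUPS, key=lambda g: score(doc_id, g))

  # Every frontend computes the same assignment, so all collaborators on a
  # document reach the same replica group (and its Raft leader) without a
  # central lookup service.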

Storage efficiency depends critically on deduplication. Google documents are stored as operation logs, not file blobs: a 1 MB Google Doc is stored as thousands of OT operations totaling perhaps 100 KB of operation data, roughly an order of magnitude smaller than the equivalent .docx file. Binary files (images, videos) use content-addressed block storage with global dedup. Cross-user dedup (the same file uploaded by different users) requires careful privacy handling, since block hashes can reveal that two users hold the same content. Convergent encryption addresses this: each block is encrypted with a key derived from its own content hash, so identical plaintext blocks produce identical ciphertext and can still be deduplicated, while only users who already possess a block's content can derive the key to read it.
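
A minimal sketch of convergent encryption as described: the key (and, for determinism, the nonce) are derived from the block's content hash, so identical plaintext blocks always encrypt to identical ciphertext and dedup across users. It uses the cryptography package's AES-GCM; the block store is a stand-in dict:

  # Sketch: convergent encryption for cross-user block dedup.
  import hashlib
  from cryptography.hazmat.primitives.ciphers.aead import AESGCM

  block_store = {}   # stand-in for the content-addressed block store

  def store_block(plaintext: bytes) -> str:
      content_hash = hashlib.sha256(plaintext).digest()
      key = hashlib.sha256(b"key|" + content_hash).digest()         # 32-byte AES key from content
      nonce = hashlib.sha256(b"nonce|" + content_hash).digest()[:12]
      block_id = content_hash.hex()                                  # dedup handle
      if block_id not in block_store:                                # identical blocks collapse to one copy
          block_store[block_id] = AESGCM(key).encrypt(nonce, plaintext, None)
      return block_id   # the uploader records (block_id, key) in its chunk manifest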

Key Trade-offs

  • OT vs. CRDT for collaboration: OT requires a central server for operation serialization (simpler consistency, harder to scale); CRDTs are peer-to-peer (no central server needed) but are complex to implement and can produce unexpected merge results for rich document formats
  • Tiered storage vs. uniform SSD: All-SSD storage provides uniform low latency but is 5–10x more expensive; tiered storage matches cost to access frequency at the price of variable retrieval latency
  • Chunk size vs. dedup efficiency: Smaller chunks (64 KB) increase dedup ratio but require more metadata (more chunk hash lookups); larger chunks (4 MB) reduce metadata overhead at the cost of less precise dedup
  • Strong consistency vs. availability for permissions: Spanner's strongly consistent permissions prevent permission-check inconsistencies but limit write throughput for permission updates; the 30-second Redis cache introduces a brief window where revoked permissions may still be honored
