System Design: Amazon S3 (Object Storage)
Design a highly durable, scalable object storage system like Amazon S3 that can store exabytes of data with 11 nines of durability. Covers erasure coding, consistent hashing, and storage node architecture.
Requirements
Functional Requirements:
- Store and retrieve objects of any size (1 byte to 5 TB) via PUT/GET/DELETE
- Organize objects into buckets with unique keys
- Support multipart upload for large objects (recommended above 100 MB, required above 5 GB)
- Versioning support — maintain multiple versions of the same key
- Lifecycle policies — auto-expire or transition objects to cheaper storage tiers
- Server-side encryption (SSE) with customer or managed keys
Non-Functional Requirements:
- 11 nines of durability (99.999999999%)
- 4 nines of availability (99.99%)
- Exabyte-scale storage capacity
- Support millions of objects per bucket
- Sub-100ms latency for small object GET operations
- Throughput: sustain 1 TB/s aggregate writes across the system
Scale Estimation
Amazon S3 stores over 300 trillion objects (as of 2023). Assuming an average object size of 100 KB, that's 30 exabytes of raw data. For 11 nines of durability the design uses (14, 4) Reed-Solomon erasure coding: each object is split into 14 data chunks, 4 parity chunks are computed from them, and all 18 chunks are stored on different storage nodes spread across availability zones. Stored data is therefore 18/14 ≈ 1.29x the raw object size. At 30 exabytes raw, total physical storage is ~39 exabytes; with 12 TB HDDs, that's ~3.25 million drives.
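A quick back-of-envelope check of these numbers (a sketch; all inputs are the estimates stated above, not published AWS figures):

```python
# Back-of-envelope for the scale estimation above.
objects = 300e12          # ~300 trillion objects
avg_size = 100e3          # 100 KB average object size
raw = objects * avg_size  # 3.0e19 bytes = 30 EB

rs_overhead = 18 / 14     # (14, 4) Reed-Solomon: 18 shards stored per 14 data shards
physical = raw * rs_overhead   # ~3.86e19 bytes ~ 39 EB
drives = physical / 12e12      # 12 TB HDDs -> ~3.2 million drives

print(f"raw: {raw/1e18:.0f} EB, physical: {physical/1e18:.1f} EB, drives: {drives/1e6:.2f}M")
```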
High-Level Architecture
S3's architecture separates metadata management from data storage. The control plane handles bucket operations, IAM policy enforcement, and object metadata (key-to-location mapping). The data plane handles actual byte read/write operations across a fleet of storage nodes. A front-end service (load-balanced fleet of API servers) accepts client requests, validates credentials via STS/IAM, routes PUT operations to a data coordinator, and routes GET operations directly to the appropriate storage nodes via metadata lookup.
Buckets are logical namespaces. Bucket metadata (owner, region, policy, versioning config) is stored in a globally distributed metadata service (similar to DynamoDB, backed by Paxos/Raft consensus). Object metadata (key → list of chunk locations) is stored in a partitioned metadata store. For a PUT operation, the front-end assigns a unique object ID, selects 18 storage nodes spread across availability zones (placed by consistent hashing on the object ID), splits the object into 14 data chunks, computes 4 Reed-Solomon parity chunks, and distributes the 18 chunks to the selected nodes. Metadata is written to the metadata store only after all 18 chunks are durably stored.
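A runnable toy of this write path, with the Reed-Solomon encode stubbed out so the ordering guarantee stays visible (node selection and parity math are elided; all names are illustrative, not S3 internals):

```python
# Toy PUT pipeline: shard, "encode", store all 18 chunks, then commit metadata.
import uuid

RS_DATA, RS_PARITY = 14, 4
nodes = {i: {} for i in range(RS_DATA + RS_PARITY)}  # toy storage nodes
metadata_store = {}                                  # (bucket, key) -> record

def put_object(bucket: str, key: str, data: bytes) -> str:
    object_id = uuid.uuid4().hex
    size = -(-len(data) // RS_DATA)                  # ceil-divide into 14 shards
    shards = [data[i * size:(i + 1) * size] for i in range(RS_DATA)]
    shards += [b"<parity>"] * RS_PARITY              # stand-in for real RS parity

    # Durability rule from the text: every shard is durably stored BEFORE
    # the metadata commit makes the object visible to readers.
    for node_id, shard in enumerate(shards):
        nodes[node_id][(object_id, node_id)] = shard

    metadata_store[(bucket, key)] = {
        "object_id": object_id,
        "chunk_locations": [(n, n) for n in range(len(shards))],
    }
    return object_id

put_object("photos", "cat.jpg", b"x" * 1000)
```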
For GET operations, the front-end looks up the object metadata (chunk locations), sends parallel GET requests to 14 storage nodes (only data chunks needed in the happy path), reassembles the chunks in order, and streams the object to the client. If any storage node is slow (timeout > 50ms), the front-end hedges by also requesting from a parity chunk node and reconstructing the missing data chunk. This hedge strategy keeps tail latency low without sacrificing availability.
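A minimal sketch of that hedging logic with simulated shard fetches (timings and names are illustrative assumptions; the Reed-Solomon reconstruction step after a hedged parity read is elided):

```python
# Hedged read: wait 50 ms on the primary shard, then race a second request.
import concurrent.futures as cf
import random
import time

HEDGE_TIMEOUT = 0.050  # hedge after 50 ms, as in the text

def fetch_shard(node_id: int) -> bytes:
    time.sleep(random.uniform(0.005, 0.120))  # simulated storage-node latency
    return f"shard-{node_id}".encode()

def hedged_fetch(data_node: int, parity_node: int) -> bytes:
    with cf.ThreadPoolExecutor(max_workers=2) as pool:
        primary = pool.submit(fetch_shard, data_node)
        try:
            return primary.result(timeout=HEDGE_TIMEOUT)  # happy path
        except cf.TimeoutError:
            # Primary is slow: issue a hedged request to a parity node and take
            # whichever answer lands first. (A real system would then RS-reconstruct
            # the missing data chunk from the parity shard + 13 other data shards.)
            hedge = pool.submit(fetch_shard, parity_node)
            done, _ = cf.wait({primary, hedge}, return_when=cf.FIRST_COMPLETED)
            return done.pop().result()

print(hedged_fetch(data_node=3, parity_node=17))
```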
Core Components
Erasure Coding (Reed-Solomon)
Reed-Solomon erasure coding provides durability beyond 3-way replication at lower storage overhead. With a (14, 4) RS code, the original data is split into 14 equal shards, and 4 parity shards are computed such that any 14 of the 18 total shards suffice to reconstruct the original data. Storage overhead is 18/14 ≈ 1.29x (vs. 3x for replication). Durability calculation: with 18 shards across 18 failure domains (different racks/AZs), an object is lost only if 5+ of its shards fail before repair completes. With an annual drive failure rate of 0.5% and a repair window of hours, the probability of 5 overlapping shard failures is astronomically low — achieving 11 nines of durability.
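A rough version of that durability estimate (assumptions: the 0.5% AFR above and the 6-hour repair window from the Scaling section; treating each repair window independently gives a crude upper bound):

```python
# Probability of losing 5+ of 18 shards within a single repair window.
from math import comb

n, tolerated = 18, 4          # 18 shards, survives up to 4 losses
afr = 0.005                   # 0.5% annual drive failure rate
mttr_years = 6 / (24 * 365)   # 6-hour repair window
p = afr * mttr_years          # P(a given shard fails within one window)

p_loss_window = sum(comb(n, k) * p**k * (1 - p)**(n - k)
                    for k in range(tolerated + 1, n + 1))
annual_loss = p_loss_window / mttr_years   # ~1460 windows per year
print(f"annual object-loss probability ~ {annual_loss:.3e}")  # ~1e-20, far past 11 nines
```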
Consistent Hashing Ring
Storage nodes are organized on a consistent hashing ring. Each node is assigned multiple virtual nodes (vnodes) on the ring to ensure even load distribution even when real nodes differ in capacity. A new object's chunk is stored on the node whose vnode immediately follows the chunk's hash on the ring. When a storage node is added or removed, only the objects whose chunks were on that node's vnodes are redistributed, minimizing data movement. Ring state is maintained by a cluster coordinator (ZooKeeper or etcd) and gossiped to all front-end nodes for local routing decisions.
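A minimal ring with virtual nodes, as a sketch of the placement rule (illustrative; not S3's actual placement code):

```python
# Consistent-hash ring: each physical node owns many vnode points; a chunk
# lands on the first vnode clockwise from its hash.
import bisect
import hashlib

def h(key: str) -> int:
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class Ring:
    def __init__(self, nodes: list[str], vnodes: int = 128):
        self._points = sorted((h(f"{n}#{v}"), n) for n in nodes for v in range(vnodes))
        self._keys = [p for p, _ in self._points]

    def node_for(self, chunk_id: str) -> str:
        i = bisect.bisect_right(self._keys, h(chunk_id)) % len(self._keys)
        return self._points[i][1]

ring = Ring([f"node-{i}" for i in range(12)])
print(ring.node_for("objid-42/shard-3"))
```

Adding a node only claims the key ranges immediately counter-clockwise of its vnodes, which is why data movement stays proportional to the capacity added rather than to the whole fleet.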
Multipart Upload
For objects >5 GB (or for efficient uploads over slow connections), multipart upload splits the object into parts (5 MB–5 GB each), uploaded independently. The client initiates a multipart upload session (receives an upload_id), uploads parts in parallel (each returning an ETag), and completes the upload with a manifest of (part_number, ETag) pairs. S3 assembles the parts, computes the final erasure-coded chunks, and atomically commits the object metadata. Incomplete multipart uploads are automatically cleaned up after 7 days via a lifecycle rule. Parts are stored in temporary storage nodes until the complete operation is received.
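Client-side, the flow looks like this with boto3 (bucket, key, and part size are placeholders; parts are uploaded sequentially here for brevity, though real clients parallelize them — boto3's higher-level upload_file does all of this automatically):

```python
import boto3

def read_chunks(path: str, size: int):
    with open(path, "rb") as f:
        while chunk := f.read(size):
            yield chunk

s3 = boto3.client("s3")
upload = s3.create_multipart_upload(Bucket="my-bucket", Key="big.bin")
upload_id = upload["UploadId"]

parts = []
for n, chunk in enumerate(read_chunks("big.bin", 8 * 1024 * 1024), start=1):
    resp = s3.upload_part(Bucket="my-bucket", Key="big.bin",
                          PartNumber=n, UploadId=upload_id, Body=chunk)
    parts.append({"PartNumber": n, "ETag": resp["ETag"]})  # manifest entry

s3.complete_multipart_upload(Bucket="my-bucket", Key="big.bin", UploadId=upload_id,
                             MultipartUpload={"Parts": parts})
```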
Database Design
Object metadata is stored in a distributed key-value store partitioned by (bucket_name + object_key) hash. Each record stores: object_id (UUID), bucket_name, object_key, version_id, content_type, content_length, etag (MD5 or SHA-256), storage_class, encryption_key_ref, acl, created_at, last_modified, and chunk_locations (array of 18 (node_id, shard_index) pairs). Versioning is implemented by allowing multiple records for the same (bucket, key) with different version_ids; a separate version index maintains the ordered list of versions per key. The metadata store uses Raft for strong consistency — object metadata is never lost even if the metadata leader fails.
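The record sketched as a dataclass (field names follow the text; types are illustrative assumptions):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ObjectMetadata:
    object_id: str                 # UUID assigned at PUT time
    bucket_name: str
    object_key: str
    version_id: str
    content_type: str
    content_length: int
    etag: str                      # MD5 or SHA-256 of the object bytes
    storage_class: str             # e.g. STANDARD, GLACIER
    encryption_key_ref: str | None
    acl: dict
    created_at: datetime
    last_modified: datetime
    chunk_locations: list[tuple[str, int]]  # 18 (node_id, shard_index) pairs
```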
API Design
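A sketch of the REST surface implied by the operations above (paths follow S3's public HTTP API; authentication headers, request bodies, and error responses omitted):
- PUT /{bucket}/{key} — upload an object in a single request
- GET /{bucket}/{key} — download an object (supports Range reads)
- DELETE /{bucket}/{key} — delete an object, or write a delete marker when versioning is enabled
- GET /{bucket}?list-type=2&prefix=... — list keys under a prefix (paginated)
- POST /{bucket}/{key}?uploads — initiate a multipart upload; returns an upload_id
- PUT /{bucket}/{key}?partNumber=N&uploadId=... — upload one part; returns an ETag
- POST /{bucket}/{key}?uploadId=... — complete a multipart upload with the (part_number, ETag) manifest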
Scaling & Bottlenecks
The metadata store is the primary scaling bottleneck. S3's metadata service must handle millions of PUT/GET metadata operations per second. Key optimizations: (1) partition metadata by consistent hash of (bucket + key) across thousands of metadata nodes; (2) use an LSM-tree-based storage engine (LevelDB/RocksDB) optimized for write-heavy workloads; (3) cache hot metadata (frequently accessed keys) in a distributed DRAM cache with a 1-second TTL. Bucket listing (LIST) is notoriously slow at scale because it requires a range scan over all keys under a prefix; S3 historically recommended randomized key prefixes to spread this load, and since 2018 it scales request capacity per prefix automatically.
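A toy of optimization (3), a read-through cache with a 1-second TTL (illustrative only; the real thing is a distributed DRAM tier, not an in-process dict):

```python
import time

class TTLCache:
    def __init__(self, ttl: float = 1.0):
        self.ttl, self._data = ttl, {}

    def get(self, key, load):
        hit = self._data.get(key)
        if hit and time.monotonic() - hit[1] < self.ttl:
            return hit[0]              # fresh cache hit
        value = load(key)              # miss or stale: read through to the store
        self._data[key] = (value, time.monotonic())
        return value

cache = TTLCache(ttl=1.0)
meta = cache.get(("my-bucket", "photos/cat.jpg"),
                 load=lambda k: {"object_id": "obj-123", "chunk_locations": []})
```

The short TTL bounds staleness to one second, which matters here because the cache sits in front of a strongly consistent store.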
Storage node failures are handled gracefully by the erasure coding properties. When a node goes offline, the cluster coordinator marks it as unavailable and triggers a background reconstruction job: for each object with a shard on the failed node, the cluster fetches any 14 surviving shards, reconstructs the missing shard, and writes it to a replacement node. Reconstruction throughput is rate-limited to avoid saturating surviving nodes. With a 0.5% annual drive failure rate (as above) and an MTTR of 6 hours via automatic reconstruction, data durability remains well above 11 nines.
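A back-of-envelope check on the repair path (inputs are this article's figures; assumes the failed drive was full):

```python
# Rebuilding one failed 12 TB drive within the 6-hour MTTR target.
failed_bytes = 12e12                     # one full 12 TB drive
read_amplification = 14                  # 14 surviving shards read per shard rebuilt
total_read = failed_bytes * read_amplification   # 168 TB of reads
mttr_seconds = 6 * 3600
throughput = total_read / mttr_seconds   # aggregate read rate required
print(f"aggregate reconstruction read rate ~ {throughput/1e9:.1f} GB/s")  # ~7.8 GB/s
```

Spread across thousands of surviving nodes, ~7.8 GB/s is a few MB/s per node, which is why the rate-limited background repair barely dents foreground traffic.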
Key Trade-offs
- Erasure coding vs. replication: RS codes achieve higher durability at lower storage overhead but require reconstruction CPU on node failures; replication is simpler and faster for small objects
- Strong vs. eventual consistency: S3 now offers strong read-after-write consistency for all operations (since 2020), achieved via version fencing in the metadata layer — this adds complexity but eliminates confusing stale-read scenarios for users
- Storage tiers (S3 Standard vs. Glacier): Moving objects to Glacier reduces cost 5x but increases retrieval latency from ms to hours; lifecycle policies automate the transition based on access patterns
- Multipart threshold: a lower multipart threshold (recommended above 100 MB, required above 5 GB) parallelizes uploads and enables resumability, but adds state management overhead for tracking in-flight parts; the threshold is a balance between simplicity and upload efficiency