System Design: Blob Storage Service

Design a blob storage service optimized for storing large unstructured data like images, videos, and binary files with global CDN distribution and efficient streaming. Covers chunked uploads, resumable transfers, and geo-replication.


Requirements

Functional Requirements:

  • Upload and download binary blobs up to 200 GB
  • Resumable uploads — resume interrupted uploads from the last checkpoint
  • Global distribution — serve blobs from edge locations near users
  • Blob transcoding (image resize, video format conversion) on-demand
  • Access control: public, private, signed URL with expiry
  • Versioning and immutable storage (WORM — write once, read many)

Non-Functional Requirements:

  • Upload throughput: 100 GB/s aggregate globally
  • Download throughput: 1 TB/s aggregate via CDN
  • Durability: 99.999999999% (11 nines)
  • Availability: 99.99%
  • Global download latency: under 50ms from CDN edge for cached blobs

Scale Estimation

Assume 100 million users uploading an average of 10 MB/day = 1 PB/day new data. Over 1 year: 365 PB. With 3x replication, physical storage is 1,095 PB ≈ 1 exabyte. CDN edge nodes cache the most recently accessed 10% of blobs (36.5 PB total cached across CDN). With 500 CDN PoPs globally, each PoP caches ~73 TB. A typical CDN PoP has 2–5 PB of SSD storage, easily fitting 73 TB. CDN cache hit rate for popular content: 70–90%, dramatically reducing origin server bandwidth.
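
These numbers are easy to sanity-check. A quick back-of-envelope script mirroring the assumptions above:

```python
# Back-of-envelope capacity estimation (values mirror the assumptions above).
USERS = 100_000_000
UPLOAD_PER_USER_MB = 10          # per user per day
REPLICATION = 3
CDN_CACHE_FRACTION = 0.10        # most recently accessed 10% of blobs
POPS = 500

daily_pb = USERS * UPLOAD_PER_USER_MB / 1e9        # 1 MB = 1e-9 PB
yearly_pb = daily_pb * 365                          # logical data per year
physical_pb = yearly_pb * REPLICATION               # with 3x replication
cached_pb = yearly_pb * CDN_CACHE_FRACTION          # hot set at the edge
per_pop_tb = cached_pb * 1000 / POPS                # PB -> TB

print(f"new data/day:     {daily_pb:.0f} PB")       # 1 PB
print(f"logical/year:     {yearly_pb:.0f} PB")      # 365 PB
print(f"physical/year:    {physical_pb:.0f} PB")    # 1,095 PB ~ 1 EB
print(f"CDN cached total: {cached_pb:.1f} PB")      # 36.5 PB
print(f"per-PoP cache:    {per_pop_tb:.0f} TB")     # 73 TB
```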

High-Level Architecture

The system has three tiers: edge (CDN), regional origin clusters, and durable storage backends. The edge tier (CDN) serves read requests for cached blobs and accepts upload requests (which are proxied to the regional origin). Regional origin clusters handle upload ingestion, transcoding jobs, and cache miss reads from durable storage. Durable storage (similar to S3) provides the authoritative, erasure-coded, highly durable backing store. A metadata service tracks blob locations, access control, and transcoding status.

Upload flow: the client initiates an upload via a resumable upload protocol (the open TUS protocol or a custom scheme). The API server allocates an upload ID and returns upload URLs for each chunk (or a single resumable upload URL). The client uploads chunks in parallel; for example, a 200 GB file split into 100 MB chunks yields 2,000 chunks, uploaded 8 at a time over dedicated connections. Each chunk is written to a regional staging store (fast SSD, 3 replicas). Once all chunks are received, a completion request triggers assembly: the chunks are assembled into the final blob, erasure coded, and written to the durable backend. The metadata record is then updated to "available".

Download flow: client requests a blob URL. If the blob is public or the client has a valid signed URL, the request routes to the CDN edge. The CDN checks its cache: on hit, streams the blob directly to the client from edge SSD. On miss, the CDN fetches from the regional origin (which fetches from durable storage if needed), caches the blob at the edge, and streams to the client. Byte-range requests (HTTP Range header) enable partial downloads and video seeking without downloading the full blob.
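
As a concrete illustration, a byte-range download might look like this; a minimal sketch using Python's requests library, with a placeholder URL:

```python
import requests

# Hypothetical blob URL; the ?v= content-hash suffix is discussed below.
url = "https://cdn.example.com/blobs/video.mp4?v=abc123"

# Fetch bytes 0-1048575 (the first 1 MiB) -- e.g., a video player
# probing the container header before seeking.
resp = requests.get(url, headers={"Range": "bytes=0-1048575"}, timeout=30)

assert resp.status_code == 206            # 206 Partial Content on success
print(resp.headers.get("Content-Range"))  # e.g., "bytes 0-1048575/2147483648"
chunk = resp.content                      # exactly the requested byte range
```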

Core Components

Resumable Upload Protocol

The TUS resumable upload protocol uses HTTP PATCH requests with a Tus-Resumable header and Upload-Offset header. On interruption, the client queries the upload server for the current offset (HEAD request), then resumes sending from that byte position. The server appends incoming bytes to the partial upload in staging storage. Chunk-level checksums (MD5 or SHA-256 per 10 MB chunk) detect corruption during transfer. If a chunk is corrupted, only that chunk is retransmitted. The resumable upload session expires after 24 hours; the metadata service garbage-collects incomplete uploads via a TTL-based cleanup job.
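
A minimal sketch of the resume logic, assuming a tus-compliant server and an already-created upload URL (checksum verification and retries omitted):

```python
import requests

TUS_VERSION = "1.0.0"
CHUNK = 10 * 1024 * 1024  # 10 MB, matching the per-chunk checksum unit above

def resume_upload(upload_url: str, path: str) -> None:
    # Ask the server how many bytes it already has (tus HEAD request).
    head = requests.head(upload_url, headers={"Tus-Resumable": TUS_VERSION})
    offset = int(head.headers["Upload-Offset"])

    with open(path, "rb") as f:
        f.seek(offset)  # resume from the last byte the server confirmed
        while data := f.read(CHUNK):
            resp = requests.patch(
                upload_url,
                data=data,
                headers={
                    "Tus-Resumable": TUS_VERSION,
                    "Upload-Offset": str(offset),
                    "Content-Type": "application/offset+octet-stream",
                },
            )
            resp.raise_for_status()
            # Trust the server's confirmed offset, not a local byte count.
            offset = int(resp.headers["Upload-Offset"])
```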

CDN Integration & Cache Invalidation

Blobs are cached at CDN edges using a pull-through model: on first request, the CDN fetches from origin and caches. Cache-Control headers set TTLs (immutable blobs: 1-year TTL; mutable blobs with versions: 1-hour TTL). Cache invalidation for updated blobs uses URL versioning: the blob URL includes a content hash suffix (e.g., /blobs/img.jpg?v=abc123), and a new upload creates a new URL. The old URL remains valid and continues to be served from cache until its TTL lapses or it is evicted. For emergency invalidation (DMCA takedowns, security incidents), a CDN purge API sends invalidation requests to all 500 PoPs globally; propagation completes in 30–60 seconds.
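
The TTL policy reduces to a simple header decision at the origin; a sketch using standard HTTP Cache-Control semantics:

```python
def cache_headers(blob_is_immutable: bool) -> dict:
    """Cache-Control for a blob response, per the TTL policy above."""
    if blob_is_immutable:
        # Content-hash-versioned URL never changes, so cache aggressively.
        return {"Cache-Control": "public, max-age=31536000, immutable"}
    # Mutable blob: short TTL so new versions propagate within an hour.
    return {"Cache-Control": "public, max-age=3600"}
```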

Transcoding Pipeline

Media blobs trigger automatic transcoding jobs on upload. An image uploaded as a 20 MB RAW file is automatically converted to JPEG, WebP, and AVIF at multiple resolutions (thumbnail 100px, medium 800px, full 1920px). A video uploaded as 4K H.264 is transcoded to H.265, VP9, and AV1 at 1080p, 720p, and 480p. Transcoding jobs run on a dedicated GPU/CPU cluster (NVENC for hardware-accelerated video encoding). Jobs are queued in Kafka, consumed by transcoding workers, and output stored as separate blobs. The metadata service maps the original blob ID to its transcoded variants.
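
A transcoding worker might look roughly like this; a sketch assuming the kafka-python client and an ffmpeg binary on the worker host, with illustrative topic, path, and variant names (VP9/AV1 variants omitted for brevity):

```python
import json
import subprocess
from kafka import KafkaConsumer  # assumes the kafka-python package

# Illustrative variant ladder; real ladders are per-content-type policies.
VIDEO_VARIANTS = [("1080p", "1920x1080"), ("720p", "1280x720"), ("480p", "854x480")]

consumer = KafkaConsumer(
    "transcode-jobs",  # hypothetical topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for msg in consumer:
    job = msg.value  # e.g., {"blob_id": "...", "local_path": "/staging/..."}
    for name, resolution in VIDEO_VARIANTS:
        out_path = f"/staging/{job['blob_id']}_{name}.mp4"
        # H.265 via libx265; swap in hevc_nvenc for NVENC hardware encoding.
        subprocess.run(
            ["ffmpeg", "-i", job["local_path"], "-s", resolution,
             "-c:v", "libx265", out_path],
            check=True,
        )
        # Upload out_path as a new blob and register the variant with the
        # metadata service (omitted here).
```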

Database Design

Blob metadata is stored in a distributed key-value store (DynamoDB or Cassandra): (blob_id, owner_id, content_type, size, etag, acl, created_at, storage_locations, transcoding_status, worm_expiry). For large-scale deployments, metadata is sharded by blob_id hash across 1,000+ partitions. Access logs (for audit and analytics) are written to an append-only Kafka topic and archived to S3 in Parquet format, queryable via Athena. Signed URL generation uses HMAC-SHA256 with a server-side secret key: the signed URL contains (blob_id, expiry_timestamp, permissions) encoded in the URL, verified server-side on each request without database lookup.
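
The signing scheme can be sketched with Python's standard library alone (parameter names and URL layout are illustrative):

```python
import hashlib
import hmac
import time

SECRET = b"server-side-secret"  # in practice, fetched from a key-management service

def sign_url(blob_id: str, expiry_ts: int, permissions: str) -> str:
    """Produce a signed URL embedding (blob_id, expiry, permissions)."""
    payload = f"{blob_id}|{expiry_ts}|{permissions}".encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return (f"/blobs/{blob_id}?expires={expiry_ts}"
            f"&perms={permissions}&sig={sig}")

def verify(blob_id: str, expiry_ts: int, permissions: str, sig: str) -> bool:
    # No database lookup: recompute the MAC and compare in constant time.
    payload = f"{blob_id}|{expiry_ts}|{permissions}".encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig) and time.time() < expiry_ts

url = sign_url("img-123", int(time.time()) + 3600, "read")  # 1-hour read URL
```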

API Design

The flows above imply a compact API surface. A plausible shape (paths and parameter names are illustrative, following the TUS-style upload protocol and signed-URL scheme described earlier):

  • POST /uploads — initiate a resumable upload; returns an upload ID and chunk/resumable upload URL(s)
  • HEAD /uploads/{upload_id} — query the current Upload-Offset to resume after an interruption
  • PATCH /uploads/{upload_id} — append the next chunk at the declared offset
  • POST /uploads/{upload_id}/complete — trigger assembly, erasure coding, and the durable write
  • GET /blobs/{blob_id} — download; honors Range for partial reads and signed-URL query parameters (expiry, permissions, signature)
  • DELETE /blobs/{blob_id} — delete; rejected for WORM blobs until worm_expiry

Scaling & Bottlenecks

Bandwidth is the primary cost and scaling factor. At 1 TB/s aggregate CDN throughput, with an average blob size of 1 MB and 1 Gbps per CDN server, 8,000 CDN servers are needed globally (1 TB/s = 8 Tbps; 8 Tbps / 1 Gbps per server = 8,000 servers). CDN providers (Cloudflare, Akamai, Fastly) manage hundreds of thousands of servers globally, making this achievable. Upload bandwidth is an order of magnitude smaller (100 GB/s vs. 1 TB/s per the requirements) but still demands significant capacity. A multi-CDN strategy (using 2–3 providers simultaneously) provides failover and enables cost arbitrage between them.
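
The server-count arithmetic, spelled out:

```python
cdn_throughput_tbps = 1 * 8     # 1 TB/s = 8 Tbps (8 bits per byte)
per_server_gbps = 1             # 1 Gbps per CDN server
servers = cdn_throughput_tbps * 1000 / per_server_gbps
print(servers)                  # 8000.0 CDN servers globally
```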

Storage hot spots occur for viral blobs (a popular image viewed 10 million times/hour). CDN edge caching handles most of this, but CDN misses (for newly viral content before the CDN cache warms up) can overwhelm the origin cluster. A request-coalescing strategy at the origin (collapsing 10,000 simultaneous CDN-miss requests for the same blob into a single backend fetch) prevents the thundering-herd problem. This is typically implemented in the caching proxy itself (e.g., Nginx's proxy_cache_lock; Varnish coalesces concurrent misses for the same object by default).
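
In application code, the same idea is a per-key single-flight guard. A minimal threaded sketch (fetch_from_durable_storage is a hypothetical stand-in for the origin's backend read; error propagation is simplified):

```python
import threading
from typing import Callable

class SingleFlight:
    """Collapse concurrent calls for the same key into one backend fetch."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._calls: dict[str, tuple[threading.Event, list]] = {}

    def do(self, key: str, fn: Callable[[], bytes]) -> bytes:
        with self._lock:
            if key in self._calls:
                event, box = self._calls[key]      # join the in-flight fetch
                leader = False
            else:
                event, box = threading.Event(), []
                self._calls[key] = (event, box)
                leader = True
        if leader:
            try:
                box.append(fn())                   # the one real backend fetch
            finally:
                with self._lock:
                    del self._calls[key]           # next miss starts fresh
                event.set()
        else:
            event.wait()                           # followers wait for the leader
        if not box:
            raise RuntimeError("leader fetch failed")
        return box[0]

# Usage at the origin: 10,000 simultaneous CDN misses for one blob become
# a single durable-storage read.
#   sf = SingleFlight()
#   data = sf.do(blob_id, lambda: fetch_from_durable_storage(blob_id))
```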

Key Trade-offs

  • CDN cache TTL vs. freshness: Long TTLs maximize CDN cache hit rates but prevent updated blobs from propagating quickly; URL versioning sidesteps this by treating each upload as a new, immutable URL
  • Chunk size vs. upload efficiency: Smaller chunks (1 MB) enable finer-grained resume but generate more HTTP requests and metadata overhead; larger chunks (100 MB) reduce overhead but require re-uploading more data on interruption (quantified in the sketch after this list)
  • Eager vs. lazy transcoding: Transcoding at upload time ensures all formats are ready immediately but wastes CPU for blobs never accessed in all formats; lazy transcoding (on first request for a specific format) is more efficient but adds latency to the first access
  • WORM immutability vs. flexibility: Immutable storage prevents accidental deletion and meets compliance requirements (HIPAA, financial records) but prevents storage reclamation until expiry
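
To quantify the chunk-size trade-off, a toy calculation assuming an interruption loses on average half of the in-flight chunk:

```python
def chunk_tradeoff(file_gb: float, chunk_mb: float) -> tuple[int, float]:
    """Return (HTTP requests needed, expected MB re-uploaded per interruption)."""
    requests_needed = int(file_gb * 1024 / chunk_mb)
    expected_reupload_mb = chunk_mb / 2   # half a chunk lost on average
    return requests_needed, expected_reupload_mb

for chunk_mb in (1, 10, 100):
    reqs, redo = chunk_tradeoff(200, chunk_mb)
    print(f"{chunk_mb:>3} MB chunks: {reqs:>7,} requests, ~{redo:.1f} MB redone")
#   1 MB chunks: 204,800 requests, ~0.5 MB redone
#  10 MB chunks:  20,480 requests, ~5.0 MB redone
# 100 MB chunks:   2,048 requests, ~50.0 MB redone
```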
