
System Design: Snapchat

System design of Snapchat focusing on ephemeral media delivery, Snap Map, Stories, and the unique challenge of message delivery guarantees for disappearing content.

16 min read · Updated Jan 15, 2025

Tags: system-design, snapchat, ephemeral-content, media-delivery

Requirements

Functional Requirements:

  • Users send ephemeral photo/video Snaps that disappear after viewing
  • Stories: 24-hour ephemeral posts visible to friends or public
  • Real-time chat with ephemeral messages
  • Snap Map: real-time location sharing with friends
  • Discover: curated content from publishers
  • Lenses/Filters: AR overlays applied in real-time on device

Non-Functional Requirements:

  • 400M DAU; 4B+ Snaps created daily
  • Snap delivery must be reliable (message delivery guarantee despite ephemeral content)
  • Media load latency under 500ms for direct Snaps
  • 99.9% uptime; location data freshness within 30 seconds for Snap Map

Scale Estimation

400M DAU × 10 Snaps sent/day = 4B Snaps/day ≈ 46,296 Snaps/sec. Average Snap size: photo 200KB, video 2MB. Assuming 70% photos, 30% videos: 0.7 × 200KB + 0.3 × 2MB = 140KB + 600KB = 740KB average → 46,296 × 740KB ≈ 32GB/sec of incoming media bandwidth. Daily storage: 4B × 740KB ≈ 2.96PB/day (before deletion). Snap Map: if every one of the 400M DAU reported location once per 30 seconds, that would be ~13M location updates/sec; treat this as a worst-case ceiling, since only users with the map open report location.
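These figures can be sanity-checked with a few lines of arithmetic (a sketch in Python; the 70/30 photo/video split and the per-item sizes are the assumptions stated above):

    dau = 400_000_000
    snaps_per_day = dau * 10                        # 4 billion
    snaps_per_sec = snaps_per_day / 86_400          # ~46,296
    avg_snap_kb = 0.7 * 200 + 0.3 * 2_000           # blended average: 740 KB
    ingest_gib_per_sec = snaps_per_sec * avg_snap_kb / 1024**2   # ~32.7 GiB/s incoming media
    daily_storage_pb = snaps_per_day * avg_snap_kb / 1e12        # ~2.96 PB/day before deletion
    peak_map_updates = dau / 30                     # ~13.3M/s worst-case Snap Map ceiling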

High-Level Architecture

Snapchat's architecture is built around the concept of deferred consumption. When a Snap is sent, the media is uploaded to a CDN-backed object store (AWS S3) and a delivery record is written to a Snap delivery service. The recipient's app does NOT download the Snap until they open it — instead, it receives a push notification with a pre-signed S3 URL. Upon opening, the app fetches the media directly from S3/CDN, displays it for the configured duration (1-10 seconds), then deletes the local copy. The server marks the Snap as viewed and schedules S3 deletion (with a short delay for screenshot detection).
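A minimal sketch of the download-URL generation on the notification path, assuming boto3 and a hypothetical snap-media bucket and key layout (the real naming scheme is not public):

    import boto3

    s3 = boto3.client("s3")
    snap_id = "4f3c9a"  # placeholder identifier

    # Pre-signed GET URL pushed to the recipient; media is fetched only when the Snap is opened.
    download_url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "snap-media", "Key": f"snaps/{snap_id}.heic"},
        ExpiresIn=24 * 3600,  # matches the 24-hour pending-Snap window
    )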

The backend runs on a hybrid cloud architecture (GCP + AWS) with Kubernetes for orchestration. The core services: Snap Delivery Service (tracks sent/pending/viewed state in Cassandra), Media Service (manages S3 uploads and pre-signed URL generation), Chat Service (WebSocket-based real-time messaging), Stories Service (stores 24-hour sliding window of story posts), and Snap Map Service (geospatial location aggregation and querying). All services communicate via gRPC internally.

Core Components

Snap Delivery Service

The Snap Delivery Service maintains the state machine for each Snap: UPLOADED → DELIVERED → OPENED → DELETED. State is stored in Cassandra with composite key (recipient_id, sender_id, snap_id), partitioned by recipient so pending Snaps for a user are a single-partition read (see Database Design below). Delivery is push-based: when a Snap is uploaded, the service calls Apple APNs or Google FCM to send a silent push notification to the recipient's device. The notification payload contains the snap_id and a pre-signed CDN URL valid for 24 hours. If delivery fails (device offline), Cassandra retains the pending record; the app fetches pending Snaps on next launch.
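The state machine itself is small; a sketch of the forward-only transitions (the state names follow the sequence above, but the enforcement logic is an illustrative assumption, not Snap's code):

    from enum import Enum

    class SnapState(Enum):
        UPLOADED = "uploaded"
        DELIVERED = "delivered"
        OPENED = "opened"
        DELETED = "deleted"

    # Forward-only transitions; anything else (e.g. re-opening a deleted Snap) is rejected.
    TRANSITIONS = {
        SnapState.UPLOADED: {SnapState.DELIVERED},
        SnapState.DELIVERED: {SnapState.OPENED},
        SnapState.OPENED: {SnapState.DELETED},
        SnapState.DELETED: set(),
    }

    def advance(current: SnapState, target: SnapState) -> SnapState:
        if target not in TRANSITIONS[current]:
            raise ValueError(f"illegal transition {current.name} -> {target.name}")
        return target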

Stories Service

Stories are aggregated per-user sliding windows of media posts. Each Story post (snap_id, creator_id, media_url, expires_at) is stored in Cassandra with a TTL of 24 hours — Cassandra's native TTL handles automatic deletion. Viewer lists (who has seen which Story) are stored in a separate Cassandra table. Stories from friends are pre-fetched when the app opens and cached locally. The Discover feed (publisher Stories) is served from a separate CDN-cached editorial service updated every few hours.
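Writing a Story post with a row-level TTL might look like the following sketch (cassandra-driver; the keyspace, table name, and example values are assumptions based on the fields listed above):

    import uuid
    from datetime import datetime, timedelta, timezone
    from cassandra.cluster import Cluster

    session = Cluster(["cassandra-host"]).connect("stories")  # hypothetical keyspace

    # USING TTL 86400: Cassandra expires the row 24 hours after the write, no cleanup job needed.
    session.execute(
        """
        INSERT INTO story_posts (creator_id, snap_id, media_url, expires_at)
        VALUES (%s, %s, %s, %s) USING TTL 86400
        """,
        (12345, uuid.uuid4(), "https://cdn.example.com/stories/abc.mp4",
         datetime.now(timezone.utc) + timedelta(hours=24)),
    )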

Snap Map Service

Snap Map is a geospatial feature where users share their approximate location. Location updates from active users are published to Kafka. A stream processing service (Flink) aggregates updates and stores the latest location per user in Redis Geo (a sorted set with geospatial indexing). The map client queries GEORADIUS user_locations {lon} {lat} 50 km COUNT 1000 (Redis takes longitude before latitude) to find friends near a center point. A heatmap layer uses a separate spatial aggregation pipeline (H3, Uber's hexagonal grid library) that counts users per hex cell and stores the counts in Redis.
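The write and read paths against Redis Geo, sketched with redis-py (4.x argument style; the key name follows the command above, and the coordinates are illustrative):

    import redis

    r = redis.Redis(host="localhost", port=6379, decode_responses=True)

    # Write path: the Flink sink upserts the latest (lon, lat) per user.
    r.geoadd("user_locations", (-122.4194, 37.7749, "user:12345"))

    # Read path: friends near the viewport center, capped at 50 km and 1,000 results.
    nearby = r.georadius("user_locations", -122.41, 37.77,
                         50, unit="km", count=1000, withcoord=True)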

Database Design

Snap delivery state uses Cassandra with schema: snaps table (snap_id UUID, sender_id, recipient_id, media_url, snap_type, duration, created_at, status, expires_at). Partition key is recipient_id for efficient 'pending snaps for user X' queries. A TTL of 30 days is set on all rows; ephemeral deletion happens earlier (on open). Stories use a Cassandra table with a 24-hour TTL. User relationships (friends) are stored in a separate service backed by DynamoDB.
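A sketch of the snaps table DDL implied by that schema, with recipient_id as the partition key and the 30-day safety-net TTL (the keyspace, table name, clustering column, and exact types are assumptions):

    from cassandra.cluster import Cluster

    session = Cluster(["cassandra-host"]).connect()

    # Partitioning by recipient_id makes "pending Snaps for user X" a single-partition read;
    # created_at DESC clustering returns the newest pending Snaps first.
    session.execute("""
        CREATE TABLE IF NOT EXISTS snaps.snaps_by_recipient (
            recipient_id bigint,
            created_at   timeuuid,
            snap_id      uuid,
            sender_id    bigint,
            media_url    text,
            snap_type    text,
            duration     int,
            status       text,
            expires_at   timestamp,
            PRIMARY KEY ((recipient_id), created_at)
        ) WITH CLUSTERING ORDER BY (created_at DESC)
          AND default_time_to_live = 2592000
    """)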

For chat messages, a Cassandra table stores messages with composite key (conversation_id, message_id). Ephemeral chat messages use a 24-hour TTL by default. Media objects live on S3 with lifecycle policies: Snap media is deleted after 31 days (legal retention floor); Story media is deleted after 48 hours (24h TTL + 24h grace for late viewers). Metadata about deletions is logged to S3 audit logs for compliance.
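The S3 side can be expressed as a lifecycle configuration; a boto3 sketch assuming hypothetical bucket and prefix names (lifecycle rules have day-level granularity, hence 2 days for the 48-hour Story window):

    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_lifecycle_configuration(
        Bucket="snap-media",
        LifecycleConfiguration={
            "Rules": [
                # Direct Snaps: hard-deleted at 31 days even if never opened.
                {"ID": "expire-direct-snaps", "Status": "Enabled",
                 "Filter": {"Prefix": "snaps/"}, "Expiration": {"Days": 31}},
                # Story media: 24h TTL plus a 24h grace period for late viewers.
                {"ID": "expire-story-media", "Status": "Enabled",
                 "Filter": {"Prefix": "stories/"}, "Expiration": {"Days": 2}},
            ]
        },
    )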

API Design

  • POST /v1/snap/send — Upload a Snap; multipart request with the media and recipient list; returns snap_id and a media upload URL
  • GET /v1/snap/pending — Fetch pending snap_ids and pre-signed CDN URLs for the authenticated user
  • POST /v1/snap/{snap_id}/open — Mark a Snap as opened; triggers server-side deletion scheduling (see the client sketch after this list)
  • POST /v1/map/location — Update the user's location; body contains lat, lon, accuracy, and sharing_mode
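A client-side walk through the send → pending → open flow, assuming the endpoints above plus a hypothetical host, bearer-token auth, and response field names:

    import requests

    BASE = "https://api.example-snap.com"            # hypothetical host
    HEADERS = {"Authorization": "Bearer <token>"}    # auth scheme assumed

    # 1. Sender: upload the media and recipient list in one multipart request.
    resp = requests.post(f"{BASE}/v1/snap/send", headers=HEADERS,
                         files={"media": open("snap.heic", "rb")},
                         data={"recipients": "12345", "duration": "10"})
    snap_id = resp.json()["snap_id"]                 # response field name assumed

    # 2. Recipient: list pending Snaps and their pre-signed CDN URLs.
    pending = requests.get(f"{BASE}/v1/snap/pending", headers=HEADERS).json()

    # 3. After the timed display, report the open so the server can schedule deletion.
    requests.post(f"{BASE}/v1/snap/{snap_id}/open", headers=HEADERS)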

Scaling & Bottlenecks

The media upload bandwidth (32GB/sec) is the primary infrastructure cost driver. Snapchat uses client-side media compression (HEIC for photos, H.265 for video) to reduce bandwidth, and direct client-to-S3 upload with pre-signed URLs to bypass application servers. Multipart upload is used for videos > 5MB. The CDN (CloudFront) caches Story media aggressively (Stories have predictable read patterns) but individual Snaps have low cache hit rates since they're single-recipient.
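The direct-upload path hinges on short-lived pre-signed PUT URLs issued by the Media Service; a boto3 sketch with assumed bucket, key, and expiry values:

    import boto3

    s3 = boto3.client("s3")

    # The client PUTs compressed media (HEIC/H.265) straight to this URL,
    # so application servers never proxy the ~32 GB/sec of incoming bytes.
    upload_url = s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": "snap-media", "Key": "snaps/4f3c9a.heic",
                "ContentType": "image/heic"},
        ExpiresIn=300,  # short-lived: only valid for the immediate upload
    )

For videos over the 5MB threshold, one common pattern (an assumption here, not a confirmed Snapchat detail) is to create a multipart upload server-side and pre-sign each UploadPart request in the same way.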

Snap Map's worst-case 13M location updates/sec is handled by a Kafka-based ingestion pipeline with Flink stream processors maintaining the Redis Geo store. Location precision is intentionally degraded (snapped to a ~100m grid): the coarser cells protect privacy, and clients can skip updates that would not change the cell, cutting update volume. Rate limiting: location updates are throttled to one per 30 seconds per user. The geospatial query for friend locations uses Redis Geo with a 50km radius cap and a count limit of 1,000 to bound query latency.
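A sketch of the two ingest-side guards mentioned above: coordinate snapping to a ~100m grid and a per-user 30-second throttle (the Redis key naming and grid math are assumptions):

    import math
    import redis

    r = redis.Redis(host="localhost", port=6379)

    def snap_to_grid(lat: float, lon: float, grid_m: float = 100.0) -> tuple[float, float]:
        """Round a coordinate to roughly a 100 m grid before it enters the pipeline."""
        step_lat = grid_m / 111_320                                    # ~degrees per 100 m north-south
        step_lon = grid_m / (111_320 * max(math.cos(math.radians(lat)), 0.01))
        return (round(lat / step_lat) * step_lat, round(lon / step_lon) * step_lon)

    def allow_location_update(user_id: int) -> bool:
        # SET NX EX 30 acts as a token: at most one update per user per 30 seconds.
        return bool(r.set(f"loc-throttle:{user_id}", 1, nx=True, ex=30))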

Key Trade-offs

  • Client-side deletion vs server-side: True ephemerality requires both server deletion (automated by Cassandra TTL) and client enforcement; screenshots are detectable but not preventable — this is a fundamental limitation
  • Pre-signed URLs for media delivery: Bypasses application servers for media downloads (better scalability) but requires careful URL expiry management to prevent unauthorized access after Snap expiry
  • Cassandra TTL for ephemeral data: Native TTL-based deletion is elegant and scalable but has ~±1 minute precision; a separate cleanup job handles stragglers
  • Degraded location precision on Snap Map: Rounding to 100m grid reduces storage precision and update frequency, improving privacy and reducing infrastructure load — an intentional product-infrastructure co-design decision
