
System Design: Stories Feature (Instagram-style)

Deep dive into designing an Instagram-style Stories feature covering ephemeral content ingestion, 24-hour TTL lifecycle, viewer tracking, and CDN delivery strategies for billions of daily story views.


Requirements

Functional Requirements:

  • Users can upload photos and short videos (up to 15 seconds per segment) as Stories
  • Stories expire automatically after 24 hours
  • Users can view stories from followed accounts in a horizontal tray UI
  • Story creators can see a list of viewers with timestamps
  • Users can reply to stories via direct message
  • Support for interactive stickers (polls, questions, quizzes, sliders)

Non-Functional Requirements:

  • 500M DAU posting or viewing stories; 1 billion stories created daily
  • Story playback must start within 500ms of tap (pre-fetching required)
  • 99.95% availability; stories must not disappear before the 24-hour window
  • Viewer list must be eventually consistent within 5 seconds

Scale Estimation

500M DAU each viewing an average of 30 stories yields 15 billion story views per day — roughly 173K reads/sec. On the write side, 1 billion stories/day equals approximately 11,600 uploads/sec. Average story media size is 2MB compressed, resulting in 2PB of new storage daily. However, since stories expire in 24 hours, the steady-state storage is capped at roughly 2PB at any given moment (plus a buffer for Stories Highlights). Viewer list writes are enormous: with an average of 150 views per story and 1B stories, that is 150 billion viewer entries per day — roughly 1.7M writes/sec to the viewer tracking system.
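For reference, the figures above follow from a few lines of arithmetic; every constant below is an assumption already stated in this section.

```python
# Back-of-envelope estimation for the Stories workload.
DAU = 500_000_000
VIEWS_PER_USER = 30
STORIES_PER_DAY = 1_000_000_000
AVG_STORY_SIZE_MB = 2
AVG_VIEWS_PER_STORY = 150
SECONDS_PER_DAY = 86_400

story_views_per_day = DAU * VIEWS_PER_USER                                 # 15 billion
read_qps = story_views_per_day / SECONDS_PER_DAY                          # ~173K/sec
upload_qps = STORIES_PER_DAY / SECONDS_PER_DAY                            # ~11.6K/sec
daily_storage_pb = STORIES_PER_DAY * AVG_STORY_SIZE_MB / 1e9              # ~2 PB/day (1 PB = 1e9 MB)
viewer_writes_per_sec = STORIES_PER_DAY * AVG_VIEWS_PER_STORY / SECONDS_PER_DAY  # ~1.7M/sec

print(f"reads/sec: {read_qps:,.0f}, uploads/sec: {upload_qps:,.0f}")
print(f"new storage/day: {daily_storage_pb:.1f} PB, viewer writes/sec: {viewer_writes_per_sec:,.0f}")
```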

High-Level Architecture

The Stories architecture is split into three subsystems: ingestion, delivery, and lifecycle management. The ingestion path handles media upload: the mobile client uploads to a Media Upload Service via chunked transfer to the nearest edge PoP. The Upload Service streams bytes to S3-compatible object storage and emits a Kafka event to trigger the Media Processing Pipeline, which transcodes video to multiple bitrates (480p, 720p, 1080p), generates preview thumbnails, and runs content safety classification. Once processing completes, the Story Metadata Service writes a record to a Cassandra cluster partitioned by user_id with a TTL of 86400 seconds (24 hours), leveraging Cassandra's native TTL for automatic expiry.
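A minimal sketch of the metadata write at the end of the ingestion path, assuming the DataStax cassandra-driver and an illustrative stories_by_user table (the table and column names are hypothetical, not a production schema):

```python
import json
import uuid
from datetime import datetime, timezone
from cassandra.cluster import Cluster

STORY_TTL_SECONDS = 86_400  # 24 hours; Cassandra expires the row automatically

session = Cluster(["cassandra-seed-1"]).connect("stories")
insert_story = session.prepare(
    """
    INSERT INTO stories_by_user (user_id, created_at, story_id, media_type,
                                 media_urls, sticker_metadata)
    VALUES (?, ?, ?, ?, ?, ?) USING TTL ?
    """
)

def write_story_metadata(user_id, media_type, media_urls, sticker_metadata):
    """Persist one story segment; the row disappears after 24 hours via TTL."""
    story_id = uuid.uuid4()
    session.execute(insert_story, (
        user_id,
        datetime.now(timezone.utc),
        story_id,
        media_type,
        media_urls,                      # list<text> of per-resolution CDN URLs
        json.dumps(sticker_metadata),
        STORY_TTL_SECONDS,
    ))
    return story_id
```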

The delivery path is optimized for the tray experience. When a user opens the app, a Story Tray Service fetches the list of followed accounts that have active stories. This is powered by a Redis bitmap where each user_id has a bit indicating whether they have an active story. The client then pre-fetches the first frame of each visible story in the tray. When a user taps on a story, subsequent segments are fetched from CDN using pre-signed URLs. A separate Viewer Tracking Service records each view event into a Kafka topic, which is consumed by workers that append to a ScyllaDB table keyed by story_id.
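A sketch of the active-story flag on a Redis bitmap, assuming redis-py and an illustrative key name stories:active; each user_id maps to a single bit:

```python
import redis

r = redis.Redis(host="redis-tray", port=6379)
ACTIVE_KEY = "stories:active"

def mark_active(user_id: int) -> None:
    # Set the user's bit when they post; a lifecycle job clears it on expiry.
    r.setbit(ACTIVE_KEY, user_id, 1)

def clear_active(user_id: int) -> None:
    r.setbit(ACTIVE_KEY, user_id, 0)

def followed_with_active_stories(followed_ids: list[int]) -> list[int]:
    # Pipeline the per-bit reads so one round trip covers the whole follow list.
    pipe = r.pipeline()
    for uid in followed_ids:
        pipe.getbit(ACTIVE_KEY, uid)
    flags = pipe.execute()
    return [uid for uid, flag in zip(followed_ids, flags) if flag]
```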

Core Components

Story Lifecycle Manager

The lifecycle manager is responsible for creation, expiry, and archival of stories. On creation, it writes metadata (story_id, user_id, media_urls, created_at, sticker_data) to Cassandra with a 24-hour TTL. A separate background job runs every minute to check for stories approaching expiry and sends a Kafka event to trigger cleanup: CDN cache invalidation, media deletion from origin storage, and viewer list compaction. For Stories Highlights (user-pinned stories that persist indefinitely), the story is re-written into a long-term Cassandra table without a TTL before the original row expires, since a TTL cannot simply be removed in place.
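A sketch of the Highlights promotion path under the same illustrative schema, assuming a cassandra-driver session like the one in the ingestion sketch and a hypothetical highlights_by_user table with no TTL:

```python
def promote_to_highlight(session, user_id, created_at, story_id):
    """Copy a still-active story into the long-term Highlights table."""
    row = session.execute(
        "SELECT media_type, media_urls, sticker_metadata "
        "FROM stories_by_user WHERE user_id = %s AND created_at = %s",
        (user_id, created_at),
    ).one()
    if row is None:
        return False  # already expired; media may also be gone from origin storage

    # Re-insert without a USING TTL clause, so the copy persists indefinitely.
    session.execute(
        "INSERT INTO highlights_by_user (user_id, created_at, story_id, "
        "media_type, media_urls, sticker_metadata) VALUES (%s, %s, %s, %s, %s, %s)",
        (user_id, created_at, story_id, row.media_type, row.media_urls,
         row.sticker_metadata),
    )
    return True
```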

Story Tray Aggregation Service

The tray must show which followed accounts have active stories, ordered by recency and engagement affinity. A Redis sorted set per user_id stores followed accounts with active stories, scored by the latest story's timestamp. When a user posts a new story, a fan-out worker updates the sorted sets of all their followers. For accounts with millions of followers, a lazy evaluation approach is used: the tray service queries the active-story bitmap at read time and filters against the user's follow list. The tray response includes pre-computed CDN URLs for thumbnail previews.
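A sketch of the push-side fan-out for regular accounts, assuming redis-py, an illustrative tray:{user_id} sorted-set key, and a made-up celebrity threshold; celebrities skip this path and are merged at read time instead:

```python
import time
import redis

r = redis.Redis(host="redis-tray", port=6379)
TRAY_TTL_SECONDS = 86_400
CELEBRITY_FOLLOWER_THRESHOLD = 1_000_000  # illustrative cut-off

def fan_out_new_story(author_id: int, follower_ids: list[int]) -> None:
    if len(follower_ids) >= CELEBRITY_FOLLOWER_THRESHOLD:
        return  # pull-based path: readers merge the celebrity cache instead
    now = time.time()
    pipe = r.pipeline()
    for follower_id in follower_ids:
        key = f"tray:{follower_id}"
        pipe.zadd(key, {str(author_id): now})   # score = latest story timestamp
        pipe.expire(key, TRAY_TTL_SECONDS)      # tray entries age out with the stories
    pipe.execute()

def read_tray(user_id: int, limit: int = 30) -> list[str]:
    # Most recent story authors first.
    return [m.decode() for m in r.zrevrange(f"tray:{user_id}", 0, limit - 1)]
```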

Interactive Sticker Engine

Polls, questions, and quizzes require a real-time aggregation layer. Each sticker interaction (poll vote, quiz answer) is sent to a Sticker Interaction Service that writes to a Redis hash keyed by sticker_id. The hash fields store aggregated counts (e.g., option_a_count, option_b_count for polls). A WebSocket connection pushes live aggregation updates to the story creator's device. For high-traffic stories (celebrities), a sharded Redis approach with periodic merge ensures no single Redis node becomes a hotspot.
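A sketch of poll aggregation on a Redis hash, assuming redis-py and an illustrative sticker:{sticker_id} key layout; HINCRBY is atomic, so concurrent votes from many viewers do not race:

```python
import redis

r = redis.Redis(host="redis-stickers", port=6379)

def record_poll_vote(sticker_id: str, option: str) -> None:
    # e.g. option = "option_a_count" or "option_b_count"
    r.hincrby(f"sticker:{sticker_id}", option, 1)

def poll_results(sticker_id: str) -> dict[str, int]:
    raw = r.hgetall(f"sticker:{sticker_id}")
    return {field.decode(): int(count) for field, count in raw.items()}
```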

Database Design

Story metadata lives in Cassandra with partition key user_id and clustering key created_at DESC. Columns include story_id (UUID), media_type, media_urls (a list of CDN URLs, one per resolution), sticker_metadata (JSON), and expiry_at. Cassandra's TTL feature expires rows automatically after 24 hours, but the resulting tombstones are only purged at compaction, so the compaction strategy must be chosen for this workload; TimeWindowCompactionStrategy is the usual fit for uniformly TTL'd time-series data because entire expired SSTables can be dropped at once. Viewer lists are stored in ScyllaDB with partition key story_id and clustering key viewer_user_id, which supports efficient pagination of the viewer list.
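One possible CQL layout matching the description above (table and column names are illustrative, not a production schema), applied through a cassandra-driver session:

```python
SCHEMA = """
CREATE TABLE IF NOT EXISTS stories_by_user (
    user_id          bigint,
    created_at       timestamp,
    story_id         uuid,
    media_type       text,
    media_urls       list<text>,
    sticker_metadata text,
    PRIMARY KEY ((user_id), created_at)
) WITH CLUSTERING ORDER BY (created_at DESC)
  AND default_time_to_live = 86400
  AND compaction = {'class': 'TimeWindowCompactionStrategy',
                    'compaction_window_unit': 'HOURS',
                    'compaction_window_size': 1};

CREATE TABLE IF NOT EXISTS story_viewers (
    story_id       uuid,
    viewer_user_id bigint,
    viewed_at      timestamp,
    PRIMARY KEY ((story_id), viewer_user_id)
);
"""

def create_schema(session):
    # Apply each statement separately; the driver executes one statement per call.
    for statement in SCHEMA.split(";"):
        if statement.strip():
            session.execute(statement)
```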

Media assets are stored in S3 with a lifecycle policy that deletes objects older than 48 hours (24-hour buffer beyond story expiry for late CDN cache eviction). The CDN (Cloudflare or Fastly) caches story media with a Cache-Control max-age of 86400 and uses a purge API to invalidate on early deletion. Redis bitmaps for the active-story flag use approximately 125MB per 1 billion users — a single Redis node can hold this in memory.
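A sketch of the 48-hour origin cleanup rule, assuming boto3 and an illustrative bucket name; S3 lifecycle expiration has day-level granularity, so the rule is expressed as 2 days:

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="stories-media-origin",               # illustrative bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-story-media",
                "Filter": {"Prefix": "stories/"},
                "Status": "Enabled",
                "Expiration": {"Days": 2},       # 24h story lifetime + 24h CDN buffer
            }
        ]
    },
)
```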

API Design

  • POST /api/v1/stories — Upload a new story segment; accepts media + sticker config; returns story_id with CDN URLs
  • GET /api/v1/stories/tray?limit=30 — Fetch the story tray for the authenticated user; returns ordered list of user_ids with active stories and thumbnail URLs
  • GET /api/v1/stories/{user_id}?cursor={story_id} — Fetch all active story segments for a user with pre-signed media URLs
  • GET /api/v1/stories/{story_id}/viewers?cursor={viewer_id}&limit=50 — Paginated viewer list for a story

Scaling & Bottlenecks

The viewer tracking system faces the highest write throughput at 1.7M writes/sec. ScyllaDB handles this by sharding on story_id across a 50-node cluster using Murmur3 partitioning. Hot stories from celebrities with millions of viewers are further split using a composite partition key (story_id, shard_bucket) where shard_bucket = viewer_id % 10, distributing writes across 10 partitions per story. The viewer count is maintained as a separate Redis counter incremented atomically, avoiding a full partition scan for count queries.
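A sketch of the hot-partition mitigation, assuming an illustrative story_viewers_sharded table (the same CQL API works against ScyllaDB) and the Redis counter described above:

```python
VIEWER_SHARDS = 10

def record_view(scylla_session, redis_client, story_id, viewer_id, viewed_at):
    # Spread one hot story's writes over 10 partitions instead of one.
    shard_bucket = viewer_id % VIEWER_SHARDS
    scylla_session.execute(
        "INSERT INTO story_viewers_sharded (story_id, shard_bucket, "
        "viewer_user_id, viewed_at) VALUES (%s, %s, %s, %s)",
        (story_id, shard_bucket, viewer_id, viewed_at),
    )
    # O(1) view count without scanning all shard partitions; INCR is atomic.
    redis_client.incr(f"story:{story_id}:views")
```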

Story tray fan-out is the second bottleneck. When a celebrity posts a story, updating 100M followers' tray sorted sets is infeasible synchronously. Instead, celebrity stories use a pull-based model: the tray service checks a global "active celebrity stories" cache (refreshed every 30 seconds) and merges with the user's precomputed tray at read time. This hybrid push/pull model keeps fan-out bounded at O(followers) for regular users and O(1) for celebrities.
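A sketch of the read-time merge, assuming the redis-py tray keys from the earlier fan-out sketch plus an illustrative tray:celebrities global sorted set refreshed by a background job:

```python
def read_tray_hybrid(r, user_id: int, followed_ids: set[int], limit: int = 30):
    """Merge the user's precomputed tray with the global celebrity cache."""
    personal = r.zrevrange(f"tray:{user_id}", 0, -1, withscores=True)
    celebrities = r.zrevrange("tray:celebrities", 0, -1, withscores=True)

    entries = {int(member): score for member, score in personal}
    for member, score in celebrities:
        author_id = int(member)
        if author_id in followed_ids:            # only celebrities the user follows
            entries[author_id] = max(score, entries.get(author_id, 0))

    # Order by latest story timestamp, newest first.
    ranked = sorted(entries.items(), key=lambda kv: kv[1], reverse=True)
    return [author_id for author_id, _ in ranked[:limit]]
```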

Key Trade-offs

  • Cassandra TTL vs. explicit deletion jobs: TTL provides automatic cleanup but generates heavy tombstone load requiring careful compaction tuning — the simplicity outweighs the operational cost
  • Push tray updates vs. pull on read: Push gives instant tray freshness for regular users but is prohibitive for celebrities — the hybrid model adds complexity but bounds write amplification
  • Pre-fetching story media vs. on-demand: Fetching the first frame of all visible tray stories on app open uses 5-10MB of bandwidth but makes playback feel instant — a worthwhile trade-off for user experience
  • ScyllaDB over Cassandra for viewer lists: ScyllaDB's shard-per-core architecture handles the extreme write throughput of viewer tracking more efficiently, at the cost of a smaller ecosystem
