
System Design: Video Thumbnail Generation Service

System design of a video thumbnail generation service covering frame extraction, AI-powered smart thumbnail selection, A/B testing, sprite sheet generation, and serving billions of thumbnails daily.

16 min read · Updated Jan 15, 2025
Tags: system-design, thumbnails, image-processing, cdn, computer-vision

Requirements

Functional Requirements:

  • Extract candidate thumbnail frames from uploaded videos at configurable intervals
  • AI-powered smart thumbnail selection: choose the most visually appealing and representative frame
  • Generate thumbnails in multiple resolutions (120×90 to 1280×720) and formats (JPEG, WebP, AVIF)
  • Create scrub preview sprite sheets for seek-bar hover previews
  • Support custom thumbnail uploads by creators
  • A/B test multiple thumbnails per video to optimize click-through rate

Non-Functional Requirements:

  • Process 500K new videos per day, generating thumbnails within 60 seconds of upload
  • Serve 50 billion thumbnail requests per day with p99 latency under 50ms
  • Storage efficient: average 500KB per video across all thumbnail variants
  • 99.99% thumbnail availability — missing thumbnails directly impact browse experience
  • Support content-aware cropping for different aspect ratios (16:9, 4:3, 1:1, 9:16)

Scale Estimation

  • Generation: 500K videos/day × 10 candidate frames × 6 resolutions × 3 formats = 180 thumbnail images per video, or 90M thumbnails generated per day. At an average of 15KB per thumbnail, that is ~1.35TB/day of generated thumbnail data.
  • Storage: smart selection keeps only the best 3 candidates × 6 resolutions × 3 formats = 54 images stored per video, or 27M stored images/day.
  • Sprite sheets: 500K × 200KB average = 100GB/day.
  • Serving: 50B requests/day ≈ 580K requests/sec; at an average 20KB response, that is ~93 Gbps of sustained CDN egress for thumbnails alone.
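A quick back-of-envelope check of these figures (a minimal sketch; the per-thumbnail and per-response sizes are the averages assumed above):

    # Back-of-envelope check of the scale estimates above (assumed averages).
    VIDEOS_PER_DAY = 500_000
    CANDIDATES, RESOLUTIONS, FORMATS = 10, 6, 3
    AVG_THUMB_KB = 15

    generated_per_video = CANDIDATES * RESOLUTIONS * FORMATS       # 180
    generated_per_day = VIDEOS_PER_DAY * generated_per_video       # 90M
    generated_tb_per_day = generated_per_day * AVG_THUMB_KB / 1e9  # ~1.35 TB

    stored_per_video = 3 * RESOLUTIONS * FORMATS                   # 54
    stored_per_day = VIDEOS_PER_DAY * stored_per_video             # 27M

    REQUESTS_PER_DAY = 50e9
    rps = REQUESTS_PER_DAY / 86_400                                # ~580K req/s
    egress_gbps = rps * 20_000 * 8 / 1e9                           # ~93 Gbps

    print(f"{generated_per_day:,} generated/day, {generated_tb_per_day:.2f} TB/day")
    print(f"{stored_per_day:,} stored/day, {rps:,.0f} req/s, {egress_gbps:.0f} Gbps")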

High-Level Architecture

The thumbnail service consists of three subsystems: Generation, Selection, and Serving. The Generation subsystem consumes video-uploaded events from Kafka. A Frame Extractor worker pool runs FFmpeg to extract frames at uniform intervals (every 5 seconds for short videos, every 30 seconds for long videos) plus scene-change detection frames (using FFmpeg's select=gt(scene,0.4) filter). Extracted frames are stored temporarily in S3 as raw PNGs.

The Selection subsystem runs an ML pipeline on the candidate frames. A quality scorer (CNN-based) rates each frame on sharpness, lighting, composition, and face presence. A representativeness scorer computes how well the frame captures the video's overall content (using CLIP similarity between frame embedding and video-level embedding). The top 3 frames are selected. Each selected frame is then processed by the Image Processing Pipeline: resize to 6 standard resolutions, encode in JPEG/WebP/AVIF, apply content-aware cropping for non-16:9 aspect ratios, and write to the permanent thumbnail S3 bucket.
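A minimal sketch of the representativeness score, assuming the frame and video-level CLIP embeddings have already been computed upstream (NumPy only; one simple choice, assumed here, is to approximate the video-level embedding as the mean of all frame embeddings):

    import numpy as np

    def representativeness(frame_emb: np.ndarray, video_emb: np.ndarray) -> float:
        """Cosine similarity between a frame embedding and the video-level embedding."""
        f = frame_emb / np.linalg.norm(frame_emb)
        v = video_emb / np.linalg.norm(video_emb)
        return float(np.dot(f, v))

    def video_level_embedding(frame_embeddings: np.ndarray) -> np.ndarray:
        # Assumption: mean pooling over per-frame embeddings as the video summary.
        return frame_embeddings.mean(axis=0)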

The Serving subsystem uses a multi-tier CDN with aggressive caching. Thumbnail URLs are content-addressed: thumbs/{video_id}/{hash}.{format}. Since thumbnails are immutable (a new thumbnail gets a new hash), CDN cache-control is set to max-age=31536000, immutable. A Thumbnail Router API accepts requests with parameters (video_id, width, format) and resolves to the appropriate CDN URL via a Redis-backed metadata lookup. For A/B testing, the router returns different thumbnail variants based on a hash of the viewer's user_id.
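A sketch of the deterministic variant assignment the router might use for A/B tests. Hashing the viewer's user_id keeps each viewer in the same bucket across requests; including the experiment_id in the hash (an added assumption, not stated above) keeps assignments independent across experiments:

    import hashlib

    def assign_variant(user_id: str, experiment_id: str, traffic_split: float) -> str:
        """Deterministically bucket a viewer into variant A or B."""
        digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
        bucket = int(digest[:8], 16) / 0xFFFFFFFF   # uniform value in [0, 1]
        return "A" if bucket < traffic_split else "B"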

Core Components

Frame Extraction Engine

The Frame Extractor runs as a horizontally scaled Kubernetes deployment. Each pod pulls a job from an SQS queue, downloads the source video from S3 (or reads directly from the transcoding pipeline's intermediate storage), and runs FFmpeg with two extraction strategies in parallel. The uniform extraction runs ffmpeg -i input.mp4 -vf fps=1/5 -q:v 2 frame_%04d.jpg to get one frame every 5 seconds. The scene-change extraction runs ffmpeg -i input.mp4 -vf "select=gt(scene,0.4),showinfo" -vsync vfr scene_%04d.jpg to capture transition points. Duplicate and near-duplicate frames are removed using perceptual hashing (pHash with Hamming distance < 5). The result is 10-30 unique candidate frames per video.
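A minimal sketch of the near-duplicate filter, assuming the imagehash and Pillow libraries and the Hamming-distance threshold of 5 mentioned above:

    from pathlib import Path

    import imagehash                 # perceptual hashing library (assumed dependency)
    from PIL import Image

    HAMMING_THRESHOLD = 5

    def dedupe_frames(frame_paths: list[Path]) -> list[Path]:
        """Keep only frames whose pHash differs from every kept frame by >= threshold."""
        kept, kept_hashes = [], []
        for path in frame_paths:
            h = imagehash.phash(Image.open(path))
            # imagehash overloads '-' to return the Hamming distance between hashes.
            if all(h - other >= HAMMING_THRESHOLD for other in kept_hashes):
                kept.append(path)
                kept_hashes.append(h)
        return kept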

Smart Thumbnail Selector

The selection model is a multi-task CNN (EfficientNet-B4 backbone) trained on historical click-through data. The model takes a 224×224 center crop of each candidate frame and predicts: aesthetic_score (trained on AVA dataset), click_probability (trained on impression-click pairs from the platform's A/B test logs), and text_legibility_score (predicting whether overlaid title text will be readable). The composite selection score is a weighted sum: 0.5 × click_probability + 0.3 × aesthetic_score + 0.2 × text_legibility_score. The model runs on GPU inference servers (NVIDIA T4) with batch inference — 100 frames processed in ~200ms per batch.
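A sketch of the composite scoring step, assuming the three model heads have already produced per-frame scores (the weights are the ones stated above; names are illustrative):

    from dataclasses import dataclass

    # Weights from the composite selection score described above.
    WEIGHTS = {"click": 0.5, "aesthetic": 0.3, "legibility": 0.2}

    @dataclass
    class FrameScores:
        frame_id: str
        click_probability: float
        aesthetic_score: float
        text_legibility_score: float

        def composite(self) -> float:
            return (WEIGHTS["click"] * self.click_probability
                    + WEIGHTS["aesthetic"] * self.aesthetic_score
                    + WEIGHTS["legibility"] * self.text_legibility_score)

    def select_top_frames(scored: list[FrameScores], k: int = 3) -> list[FrameScores]:
        """Pick the k best candidates by composite score (the service keeps the top 3)."""
        return sorted(scored, key=lambda s: s.composite(), reverse=True)[:k]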

Sprite Sheet Generator

Scrub preview sprite sheets are contact sheets of miniature thumbnails displayed when the user hovers over the seek bar. The generator creates a grid of 160×90 thumbnails, one per second of video (or one per 5 seconds for long videos). For a 10-minute video at 1 frame/sec, the sprite sheet is a 10×60 grid = 600 thumbnails at 160×90 each, assembled into a single JPEG at ~200KB. The sprite sheet is accompanied by a WebVTT file mapping timestamp ranges to x,y coordinates in the sprite image. The player uses CSS background-position to display the appropriate thumbnail on hover.
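A sketch of the accompanying WebVTT generation, assuming a 10-column grid of 160×90 tiles and one tile per second; the cues follow the common thumbnails-in-WebVTT convention of pointing at the sprite with a #xywh media-fragment suffix:

    def format_ts(seconds: int) -> str:
        h, rem = divmod(seconds, 3600)
        m, s = divmod(rem, 60)
        return f"{h:02d}:{m:02d}:{s:02d}.000"

    def sprite_webvtt(sprite_url: str, tile_count: int, cols: int = 10,
                      tile_w: int = 160, tile_h: int = 90, interval: int = 1) -> str:
        """Emit WebVTT cues mapping each time range to x,y,w,h within the sprite."""
        cues = ["WEBVTT", ""]
        for i in range(tile_count):
            x, y = (i % cols) * tile_w, (i // cols) * tile_h
            cues.append(f"{format_ts(i * interval)} --> {format_ts((i + 1) * interval)}")
            cues.append(f"{sprite_url}#xywh={x},{y},{tile_w},{tile_h}")
            cues.append("")
        return "\n".join(cues)

The player downloads this file once, then looks up the cue covering the hovered timestamp and sets the CSS background-position from the x,y offsets.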

Database Design

Thumbnail metadata is stored in PostgreSQL: Thumbnails table (thumbnail_id, video_id, source_type [auto/custom], frame_timestamp_ms, selection_score, is_active, created_at). Each thumbnail has entries in a ThumbnailVariants table (thumbnail_id, width, height, format, file_size_bytes, s3_path, cdn_url). The active thumbnail for each video is cached in Redis as a hash map: thumb:{video_id} → {format → cdn_url} for O(1) lookups during page rendering.

A/B test configuration is stored in a separate ExperimentsDB (PostgreSQL): Experiments table (experiment_id, video_id, variant_a_thumbnail_id, variant_b_thumbnail_id, traffic_split, start_date, end_date, status). Results (impressions and clicks per variant) are aggregated in ClickHouse from client telemetry events. A nightly batch job evaluates experiment results (chi-squared test for statistical significance) and promotes the winning variant to the active thumbnail.
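A sketch of the nightly significance check, assuming per-variant impression and click counts have already been aggregated from ClickHouse; it runs SciPy's chi-squared test on the 2×2 contingency table of clicks vs non-clicks:

    from scipy.stats import chi2_contingency

    def evaluate_experiment(imp_a: int, clicks_a: int, imp_b: int, clicks_b: int,
                            alpha: float = 0.05) -> str | None:
        """Return the winning variant ('A' or 'B') if the CTR difference is significant."""
        table = [[clicks_a, imp_a - clicks_a],
                 [clicks_b, imp_b - clicks_b]]
        _, p_value, _, _ = chi2_contingency(table)
        if p_value >= alpha:
            return None  # not significant: keep the experiment running or call it a tie
        ctr_a, ctr_b = clicks_a / imp_a, clicks_b / imp_b
        return "A" if ctr_a > ctr_b else "B"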

API Design

  • GET /api/v1/thumbnails/{video_id}?w=320&h=180&fmt=webp — Fetch the active thumbnail for a video; returns a 302 redirect to the CDN URL (see the router sketch after this list)
  • POST /api/v1/thumbnails/{video_id}/custom — Upload a custom thumbnail; body is multipart image; triggers validation and resizing pipeline
  • GET /api/v1/thumbnails/{video_id}/sprite.jpg — Fetch the scrub preview sprite sheet
  • POST /api/v1/thumbnails/{video_id}/experiment — Create an A/B test between two thumbnails; body contains variant thumbnail IDs and traffic split
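A minimal sketch of the Thumbnail Router's redirect path, assuming a FastAPI service and the Redis hash layout described under Database Design (host name and key names are illustrative; resolution selection and A/B variant routing are elided for brevity):

    from fastapi import FastAPI, HTTPException
    from fastapi.responses import RedirectResponse
    import redis.asyncio as redis

    app = FastAPI()
    cache = redis.Redis(host="thumbnail-cache", decode_responses=True)

    @app.get("/api/v1/thumbnails/{video_id}")
    async def get_thumbnail(video_id: str, w: int = 320, h: int = 180, fmt: str = "webp"):
        # thumb:{video_id} is a Redis hash of format -> CDN URL for the active thumbnail.
        cdn_url = await cache.hget(f"thumb:{video_id}", fmt)
        if cdn_url is None:
            raise HTTPException(status_code=404, detail="thumbnail not found")
        # The CDN URL is content-addressed and immutable, so it caches with a one-year TTL.
        return RedirectResponse(url=cdn_url, status_code=302)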

Scaling & Bottlenecks

The primary bottleneck is the frame extraction step, which requires downloading and decoding the source video. For a 10-minute 1080p video, FFmpeg must decode ~18,000 frames even though only 10-30 are extracted. Optimization: use FFmpeg's -ss (seek) flag to jump directly to target timestamps without decoding intermediate frames, reducing extraction time from minutes to seconds. For scene-change detection (which requires decoding all frames), use a lightweight decoder (NVDEC on GPU) or sample at a reduced resolution (360p) for scene analysis.
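A sketch of the fast-seek extraction, assuming the target timestamps are known in advance. Placing -ss before -i makes FFmpeg seek on the input (keyframe-accurate, then decode forward briefly) instead of decoding every frame up to the timestamp:

    import subprocess

    def extract_frame(video_path: str, timestamp_s: float, out_path: str) -> None:
        """Extract a single frame near timestamp_s using input-side seeking."""
        subprocess.run(
            [
                "ffmpeg",
                "-ss", str(timestamp_s),   # input-side seek: jump before decoding starts
                "-i", video_path,
                "-frames:v", "1",          # decode and write a single frame
                "-q:v", "2",
                out_path,
            ],
            check=True,
        )

    # Example: pull candidate frames at 30-second intervals from a 10-minute source.
    for i, ts in enumerate(range(0, 600, 30)):
        extract_frame("input.mp4", ts, f"frame_{i:04d}.jpg")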

Serving 50B requests/day at 580K/sec requires aggressive CDN caching. Since thumbnail URLs are content-addressed and immutable, the CDN cache hit rate approaches 99%. The remaining 1% of cache misses (new thumbnails, cold PoPs) must be handled by the origin (S3 behind CloudFront with origin shield). S3 handles origin requests at ~5,800 req/sec (the 1% of 580K), well within S3's capacity. Image format negotiation (serving WebP to Chrome, AVIF to supported browsers, JPEG as fallback) is handled by the CDN edge using Accept header inspection.
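The negotiation logic itself is simple; a sketch in Python for illustration (real deployments would express this as CDN edge functions or rules, and a full implementation would also honor Accept q-values):

    PREFERRED_FORMATS = ("image/avif", "image/webp")  # in order of preference

    def negotiate_format(accept_header: str) -> str:
        """Pick the best image format the client advertises; JPEG is the fallback."""
        accepted = {part.split(";")[0].strip() for part in accept_header.split(",")}
        for fmt in PREFERRED_FORMATS:
            if fmt in accepted:
                return fmt.split("/")[1]   # "avif" or "webp"
        return "jpeg"

    # Example: "image/avif,image/webp,image/apng,*/*;q=0.8" -> "avif"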

Key Trade-offs

  • AI-selected vs creator-selected thumbnails: AI selection provides a baseline quality for all videos, but creator-selected thumbnails often outperform because creators know their audience — the system defaults to AI but allows creator override
  • Multiple formats (JPEG/WebP/AVIF) vs JPEG only: WebP is 25-30% smaller than JPEG, AVIF 40-50% smaller, but format negotiation adds CDN complexity — the bandwidth savings at 50B requests/day make multi-format essential
  • Sprite sheets vs individual frame requests: A single sprite sheet request replaces hundreds of individual thumbnail requests during scrubbing, reducing CDN load by 100x — the trade-off is downloading the entire sprite upfront (~200KB)
  • Content-addressed URLs vs mutable URLs: Content-addressing enables immutable CDN caching (infinite TTL) but requires updating all references when a thumbnail changes — a redirect layer (video_id → current hash) handles this elegantly
