
System Design: Video Transcoding Pipeline

System design of a scalable video transcoding pipeline covering FFmpeg worker orchestration, DAG-based task scheduling, codec selection, and fault-tolerant processing for millions of videos daily.

17 min read · Updated Jan 15, 2025
Tags: system-design, video-transcoding, ffmpeg, distributed-systems

Requirements

Functional Requirements:

  • Accept video uploads in any common format (MP4, MOV, AVI, MKV, WebM)
  • Transcode each video into a configurable bitrate ladder (e.g., 7 resolutions × 3 codecs)
  • Generate thumbnails at configurable intervals
  • Extract and index audio tracks, subtitles, and metadata
  • Support priority queues (paid users processed before free tier)
  • Provide webhook/polling API for job status and progress

Non-Functional Requirements:

  • Process 1 million videos per day with P99 completion time under 10 minutes for a 10-minute video
  • At-least-once processing guarantee — no video is silently dropped
  • Horizontal scalability: handle 10x traffic spikes during events
  • Fault tolerant: worker crashes do not lose partially processed work
  • Cost efficient: use spot/preemptible instances for 70%+ of compute

Scale Estimation

  • Ingest: 1 million videos/day ≈ 12 videos/sec; average source is 5 minutes at 1080p
  • Renditions: ~20 outputs per video (7 resolutions × 3 codecs)
  • Encode time per rendition of a 5-minute source: ~2 minutes on a modern CPU (x264 medium preset) or ~12 seconds on GPU (NVENC)
  • Compute: 1M videos × 20 renditions × 2 min ≈ 667K CPU-hours/day, or 1M × 20 × 0.2 min ≈ 67K GPU-hours/day on GPU
  • Storage: 1M × 500 MB average encoded output ≈ 500 TB/day, plus ~2 PB of intermediate scratch space (assuming a 2x working set for segment-parallel encoding)
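
A quick back-of-the-envelope check of these figures (a sketch in Python; the per-rendition encode times and output size are the assumptions listed above):

  # Sanity-check the scale estimates above.
  VIDEOS_PER_DAY = 1_000_000
  RENDITIONS = 20                 # ~7 resolutions x 3 codecs
  CPU_MIN_PER_RENDITION = 2.0     # x264 medium preset, 5-minute 1080p source
  GPU_MIN_PER_RENDITION = 0.2     # NVENC, ~12 seconds
  AVG_OUTPUT_MB = 500

  ingest_rate = VIDEOS_PER_DAY / 86_400                                   # ~11.6 videos/sec
  cpu_hours = VIDEOS_PER_DAY * RENDITIONS * CPU_MIN_PER_RENDITION / 60    # ~667K CPU-hours/day
  gpu_hours = VIDEOS_PER_DAY * RENDITIONS * GPU_MIN_PER_RENDITION / 60    # ~67K GPU-hours/day
  output_tb = VIDEOS_PER_DAY * AVG_OUTPUT_MB / 1_000_000                  # ~500 TB/day

  print(f"{ingest_rate:.1f} videos/sec, {cpu_hours:,.0f} CPU-h/day, "
        f"{gpu_hours:,.0f} GPU-h/day, {output_tb:.0f} TB/day encoded output")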

High-Level Architecture

The pipeline is organized as a DAG (Directed Acyclic Graph) of tasks orchestrated by a workflow engine. When a video is uploaded, the Upload Service writes the raw file to S3 and publishes a job to the Job Queue (SQS or Kafka). The Orchestrator (Apache Temporal or a custom state machine) decomposes the job into a DAG of tasks: (1) Probe — extract codec info, resolution, duration using ffprobe; (2) Split — segment the video into 4-second chunks at GOP boundaries; (3) Encode — transcode each segment for each rendition in parallel; (4) Merge — concatenate encoded segments into final output; (5) Package — generate DASH/HLS manifests; (6) Postprocess — thumbnail extraction, audio normalization, metadata tagging.
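
A sketch of how one uploaded video might expand into a task DAG (the task naming and dependency encoding here are illustrative, not a Temporal API):

  # Hypothetical expansion of one job into {task_id: [dependency_ids]}.
  import math

  def build_dag(duration_s: float, renditions: list[str], segment_s: int = 4) -> dict:
      n_segments = math.ceil(duration_s / segment_s)
      dag = {"probe": [], "split": ["probe"]}
      for r in renditions:
          encode_ids = []
          for i in range(n_segments):
              tid = f"encode/{r}/{i:05d}"
              dag[tid] = ["split"]              # every segment encodes independently, in parallel
              encode_ids.append(tid)
          dag[f"merge/{r}"] = encode_ids        # merge waits for all of that rendition's segments
      dag["package"] = [f"merge/{r}" for r in renditions]
      dag["postprocess"] = ["package"]
      return dag

  dag = build_dag(duration_s=300, renditions=["1080p_h264", "720p_h264", "480p_vp9"])
  print(len(dag), "tasks")                      # 232 tasks for a 5-minute source, 3 renditions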

Each task is dispatched to a Worker Pool via a task queue (one queue per task type). Workers pull tasks, execute FFmpeg commands, write outputs to S3, and report completion to the Orchestrator. If a worker crashes (spot instance preemption), the Orchestrator detects the missing heartbeat and re-queues the task. The Orchestrator maintains the DAG state in a durable store (PostgreSQL or DynamoDB): individual tasks may run more than once (re-runs are safe because each task writes to a deterministic, task-keyed S3 path), but every DAG state transition is recorded exactly once, so the job progresses correctly across failures.
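
A minimal sketch of the heartbeat-based requeue, assuming each task record carries a last_heartbeat timestamp (the field names and the 60-second timeout are assumptions):

  # Requeue tasks whose workers have stopped heartbeating (e.g., spot preemption).
  # Re-running is safe because each task writes to a deterministic S3 output key.
  import time

  HEARTBEAT_TIMEOUT_S = 60      # assumed timeout; tune to the heartbeat interval

  def reap_stale_tasks(tasks: list[dict], now: float | None = None) -> list[dict]:
      now = now or time.time()
      requeued = []
      for task in tasks:
          if task["status"] == "RUNNING" and now - task["last_heartbeat"] > HEARTBEAT_TIMEOUT_S:
              task["status"] = "PENDING"        # back onto the queue for any idle worker
              task["worker_id"] = None
              task["retry_count"] += 1
              requeued.append(task)
      return requeued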

A Priority Router sits between the Job Queue and the Orchestrator, assigning priority levels (P0 for live/premium, P1 for standard paid, P2 for free tier). Higher-priority jobs preempt lower-priority ones by draining workers — lower-priority tasks are paused and requeued.

Core Components

DAG Orchestrator

The Orchestrator (built on Apache Temporal or AWS Step Functions) models each transcode job as a DAG whose size depends on video length and rendition count; with 4-second segments and ~20 renditions, a 5-minute video expands to roughly 1,500 encode tasks plus merge, package, and postprocess steps. Each node in the DAG represents an atomic task (encode segment N for rendition R). Dependencies are modeled as edges: Merge depends on all Encode tasks for that rendition completing; Package depends on all Merge tasks. The Orchestrator uses a work-stealing scheduler — idle workers pull the next available task from a priority queue, maximizing utilization. DAG state is checkpointed to PostgreSQL every 5 seconds; on Orchestrator crash, a standby takes over from the last checkpoint.
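
A sketch of the pull an idle worker performs against the DAG state (an in-memory stand-in; the real state lives in the durable store):

  # Hand out the highest-priority task whose dependencies have all completed.
  def ready_tasks(dag: dict[str, list[str]], status: dict[str, str]) -> list[str]:
      """dag maps task_id -> dependency ids; status maps task_id -> PENDING/RUNNING/DONE."""
      return [t for t, deps in dag.items()
              if status[t] == "PENDING" and all(status[d] == "DONE" for d in deps)]

  def next_task(dag, status, priority: dict[str, int]) -> str | None:
      candidates = ready_tasks(dag, status)
      if not candidates:
          return None
      return min(candidates, key=lambda t: priority.get(t, 2))   # P0 (0) beats P1 beats P2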

FFmpeg Worker Fleet

Workers are stateless containers (ECS or Kubernetes pods) running FFmpeg. Each worker pulls a task descriptor from SQS, downloads the input segment from S3, runs the FFmpeg command, uploads the output to S3, and ACKs the task. Workers are a mix of CPU instances (c6i.8xlarge for x264/x265) and GPU instances (g5.xlarge with NVENC for real-time encoding). Spot instances constitute 80% of the fleet; a Spot Interruption Handler gracefully saves in-progress work to S3 and requeues the task within the 2-minute warning that precedes the instance being reclaimed.
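
A simplified worker loop, assuming boto3 for SQS/S3 and an ffmpeg binary on the PATH; the queue URL, S3 keys, and task message format are placeholders:

  # Sketch of one worker iteration: pull a task, transcode, upload, ACK.
  import json, subprocess
  import boto3

  sqs = boto3.client("sqs")
  s3 = boto3.client("s3")
  QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/encode-tasks"   # placeholder

  def run_once() -> None:
      resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
      for msg in resp.get("Messages", []):
          task = json.loads(msg["Body"])        # {bucket, input_key, output_key, ffmpeg_args}
          s3.download_file(task["bucket"], task["input_key"], "/tmp/in.mp4")
          cmd = ["ffmpeg", "-y", "-i", "/tmp/in.mp4", *task["ffmpeg_args"], "/tmp/out.mp4"]
          subprocess.run(cmd, check=True)       # a failed encode raises, so the message is not ACKed
          s3.upload_file("/tmp/out.mp4", task["bucket"], task["output_key"])
          sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])   # ACK

If the message is never deleted, SQS redelivers it after the visibility timeout, which is what gives the at-least-once guarantee.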

Quality Analysis Service

After encoding, a Quality Analysis Service computes VMAF scores for each rendition by comparing against the source. If a rendition's VMAF score falls below a threshold (e.g., 80 for 720p), the rendition is re-encoded with a higher bitrate. This feedback loop ensures consistent perceptual quality across all content types. The service runs on GPU instances using Netflix's VMAF tool and processes results asynchronously — quality verification does not block the critical path of making content available.
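
A sketch of the quality gate, assuming an FFmpeg build with libvmaf; the thresholds, file paths, and JSON log schema (libvmaf 2.x) are assumptions:

  # Score a rendition against the source and flag it for re-encode if it misses the gate.
  import json, subprocess

  VMAF_THRESHOLDS = {"1080p": 88, "720p": 80, "480p": 70}    # example per-resolution gates

  def vmaf_score(distorted: str, reference: str, log_path: str = "/tmp/vmaf.json") -> float:
      cmd = ["ffmpeg", "-i", distorted, "-i", reference,
             "-lavfi", f"libvmaf=log_fmt=json:log_path={log_path}",
             "-f", "null", "-"]
      subprocess.run(cmd, check=True, capture_output=True)
      with open(log_path) as f:
          return json.load(f)["pooled_metrics"]["vmaf"]["mean"]    # schema per libvmaf 2.x

  def needs_reencode(distorted: str, reference: str, resolution: str) -> bool:
      return vmaf_score(distorted, reference) < VMAF_THRESHOLDS[resolution]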

Database Design

Job metadata is stored in PostgreSQL: Jobs table (job_id, video_id, priority, status, created_at, completed_at, error_message), Tasks table (task_id, job_id, task_type, status, worker_id, started_at, completed_at, input_s3_path, output_s3_path, retry_count). A partial index on status='PENDING' enables efficient polling by workers. The Orchestrator uses advisory locks in PostgreSQL to prevent double-scheduling of tasks.
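
A sketch of the polling index and the advisory-lock guard, assuming the psycopg (v3) driver and an integer task_id; the index definition mirrors the Tasks table above:

  # Partial index so workers only scan pending rows, plus an advisory lock that
  # prevents two Orchestrator instances from dispatching the same task.
  import psycopg

  PENDING_INDEX = """
  CREATE INDEX IF NOT EXISTS tasks_pending_idx
      ON tasks (task_type) WHERE status = 'PENDING';
  """

  def try_schedule(conn: psycopg.Connection, task_id: int) -> bool:
      """True only for the caller that wins the lock; held until the transaction ends."""
      with conn.cursor() as cur:
          cur.execute("SELECT pg_try_advisory_xact_lock(%s)", (task_id,))
          return cur.fetchone()[0]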

Rendition metadata (codec, resolution, bitrate, VMAF score, file size, S3 path) is stored in a separate Renditions table linked to video_id. This table is queried by the video serving layer to construct DASH/HLS manifests. A Redis cache stores active job status for the progress API (job_id → {progress_pct, current_task, eta_seconds}), updated by workers on each task completion.
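
A sketch of the progress update a worker writes after each task, assuming redis-py; the key layout follows the job_id → {progress_pct, current_task, eta_seconds} mapping above:

  # Keep the active-job progress cache fresh for the status/progress API.
  import redis

  r = redis.Redis(host="localhost", port=6379, decode_responses=True)

  def report_progress(job_id: str, done: int, total: int, current_task: str, eta_s: int) -> None:
      r.hset(f"job:{job_id}", mapping={
          "progress_pct": round(100 * done / total, 1),
          "current_task": current_task,
          "eta_seconds": eta_s,
      })
      r.expire(f"job:{job_id}", 86_400)    # let finished jobs age out after a day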

API Design

  • POST /api/v1/transcode — Submit a transcode job; body contains video_id, s3_input_path, output_config (resolutions, codecs, bitrate_mode); returns job_id (example request after this list)
  • GET /api/v1/transcode/{job_id}/status — Poll job status; returns progress percentage, current task, ETA, and any errors
  • POST /api/v1/transcode/{job_id}/cancel — Cancel a running job; in-progress tasks are aborted, completed segments are cleaned up
  • GET /api/v1/transcode/{job_id}/outputs — List all completed renditions with S3 URLs, codec info, VMAF scores
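
An example client interaction (a sketch using the requests library; the host, field values, and any response fields beyond those listed above are placeholders):

  # Submit a job, then poll its status.
  import requests

  BASE = "https://transcode.example.com/api/v1"    # placeholder host

  job = requests.post(f"{BASE}/transcode", json={
      "video_id": "vid_123",
      "s3_input_path": "s3://uploads/vid_123/source.mov",
      "output_config": {
          "resolutions": ["1080p", "720p", "480p"],
          "codecs": ["h264", "hevc", "vp9"],
          "bitrate_mode": "vbr",
      },
  }).json()

  status = requests.get(f"{BASE}/transcode/{job['job_id']}/status").json()
  print(status["progress_pct"], status.get("current_task"))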

Scaling & Bottlenecks

The encoding step dominates pipeline latency and cost (90%+ of compute). Segment-parallel encoding — splitting the video into chunks and encoding each independently — converts a serial 10-minute job into a parallel job that completes in seconds with enough workers. The key challenge is ensuring consistent quality across segment boundaries: each segment must start with a keyframe, and the rate controller must be initialized with parameters from the previous segment's final state (lookahead buffer sharing via a metadata sidecar).
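
A sketch of the Split and Merge steps around the parallel encode, driving FFmpeg's segment muxer and concat demuxer from Python; stream-copy splitting cuts only at keyframes, which is what keeps segments on GOP boundaries:

  # Split the source into ~4-second, keyframe-aligned chunks; later, stitch the
  # independently encoded chunks back together without re-encoding.
  import glob, subprocess

  def split(source: str, seg_seconds: int = 4) -> list[str]:
      subprocess.run(["ffmpeg", "-y", "-i", source, "-c", "copy", "-map", "0",
                      "-f", "segment", "-segment_time", str(seg_seconds),
                      "-reset_timestamps", "1", "seg_%05d.mp4"], check=True)
      return sorted(glob.glob("seg_*.mp4"))

  def merge(encoded_segments: list[str], output: str) -> None:
      with open("concat.txt", "w") as f:
          f.writelines(f"file '{p}'\n" for p in encoded_segments)
      subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                      "-i", "concat.txt", "-c", "copy", output], check=True)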

Spot instance management is critical for cost efficiency. The fleet uses a diversified instance strategy across 10+ instance types and 6+ AZs to minimize simultaneous interruptions. A capacity planner monitors spot pricing and availability, shifting workload to the cheapest available instance types. During spot shortages, critical jobs (P0) fall back to on-demand instances while P2 jobs are delayed. Auto-scaling targets an 80-90% worker utilization band: below 80%, scale in; above 90%, scale out.
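
The utilization band translates into a simple hysteresis rule; a sketch (the 20%/10% step sizes are assumptions):

  # Scale out above 90% utilization, scale in below 80%, hold inside the band.
  def desired_fleet_size(current_workers: int, busy_workers: int) -> int:
      utilization = busy_workers / max(current_workers, 1)
      if utilization > 0.90:
          return int(current_workers * 1.2)              # assumed +20% step
      if utilization < 0.80:
          return max(int(current_workers * 0.9), 1)      # assumed -10% step
      return current_workers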

Key Trade-offs

  • Segment-parallel vs whole-file encoding: Parallel encoding reduces latency from minutes to seconds but introduces segment boundary artifacts and slightly worse compression efficiency (2-3% bitrate overhead) — mitigated by overlapping segments with trim
  • GPU (NVENC) vs CPU (x264) encoding: GPU is 10-15x faster but produces 15-20% larger files at equivalent quality — use GPU for live/real-time, CPU for archival/VOD where quality per bit matters more (selection sketch after this list)
  • Spot instances vs on-demand: 70% cost savings with spot, but interruptions require robust checkpointing and requeue logic — the engineering complexity is justified at scale
  • VMAF quality gate vs skip quality check: VMAF analysis adds 30 seconds per rendition but catches encoding failures and ensures consistent quality — essential for premium content, optional for user-generated content
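
Following the GPU-vs-CPU trade-off above, a sketch of how a rendition might choose its encoder (the mapping and fallback rule are illustrative; the encoder names are standard FFmpeg encoders):

  # NVENC for latency-sensitive work, software encoders where quality per bit matters.
  def pick_encoder(priority: str, workload: str, codec: str) -> str:
      gpu = {"h264": "h264_nvenc", "hevc": "hevc_nvenc"}
      cpu = {"h264": "libx264", "hevc": "libx265", "vp9": "libvpx-vp9"}
      if workload == "live" or priority == "P0":
          return gpu.get(codec, cpu[codec])    # NVENC has no VP9 encoder, so fall back to CPU
      return cpu[codec]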
