System Design: Online Learning Platform (Coursera-scale)
Design a scalable online learning platform like Coursera that supports millions of learners, video streaming, assessments, and certificates. Covers video CDN delivery, course catalog architecture, and progress tracking at scale.
Requirements
Functional Requirements:
- Users can browse, enroll in, and complete courses with video lessons, quizzes, and assignments
- Instructors can upload course content including video lectures, PDFs, and interactive exercises
- Platform supports graded assessments with automated and peer-reviewed submissions
- Learners receive verifiable certificates upon course completion
- Platform provides discussion forums, Q&A threads, and peer interaction per course
- Progress is tracked granularly: lesson completion, quiz scores, assignment grades
Non-Functional Requirements:
- Support 50 million registered users with 5 million daily active learners
- Video streaming must handle 500k concurrent streams with <2s start latency
- 99.9% availability; course content must be accessible even during partial outages
- Video uploads processed and available within 15 minutes of instructor submission
- Search across 100k+ courses returns results in under 200ms
Scale Estimation
With 50M registered users and 5M DAU, assume each active learner watches ~45 minutes of video per day at 720p (2 Mbps). That works out to ~0.7 GB of video per learner per day (2,700 s × 2 Mbps ÷ 8 bits/byte), totaling ~3.4 PB/day across the platform, served almost entirely from CDN edge nodes. The course catalog stores ~100k courses averaging 30 hours of content each; assuming instructor masters at ~20 Mbps, that is roughly 27 PB of raw video storage before transcoding into multiple resolutions. Quiz submissions run at ~2M/day, and certificate generation at ~50k/day.
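The streaming estimate above can be checked with a few lines of arithmetic. The constants are the assumptions stated in the text (5M DAU, 45 minutes/day, 2 Mbps), not measurements from a real deployment:

```python
# Back-of-envelope check of the per-learner and platform-wide video volume.
DAU = 5_000_000
WATCH_SECONDS = 45 * 60      # 45 minutes of video per active learner per day
BITRATE_MBPS = 2             # 720p HLS rendition

bits_per_learner = WATCH_SECONDS * BITRATE_MBPS * 1_000_000
gb_per_learner = bits_per_learner / 8 / 1e9     # bits -> bytes -> GB
pb_per_day = gb_per_learner * DAU / 1e6         # GB -> PB across all learners

print(f"{gb_per_learner:.3f} GB/learner/day, {pb_per_day:.2f} PB/day platform-wide")
```

Running this gives ~0.68 GB per learner and ~3.4 PB/day, which is why CDN offload dominates the cost model.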
High-Level Architecture
The platform separates into three primary planes: the content delivery plane, the learning management plane, and the assessment plane. The content delivery plane handles all media — video upload, transcoding, and CDN distribution. Instructors upload raw video to an S3-compatible object store; an async transcoding pipeline (using FFmpeg workers on a job queue) produces HLS streams at 360p, 720p, and 1080p. Transcoded segments are pushed to a CDN (CloudFront or Fastly) with edge PoPs in all major regions. Learners receive adaptive bitrate (ABR) streams via HLS, so quality adjusts dynamically to their connection.
The learning management plane handles enrollment, progress, and the course catalog. A relational database (PostgreSQL) stores course metadata, enrollment records, and structured progress data. A Redis cluster caches hot course metadata and learner session state. The course catalog is indexed in Elasticsearch, enabling full-text search with filters on category, level, language, and instructor rating. An event-driven architecture (Kafka) captures learner activity events — video plays, pauses, quiz starts — which feed both real-time progress updates and offline analytics pipelines.
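The catalog search described above maps naturally onto an Elasticsearch bool query: a full-text clause scored against title and description, with category/level filters that don't affect scoring. A minimal sketch of the request body builder follows; the field names (`title`, `description`, `category`, `level`, `instructor_rating`) are illustrative assumptions about the index mapping:

```python
def build_catalog_query(text, category=None, level=None, page=0, size=20):
    """Build an Elasticsearch query dict: full-text match plus optional
    term filters, paginated via from/size."""
    filters = []
    if category:
        filters.append({"term": {"category": category}})
    if level:
        filters.append({"term": {"level": level}})
    return {
        "from": page * size,
        "size": size,
        "query": {
            "bool": {
                # boost title matches over description matches
                "must": [{"multi_match": {
                    "query": text,
                    "fields": ["title^3", "description"]}}],
                "filter": filters,   # filters are cached and unscored
            }
        },
        "sort": [{"_score": "desc"}, {"instructor_rating": "desc"}],
    }
```

Putting filters in the `filter` clause rather than `must` lets Elasticsearch cache them and skip scoring, which matters at the 200ms latency budget.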
The assessment plane runs quizzes and assignment grading in isolated environments. Multiple-choice quizzes are evaluated instantly by the API layer. Programming assignments are routed to a sandboxed code execution service (similar to a judge system) running in gVisor containers. Peer-reviewed assignments use a matching algorithm to assign reviewers from the same cohort, with a rubric-enforced grading UI.
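The reviewer-matching step for peer-graded assignments can be sketched with a simple round-robin scheme over the cohort: each submission is reviewed by the next N authors in a fixed ordering, which guarantees no self-review and an even review load. This is a minimal illustration, not the platform's actual matching algorithm:

```python
def assign_reviewers(submissions, reviews_per_submission=3):
    """Round-robin reviewer assignment within a cohort: each submission
    gets N distinct reviewers, never including its own author."""
    authors = [s["author"] for s in submissions]
    if len(authors) <= reviews_per_submission:
        raise ValueError("cohort too small for the requested review count")
    assignments = {}
    for i, sub in enumerate(submissions):
        # walk forward from the author's position; offsets 1..N can
        # never wrap back onto the author, and are all distinct
        reviewers = [authors[(i + off) % len(authors)]
                     for off in range(1, reviews_per_submission + 1)]
        assignments[sub["author"]] = reviewers
    return assignments
```

Each author also ends up assigned exactly N reviews to perform, so review effort is balanced across the cohort.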
Core Components
Video Transcoding Service
When an instructor uploads a raw video, the upload service writes it to S3 and publishes a transcoding job to a Kafka topic. A fleet of transcoding workers (horizontally scaled EC2 spot instances) pick up jobs, run FFmpeg to produce HLS segments at multiple bitrates, and store segments back to S3 under a CDN-mapped prefix. A DynamoDB table tracks job status. On completion, the course service is notified via a callback, marking the lesson as publishable. Transcoding workers auto-scale based on queue depth, handling burst uploads (e.g., when Coursera runs a content partner onboarding sprint).
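A transcoding worker's core job is building one FFmpeg invocation per rendition in the ladder. The sketch below shows the command construction only (the rendition bitrates and 6-second segment length are illustrative choices, not values from the text):

```python
# Assumed rendition ladder matching the 360p/720p/1080p tiers above.
RENDITIONS = [
    ("360p",  "640x360",   "800k"),
    ("720p",  "1280x720",  "2000k"),
    ("1080p", "1920x1080", "5000k"),
]

def hls_command(src, out_dir, name, resolution, bitrate):
    """Build one ffmpeg argv producing an HLS VOD rendition."""
    width, height = resolution.split("x")
    return [
        "ffmpeg", "-i", src,
        "-vf", f"scale={width}:{height}",
        "-c:v", "libx264", "-b:v", bitrate,
        "-c:a", "aac", "-b:a", "128k",
        "-hls_time", "6",                  # 6-second segments
        "-hls_playlist_type", "vod",       # full playlist, no live window
        f"{out_dir}/{name}/index.m3u8",
    ]
```

A worker would run one such command per rendition (or a single command with multiple outputs), then write a master playlist referencing all three variant playlists for ABR switching.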
Progress Tracking Service
Progress events (lesson watched, quiz submitted, assignment graded) are written to Kafka by the frontend via a lightweight event API. A stream processor (Flink or Kafka Streams) aggregates events in real time to compute per-learner completion percentages and trigger milestone notifications (e.g., "50% through the course"). Aggregated progress is persisted to PostgreSQL for authoritative state and to Redis for low-latency reads on the learner dashboard. Certificate generation is triggered automatically when completion reaches 100% and all required assessments pass.
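The milestone-triggering logic in the stream processor reduces to comparing the completion percentage before and after each aggregated update. A minimal sketch, with an assumed milestone set of 25/50/75/100:

```python
MILESTONES = (25, 50, 75, 100)

def completion_pct(completed_lessons, total_lessons):
    """Integer completion percentage for a learner's enrollment."""
    return 100 * completed_lessons // total_lessons

def crossed_milestones(prev_pct, new_pct):
    """Milestones passed when progress moves from prev_pct to new_pct.
    Strictly-greater-than on the lower bound ensures each milestone
    fires exactly once per enrollment."""
    return [m for m in MILESTONES if prev_pct < m <= new_pct]
```

The stream processor would call `crossed_milestones` on each aggregated update and emit one notification event per returned milestone; the 100% case is what triggers the certificate check.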
Certificate Service
Certificates are generated as signed PDFs with a unique verification UUID. The service uses a PDF rendering engine (Puppeteer or WeasyPrint) with a templated layout, embedding the learner's name, course title, completion date, and a QR code linking to a public verification endpoint. The verification endpoint is a simple stateless lookup against a PostgreSQL table — no authentication required, allowing employers to verify certificates without an account. Certificates are stored in S3 and linked from the learner's profile.
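The issue-then-verify flow can be sketched with an HMAC over the certificate payload, keyed by a platform secret. This is a simplified stand-in for whatever signing scheme the PDF pipeline actually uses; the hardcoded key is purely for illustration (production would hold it in a KMS/HSM):

```python
import hashlib
import hmac
import uuid

SIGNING_KEY = b"demo-key"  # illustration only; never hardcode in production

def issue_certificate(learner, course, completed_on):
    """Create a certificate record with a verification UUID and an
    HMAC-SHA256 signature over its payload."""
    cert_id = str(uuid.uuid4())
    payload = f"{cert_id}|{learner}|{course}|{completed_on}"
    sig = hmac.new(SIGNING_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return {"certificate_id": cert_id, "payload": payload, "signature": sig}

def verify_certificate(cert):
    """Recompute the HMAC and compare in constant time, as the public
    verification endpoint would."""
    expected = hmac.new(SIGNING_KEY, cert["payload"].encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, cert["signature"])
```

The verification endpoint only needs the stored payload and signature, so it stays a stateless lookup plus one hash computation, cheap enough to leave unauthenticated.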
Database Design
Courses, enrollments, and progress use PostgreSQL with the following core tables: courses (course_id, instructor_id, title, description, status), lessons (lesson_id, course_id, order_index, video_url, duration_seconds), enrollments (enrollment_id, user_id, course_id, enrolled_at, completed_at), progress (progress_id, enrollment_id, lesson_id, watched_seconds, completed, last_watched_at), and quiz_attempts (attempt_id, user_id, quiz_id, score, submitted_at). The progress table is the hottest table — indexed on (enrollment_id, lesson_id) with partial indexes for incomplete lessons. For video metadata and CDN mapping, a DynamoDB table offers sub-millisecond reads at scale without schema constraints.
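The hot-path write against the progress table is an upsert keyed on (enrollment_id, lesson_id). The sketch below uses SQLite (stdlib) purely so the example is self-contained; the same ON CONFLICT pattern works in PostgreSQL. Taking the max of old and new values makes replayed or out-of-order events harmless:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE progress (
    enrollment_id   INTEGER NOT NULL,
    lesson_id       INTEGER NOT NULL,
    watched_seconds INTEGER NOT NULL DEFAULT 0,
    completed       INTEGER NOT NULL DEFAULT 0,
    UNIQUE (enrollment_id, lesson_id)   -- backs the upsert and hot reads
);
""")

def record_progress(enrollment_id, lesson_id, watched_seconds, completed):
    """Idempotent upsert: progress only moves forward, so stale or
    duplicated events can never regress a learner's state."""
    conn.execute("""
        INSERT INTO progress (enrollment_id, lesson_id, watched_seconds, completed)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (enrollment_id, lesson_id) DO UPDATE SET
            watched_seconds = MAX(watched_seconds, excluded.watched_seconds),
            completed       = MAX(completed, excluded.completed)
    """, (enrollment_id, lesson_id, watched_seconds, completed))
```

Because the event pipeline delivers at-least-once, this monotonic upsert is what keeps duplicate Kafka deliveries from corrupting progress state.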
API Design
- GET /courses?query={q}&category={cat}&level={level}&page={n} — searches the course catalog via Elasticsearch, returns paginated results with rating and enrollment count
- POST /enrollments — body: {course_id}; creates an enrollment record, returns enrollment_id; idempotent on duplicates
- PUT /progress/{enrollment_id}/lessons/{lesson_id} — body: {watched_seconds, completed}; upserts the progress record, triggers a milestone check asynchronously
- POST /assessments/{quiz_id}/submit — body: {answers: [...]}; evaluates MCQs instantly or queues a code submission, returns a score or job_id
- GET /certificates/{certificate_id}/verify — public endpoint, returns certificate metadata and validity status
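The idempotency requirement on POST /enrollments can be illustrated with a handler keyed on (user_id, course_id): a repeated request returns the existing enrollment instead of creating a duplicate. The in-memory store and ID format below are stand-ins for the PostgreSQL unique constraint that would enforce this in practice:

```python
def enroll(store, user_id, course_id):
    """Idempotent enrollment: repeated POSTs for the same (user, course)
    return the existing enrollment_id rather than creating a duplicate.
    Returns (enrollment_id, created) so the API layer can answer
    201 Created vs 200 OK."""
    key = (user_id, course_id)
    if key in store:
        return store[key], False
    enrollment_id = f"enr-{len(store) + 1}"   # illustrative ID scheme
    store[key] = enrollment_id
    return enrollment_id, True
```

In the real schema this is a UNIQUE (user_id, course_id) constraint on the enrollments table with an ON CONFLICT DO NOTHING insert, so retries from flaky clients are safe by construction.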
Scaling & Bottlenecks
Video streaming is the dominant scaling concern. At 500k concurrent streams, even with CDN offload, origin shield and CDN cache warm-up strategies are critical. Popular courses (top 1% account for ~60% of views) should be pre-warmed to edge nodes proactively. For long-tail courses with cold CDN cache, an origin shield layer in each region prevents cache-miss storms from overwhelming the S3 origin. The progress tracking write path is also high-volume: 5M DAU each generating dozens of events per day works out to roughly 1,700 writes/second on average and ~10k/second at peak. Using Kafka as the ingestion buffer with async batch writes to PostgreSQL (via COPY) keeps the DB write load manageable.
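The Kafka-to-COPY ingestion pattern boils down to buffering events and flushing them in fixed-size batches. A minimal sketch of that buffering layer follows; the flush callback stands in for a COPY (or multi-row INSERT) against PostgreSQL, and the batch size of 500 is an assumption:

```python
class BatchWriter:
    """Buffer incoming progress events and flush them in batches,
    turning ~10k row-at-a-time writes/second into a few bulk
    writes/second against the database."""

    def __init__(self, flush_fn, max_batch=500):
        self.flush_fn = flush_fn      # e.g. wraps a COPY into PostgreSQL
        self.max_batch = max_batch
        self.buffer = []

    def add(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.max_batch:
            self.flush()

    def flush(self):
        """Flush whatever is buffered; also called on a timer so quiet
        periods don't strand events."""
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []
```

A real consumer would also flush on a time threshold (say every 500 ms) and commit Kafka offsets only after a successful flush, giving at-least-once delivery into the database.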
The course search Elasticsearch cluster must be sized to handle 5M DAU generating search queries at ~500 QPS peak. Index sharding by category and language allows horizontal scaling without cross-shard fan-out on most queries. Quiz evaluation for programming assignments is the hardest scaling challenge: sandbox containers have high CPU overhead, so a dedicated auto-scaling pool of gVisor sandbox workers with a max timeout (10s per submission) is essential to prevent resource exhaustion.
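The per-submission timeout on the sandbox workers can be illustrated with a wall-clock budget around the execution call. The sketch uses a thread pool purely to demonstrate the timeout API; a real judge kills the gVisor container outright, since a Python thread cannot be forcibly stopped. The 0.2 s budget here is scaled down from the 10 s production limit so the example runs quickly:

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FuturesTimeout

EXECUTION_TIMEOUT = 0.2  # seconds; the text suggests 10 s in production

def run_with_timeout(fn, *args):
    """Enforce a wall-clock budget on one submission. Returns
    ("ok", result) or ("timeout", None)."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args)
    try:
        return ("ok", future.result(timeout=EXECUTION_TIMEOUT))
    except FuturesTimeout:
        return ("timeout", None)
    finally:
        # don't block the grading loop waiting on a runaway submission
        pool.shutdown(wait=False)
```

Pairing this with a bounded worker pool caps total CPU spent on grading: a pathological submission costs at most one timeout slice, never a wedged worker.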
Key Trade-offs
- Async transcoding vs. instant availability: Choosing async transcoding with a 15-minute SLA means instructors cannot preview their content immediately after upload, but synchronous transcoding would block the upload request and create a single point of failure in the upload flow.
- Relational vs. document store for progress: PostgreSQL provides ACID guarantees for enrollment and certificate data but requires careful sharding at scale; a document store (MongoDB) would simplify the progress schema but complicate transactional certificate issuance.
- Peer review consistency: Peer grading introduces subjectivity; enforcing rubrics and requiring 3 independent reviews before finalizing a grade improves consistency at the cost of latency (days instead of seconds for assignment feedback).
- CDN cost vs. origin load: Aggressive CDN caching with long TTLs reduces origin load and egress cost but makes instructor content updates slower to propagate; a cache invalidation API on publish events balances this trade-off.