SYSTEM_DESIGN
System Design: Learning Management System (LMS)
Design a scalable Learning Management System (LMS) like Canvas or Moodle that manages course enrollment, content delivery, grading workflows, and institutional reporting for universities and enterprises. Covers multi-tenancy, gradebook architecture, and SCORM compliance.
Requirements
Functional Requirements:
- Institutions (universities, companies) manage courses, rosters, and curricula within isolated tenant environments
- Instructors create and organize content modules: readings, videos, assignments, discussions, and quizzes
- Grading workflows: assignment submission, rubric-based grading, grade override, and gradebook export
- SCORM and xAPI compliance for importing third-party eLearning content packages
- Calendar integration, assignment due-date reminders, and announcement broadcasting
- Reporting dashboards for administrators: course completion rates, at-risk student identification, time-on-task
Non-Functional Requirements:
- Multi-tenant architecture with data isolation between institutions
- Support 10,000 institutions, each with up to 100,000 users
- Gradebook queries (fetch all grades for a course of 500 students) must return in under 500ms
- SCORM content packages up to 2 GB must be processable
- FERPA/GDPR compliance: audit logging of all data access to student records
Scale Estimation
With 10,000 institutions and an average of 10,000 active users each, the platform serves 100M registered users. At any given time, 5% are active (5M concurrent), each generating 1-2 API requests per minute. That's ~100k requests/second. The gradebook is the most query-intensive component: a professor loading grades for a 500-student course with 20 assignments triggers a 10,000-cell read. With 10k such queries/hour at peak, gradebook read throughput is ~28M grade cells/hour. Content storage: average 500 MB of course content per course, 1M active courses = 500 TB of content to store and serve.
High-Level Architecture
The LMS uses a multi-tenant architecture with tenant-per-schema isolation in PostgreSQL. Each institution gets its own PostgreSQL schema (namespace), providing logical isolation without the overhead of per-tenant database clusters. A schema-routing middleware layer maps incoming requests (authenticated with JWT containing tenant_id) to the correct schema. This approach scales to 10k tenants on a shared PostgreSQL cluster with connection pooling (PgBouncer) and allows tenant-specific schema migrations without affecting other tenants.
Content delivery is handled separately from the transactional LMS database. Course files, videos, and SCORM packages are stored in S3, organized by tenant prefix. A content delivery service generates pre-signed S3 URLs (valid for 1 hour) for authorized requests, avoiding the need to proxy large files through the application tier. SCORM packages are unpacked server-side into a virtual filesystem structure in S3 and served via a SCORM runtime JavaScript SDK in the browser, which communicates learner state back to an xAPI endpoint (Learning Record Store).
The reporting and analytics plane is completely separated from the operational LMS. An ETL pipeline (Airflow + dbt) copies transactional data to a Redshift data warehouse nightly. Administrators query pre-computed materialized views for completion rates and at-risk reports rather than running expensive aggregations against the operational PostgreSQL. Real-time at-risk alerts (students who haven't logged in for 7 days, grade below 60%) are generated by a scheduled Kafka Streams job.
Core Components
Gradebook Service
The gradebook is a specialized service separate from the general content API. It manages a sparse matrix of (student × assignment) grade cells. Internally, grades are stored in a column-family structure: a PostgreSQL table with columns for each assignment is impractical for dynamic assignment counts, so grades are stored as rows in an assignments_grades table with (course_id, user_id, assignment_id, score, graded_at, graded_by). For the common query pattern (fetch all grades for a course), a materialized view pivots rows to columns and is refreshed on each grade change via a PostgreSQL trigger. Grade exports to CSV and SIS (Banner, PowerSchool) use a background job to avoid blocking the API.
SCORM/xAPI Compliance Layer
SCORM 1.2 and 2004 packages are uploaded as ZIP archives, validated against the SCORM manifest schema (imsmanifest.xml), and unpacked into S3. The SCORM runtime in the browser communicates via the SCORM JavaScript API (LMSInitialize, LMSSetValue, LMSCommit). The LMS provides a server-side SCORM API endpoint that receives these calls, stores learner state (completion status, score, suspend data) in PostgreSQL, and translates them to xAPI statements for the Learning Record Store. xAPI statements are stored in a dedicated LRS (Learning Locker or SCORM Cloud) that provides a standardized query API.
Notification and Announcement Service
Instructors publish announcements to courses, which fan out to all enrolled students. With 10k students in a large university course, a single announcement triggers 10k notifications. The notification service uses a fan-out-on-write model: on announcement publish, a Kafka topic receives the event; a consumer expands the enrollment list and enqueues individual email/push notifications in batches. Email delivery uses SES with per-tenant sending domains. Push notifications use FCM/APNs via a mobile push gateway. Users control notification preferences (email, push, in-app) which are respected during fan-out.
Database Design
PostgreSQL with schema-per-tenant. Within each tenant schema, core tables: users (user_id, email, role, sso_external_id), courses (course_id, term_id, title, status), enrollments (enrollment_id, user_id, course_id, role, section_id), assignments (assignment_id, course_id, title, due_at, points_possible, submission_type), submissions (submission_id, assignment_id, user_id, submitted_at, file_urls[], body_text), grades (grade_id, submission_id, score, grader_id, graded_at, rubric_data_json). A global tenants table in the public schema maps tenant slugs to schema names and configuration. PgBouncer pools connections per schema. Audit logs (FERPA compliance) are written to an immutable append-only table with row-level security preventing deletes.
API Design
GET /courses/{course_id}/gradebook— returns full grade matrix for the course (all students × all assignments); served from materialized view, cached in Redis per instructor sessionPOST /assignments/{assignment_id}/submissions— multipart upload for file submissions; stores metadata in PostgreSQL, file in S3, triggers plagiarism check asyncPUT /grades/{grade_id}— body:{score, rubric_data, comment}, updates grade, invalidates gradebook cache, records audit log entryPOST /courses/{course_id}/announcements— body:{title, body, notify_by: ["email","push"]}, publishes announcement, triggers async fan-out
Scaling & Bottlenecks
The schema-per-tenant model works up to ~10k tenants on a single PostgreSQL cluster because PostgreSQL's planner handles schema routing efficiently with connection-level search_path setting. Beyond 10k tenants, shard tenants across multiple PostgreSQL clusters by institution size tier (small, medium, large). Connection pooling is critical: at 5M concurrent users × 1 connection per request, a naive approach would require 5M database connections. PgBouncer in transaction-pool mode reduces this to ~500 actual DB connections per shard, with requests waiting in a pool queue (median wait <1ms at normal load).
Gradebook materialized view refresh is a bottleneck for large courses: a course with 10k students and 50 assignments has 500k cells; refreshing the entire view on each grade change is too expensive. Use incremental refresh: maintain a Redis-based dirty set of (course_id) values needing refresh; a background job refreshes dirty courses in batches every 30 seconds. Instructors loading the gradebook see at-most-30-second-stale data, which is acceptable for the grading use case.
Key Trade-offs
- Schema-per-tenant vs. row-level security: Schema isolation is simpler to reason about and prevents cross-tenant data leaks from query bugs, but complicates cross-tenant analytics and increases schema migration complexity; RLS is more flexible but requires rigorous policy testing.
- Materialized view vs. on-demand aggregation for gradebook: Materialized views give sub-second reads but add write overhead and stale-data risk; on-demand aggregation always returns current data but is too slow for 10k-student courses.
- SCORM server-side proxy vs. client-direct S3: Proxying SCORM content through the server enables access control and audit logging but adds egress cost and latency; pre-signed S3 URLs reduce server load but bypass server-side audit logging.
- Multi-tenant cluster vs. dedicated clusters per enterprise: Shared clusters reduce operational overhead and cost for small institutions; large enterprise customers often require dedicated clusters for compliance (data residency, SOC 2 audit scope) and SLA guarantees.
GO DEEPER
Master this topic in our 12-week cohort
Our Advanced System Design cohort covers this and 11 other deep-dive topics with live sessions, assignments, and expert feedback.