System Design: Online Coding Judge (LeetCode-style)
Design a scalable online coding judge system like LeetCode that safely executes user-submitted code in multiple languages, evaluates correctness against test cases, and returns results in real time. Covers sandboxed execution, plagiarism detection, and contest mode.
Requirements
Functional Requirements:
- Users submit code solutions in 15+ programming languages (Python, Java, C++, Go, etc.) for algorithmic problems
- System executes submissions against hidden test cases and returns pass/fail with runtime and memory metrics
- Problems have difficulty ratings, editorial solutions, and discussion threads
- Contest mode supports timed competitions with real-time leaderboard updates
- Plagiarism detection flags suspiciously similar submissions across users
- Users can run code against custom inputs before final submission
Non-Functional Requirements:
- Execution sandbox must be fully isolated — no network access, limited CPU and memory per run
- Judge latency: 95th percentile result within 5 seconds for standard problems
- Support 100k concurrent users with 10k simultaneous code executions during peak contests
- Zero tolerance for sandbox escape — kernel-level isolation required
- 99.95% availability for the judge service
Scale Estimation
At 100k concurrent users during a major contest, assume 30% are actively submitting — that's 30k submissions in a burst window. Each submission runs against 50-200 test cases sequentially, taking 0.1-2 seconds each, so a single judge worker can handle ~5 submissions/minute. Draining 30k submissions in under 10 minutes means ~3,000 submissions/minute, which is ~600 workers at that rate; provisioning ~1,000 leaves headroom for retries and slow problems. The worker fleet must auto-scale from a baseline of 200 (off-contest) to 2,000 (peak contest). Storage for submissions is trivial: at ~1 KB of code per submission and 500k submissions/day, that's roughly 500 MB/day, small compared to test case I/O data.
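A quick sketch of the worker-sizing arithmetic above, using the assumptions stated in this section (the 1.5x headroom factor is illustrative, not a measured value):

```python
# Back-of-envelope sizing for the judge worker fleet (assumptions from this section).
burst_submissions = 30_000   # 30% of 100k concurrent users submitting in a burst
drain_window_min = 10        # want the burst judged within 10 minutes
per_worker_rate = 5          # submissions/minute a single worker can judge

throughput_needed = burst_submissions / drain_window_min     # 3,000 submissions/min
workers_needed = throughput_needed / per_worker_rate          # 600 workers
workers_provisioned = int(workers_needed * 1.5)               # ~900-1,000 with headroom

print(f"need ~{workers_needed:.0f} workers, provision ~{workers_provisioned}")
```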
High-Level Architecture
The system separates into three planes: the problem and user plane, the execution plane, and the analytics plane. The problem and user plane is a standard web application — users browse problems, view editorial solutions, and submit code via a REST API backed by PostgreSQL and Redis. The execution plane is the core differentiator: submissions are placed on a priority queue (Kafka or SQS), pulled by judge workers running on isolated VMs, executed in gVisor or nsjail sandboxes, and results are written back to the database and pushed to the user via WebSocket.
The execution plane is intentionally decoupled from the web plane. Judge workers run on dedicated EC2 instances (not shared with the web tier) with strict security group rules — no outbound internet, no cross-worker communication. Each worker manages a pool of sandbox containers pre-warmed for each supported language runtime. Pre-warming eliminates JVM/Python interpreter startup latency from the critical path. Workers pull jobs from a queue, execute, and ACK on completion. A dead-letter queue handles timed-out or crashed executions.
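A minimal sketch of the pull-execute-ACK loop described above, assuming SQS as the queue. The queue URL, message shape, and run_in_sandbox helper are illustrative placeholders, not part of any real deployment:

```python
import json
import boto3

# Minimal judge-worker loop: pull a job, execute it, and ACK (delete) only on
# success, so crashed or timed-out executions are redelivered and eventually
# land in the dead-letter queue.
sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/judge-jobs"  # hypothetical

def run_in_sandbox(job: dict) -> dict:
    """Compile and run the submission inside nsjail/gVisor (omitted here)."""
    return {"submission_id": job["submission_id"], "status": "Accepted"}  # placeholder verdict

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        job = json.loads(msg["Body"])     # e.g. {submission_id, problem_id, language, code_s3_key}
        verdict = run_in_sandbox(job)     # a crash here means no delete, so the message is redelivered
        # write verdict to the database / push to the user over WebSocket (omitted)
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```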
Contest mode adds a real-time leaderboard component. Submission results are published to a Redis Sorted Set (keyed by contest_id, scored by penalty time per ICPC rules). A WebSocket server pushes leaderboard deltas to connected clients at 1-second intervals. During peak contests, the WebSocket tier scales horizontally with sticky sessions backed by Redis pub/sub for cross-instance fan-out.
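A sketch of the leaderboard writes and reads this implies, using redis-py Sorted Set commands. The key layout and the penalty-accumulation scheme are illustrative assumptions:

```python
import redis

r = redis.Redis()

def record_accepted(contest_id: str, user_id: str, penalty_ms: int) -> None:
    """Add ICPC-style penalty to a contestant's total; lower total penalty ranks higher."""
    key = f"contest:{contest_id}:leaderboard"          # hypothetical key layout
    r.zincrby(key, penalty_ms, user_id)                # accumulate penalty time as the score
    r.publish(f"contest:{contest_id}:updates",         # fan out the delta to WebSocket instances
              f"{user_id}:{penalty_ms}")

def top_n(contest_id: str, n: int = 50):
    """Read the top-N participants (lowest penalty first) for the leaderboard endpoint."""
    key = f"contest:{contest_id}:leaderboard"
    return r.zrange(key, 0, n - 1, withscores=True)
```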
Core Components
Sandbox Execution Engine
Each code submission is executed inside an nsjail or gVisor sandbox that enforces: no network syscalls (seccomp filter blocks socket(), connect()), memory limit (256 MB default, configurable per problem), CPU time limit (1-10 seconds), and a chroot filesystem with only the language runtime and standard library. The sandbox starts fresh for each submission — no shared state between runs. For interpreted languages (Python), a pre-forked interpreter process is reused across submissions within the same worker to amortize startup cost, with memory state reset between runs using a lightweight checkpoint/restore mechanism.
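A simplified sketch of the per-run CPU and memory caps, using only Python's resource module. This illustrates the limits, not the isolation: the no-network seccomp filter, chroot, and namespace setup are delegated to nsjail or gVisor in a real judge, and the binary and input paths are hypothetical:

```python
import resource
import subprocess

def run_limited(binary: str, stdin_path: str, cpu_seconds: int = 2, mem_mb: int = 256):
    """Run a compiled submission with CPU-time and address-space limits applied in the child."""
    def apply_limits():
        # Hard caps: SIGXCPU / allocation failure once the limits are hit.
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        resource.setrlimit(resource.RLIMIT_AS, (mem_mb * 1024 * 1024,) * 2)

    with open(stdin_path, "rb") as stdin:
        return subprocess.run(
            [binary],
            stdin=stdin,
            capture_output=True,
            timeout=cpu_seconds + 1,    # wall-clock guard on top of the CPU limit
            preexec_fn=apply_limits,    # runs in the child process before exec
        )
```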
Test Case Evaluation Engine
Test cases are stored in S3 (input/output file pairs) and cached on judge worker local NVMe storage. The evaluation engine runs the user's compiled binary against each test case, comparing stdout against the expected output with configurable comparison modes: exact match, whitespace-normalized, or custom checker (for problems with multiple valid outputs, e.g., graph problems). Runtime and peak memory are measured via Linux cgroups v2. Results are aggregated: "Accepted", "Wrong Answer" (with the failing test case index), "Time Limit Exceeded", "Memory Limit Exceeded", "Runtime Error", or "Compilation Error".
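A small sketch of the three comparison modes named above; the checker callback interface for multiple-valid-answer problems is an assumption:

```python
from typing import Callable, Optional

def outputs_match(actual: str, expected: str, mode: str = "exact",
                  checker: Optional[Callable[[str, str], bool]] = None) -> bool:
    """Compare a submission's stdout against the expected output.

    mode: "exact"   - byte-for-byte after stripping trailing newlines
          "tokens"  - whitespace-normalized comparison
          "checker" - problem-specific validator for problems with multiple valid outputs
    """
    if mode == "exact":
        return actual.rstrip("\n") == expected.rstrip("\n")
    if mode == "tokens":
        return actual.split() == expected.split()
    if mode == "checker" and checker is not None:
        return checker(actual, expected)
    raise ValueError(f"unknown comparison mode: {mode}")
```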
Plagiarism Detection Service
The plagiarism detection service runs asynchronously after each submission is accepted. It uses a token-based similarity algorithm (similar to MOSS — Measure Of Software Similarity) that normalizes code by removing variable names and formatting, then computes pairwise Jaccard similarity on sliding windows of token n-grams. Submissions above a configurable threshold (e.g., 80% similarity) are flagged for manual review. The service processes submissions in nightly batches for efficiency, storing fingerprints (MinHash signatures) in a columnar store (ClickHouse) for fast similarity queries across millions of past submissions.
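A crude sketch of the fingerprinting idea (identifier normalization, token n-grams, MinHash signatures whose slot-match rate approximates Jaccard similarity). Real MOSS-style normalization is language-aware; the regex tokenizer and signature length here are illustrative only:

```python
import hashlib
import re

NUM_HASHES = 64  # MinHash signature length (illustrative)

def normalize(code: str) -> list[str]:
    """Crude normalization: drop comments, collapse every identifier to a single token."""
    code = re.sub(r"#.*", "", code)                      # strip Python-style comments
    tokens = re.findall(r"[A-Za-z_]\w*|\S", code)
    return ["ID" if re.match(r"[A-Za-z_]", t) else t for t in tokens]

def ngrams(tokens: list[str], n: int = 5) -> set[str]:
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def minhash(grams: set[str]) -> list[int]:
    """One salted hash family per slot; keep the minimum hash value per slot."""
    return [min(int(hashlib.sha1(f"{salt}:{g}".encode()).hexdigest(), 16) for g in grams)
            for salt in range(NUM_HASHES)]

def similarity(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching slots estimates the Jaccard similarity of the n-gram sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```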
Database Design
PostgreSQL stores problems, users, and submission metadata. Key tables: problems (problem_id, title, difficulty, time_limit_ms, memory_limit_mb, test_case_count), submissions (submission_id, user_id, problem_id, language, code, status, runtime_ms, memory_kb, submitted_at), test_results (result_id, submission_id, test_case_index, status, runtime_ms), contest_participants (contest_id, user_id, score, penalty_ms). The submissions table grows fast — partition by month with pg_partman. The code column is stored as text; for large solutions consider S3 offload above 64 KB. Redis caches problem metadata, user session state, and contest leaderboards (Sorted Sets).
API Design
- POST /submissions — body: {problem_id, language, code}; enqueues submission, returns {submission_id, status: "queued"}; rate-limited to 5 submissions/minute per user
- GET /submissions/{submission_id} — returns submission status, result, runtime, memory; polled by frontend or received via WebSocket push
- POST /run — body: {language, code, stdin}; runs code against custom input in sandbox, returns stdout/stderr within 10s timeout; not persisted
- GET /problems/{problem_id} — returns problem statement, constraints, examples, tags; served from Redis cache with 1-hour TTL
- GET /contests/{contest_id}/leaderboard — returns top-N participants with scores; served from Redis Sorted Set
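A minimal sketch of the first endpoint in the list above, assuming FastAPI for the web tier and an SQS-backed judge queue. The queue URL, size cap, and rate-limit placement are illustrative assumptions:

```python
import json
import uuid

import boto3
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/judge-jobs"  # hypothetical

class Submission(BaseModel):
    problem_id: int
    language: str
    code: str

@app.post("/submissions")
def create_submission(body: Submission):
    # Rate limiting (5 submissions/minute/user) would be enforced here via a Redis counter (omitted).
    if len(body.code) > 64 * 1024:
        raise HTTPException(status_code=413, detail="solution too large")
    submission_id = str(uuid.uuid4())
    sqs.send_message(QueueUrl=QUEUE_URL,
                     MessageBody=json.dumps({"submission_id": submission_id, **body.dict()}))
    return {"submission_id": submission_id, "status": "queued"}
```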
Scaling & Bottlenecks
The judge worker fleet is the primary scaling constraint. Workers are stateful (pre-warmed language runtimes) and require dedicated CPU cores — they cannot be over-provisioned on shared VMs. During a LeetCode-scale weekly contest with 50k participants, the submission burst in the first 10 minutes can hit 5k submissions/minute. Pre-scaling the worker fleet 30 minutes before contest start (using scheduled auto-scaling) is more reliable than reactive scaling, which has a 3-5 minute warm-up lag. Spot instances reduce cost by 70% but require a fallback on-demand pool to avoid capacity gaps during AWS Spot reclamation events.
The database write path for submission results is a bottleneck at scale: 10k simultaneous executions, each producing per-test-case results, generate on the order of 10k writes/second. Batching result writes (workers accumulate 10 results before flushing to the DB) and using PostgreSQL's COPY for bulk inserts reduces lock contention. The WebSocket tier for contest leaderboards must handle 50k persistent connections — use a connection-aware load balancer (NLB at L4) with Redis pub/sub for cross-instance message delivery rather than HTTP/2 server-sent events, which struggle at this connection count.
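A sketch of the batching idea using psycopg2's copy_from against the test_results table from the schema above; the batch size and wiring are illustrative:

```python
import io
import psycopg2

BATCH_SIZE = 10  # flush after accumulating this many results, per the batching idea above

class ResultWriter:
    """Accumulate per-test-case results and bulk-load them with COPY instead of row-by-row INSERTs."""

    def __init__(self, dsn: str):
        self.conn = psycopg2.connect(dsn)
        self.buffer: list[tuple] = []

    def add(self, submission_id: int, test_case_index: int, status: str, runtime_ms: int):
        self.buffer.append((submission_id, test_case_index, status, runtime_ms))
        if len(self.buffer) >= BATCH_SIZE:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        data = io.StringIO("".join(f"{s}\t{i}\t{st}\t{r}\n" for s, i, st, r in self.buffer))
        with self.conn.cursor() as cur:
            cur.copy_from(data, "test_results",
                          columns=("submission_id", "test_case_index", "status", "runtime_ms"))
        self.conn.commit()
        self.buffer.clear()
```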
Key Trade-offs
- gVisor vs. nsjail: gVisor provides stronger syscall interception (user-space kernel) with ~15% CPU overhead per run; nsjail uses Linux namespaces with lower overhead but relies on seccomp policy correctness. For a public judge, gVisor's defense-in-depth justifies the performance cost.
- Pre-warm vs. cold-start containers: Pre-warming language runtimes reduces p95 latency from ~3s to ~200ms but consumes memory on idle workers; the trade-off favors pre-warming for the top 5 languages by submission volume.
- Synchronous vs. async result delivery: Polling with exponential backoff is simpler than WebSocket but adds latency; WebSocket push is used in contest mode where result latency directly affects leaderboard accuracy.
- Storing full code in DB vs. S3: Storing code in PostgreSQL simplifies queries (plagiarism detection, editorial comparison) but bloats the table; S3 offload with a metadata pointer is better above 10M submissions.