System Design: Online Coding Judge (LeetCode-style)
Design a scalable online coding judge system like LeetCode that safely executes user-submitted code in multiple languages, evaluates correctness against test cases, and returns results in real time. Covers sandboxed execution, plagiarism detection, and contest mode.
Requirements
Functional Requirements:
- Users submit code solutions in 15+ programming languages (Python, Java, C++, Go, etc.) for algorithmic problems
- System executes submissions against hidden test cases and returns pass/fail with runtime and memory metrics
- Problems have difficulty ratings, editorial solutions, and discussion threads
- Contest mode supports timed competitions with real-time leaderboard updates
- Plagiarism detection flags suspiciously similar submissions across users
- Users can run code against custom inputs before final submission
Non-Functional Requirements:
- Execution sandbox must be fully isolated — no network access, limited CPU and memory per run
- Judge latency: 95th percentile result within 5 seconds for standard problems
- Support 100k concurrent users with 10k simultaneous code executions during peak contests
- Zero tolerance for sandbox escape — kernel-level isolation required
- 99.95% availability for the judge service
Scale Estimation
At 100k concurrent users during a major contest, assume 30% are actively submitting — that's 30k submissions in a burst window. Each submission runs against 50-200 test cases sequentially, taking 0.1-2 seconds each, so a single judge worker can handle ~5 submissions/minute. Draining 30k submissions in under 10 minutes means ~3,000 submissions/minute, which is ~600 workers at that rate; provisioning ~1,000 leaves headroom for retries and slow problems. The worker fleet must auto-scale from a baseline of 200 (off-contest) to 2,000 (peak contest). Storage for submissions is trivial: at ~1 KB of code per submission and 500k submissions/day, that's roughly 500 MB/day, small compared to test case I/O data.
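A quick sketch of the worker-sizing arithmetic above, using the assumptions stated in this section (the 1.5x headroom factor is illustrative, not a measured value):

```python
# Back-of-envelope sizing for the judge worker fleet (assumptions from this section).
burst_submissions = 30_000   # 30% of 100k concurrent users submitting in a burst
drain_window_min = 10        # want the burst judged within 10 minutes
per_worker_rate = 5          # submissions/minute a single worker can judge

throughput_needed = burst_submissions / drain_window_min     # 3,000 submissions/min
workers_needed = throughput_needed / per_worker_rate          # 600 workers
workers_provisioned = int(workers_needed * 1.5)               # ~900-1,000 with headroom

print(f"need ~{workers_needed:.0f} workers, provision ~{workers_provisioned}")
```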
High-Level Architecture
The system separates into three planes: the problem and user plane, the execution plane, and the analytics plane. The problem and user plane is a standard web application — users browse problems, view editorial solutions, and submit code via a REST API backed by PostgreSQL and Redis. The execution plane is the core differentiator: submissions are placed on a priority queue (Kafka or SQS), pulled by judge workers running on isolated VMs, executed in gVisor or nsjail sandboxes, and results are written back to the database and pushed to the user via WebSocket.
The execution plane is intentionally decoupled from the web plane. Judge workers run on dedicated EC2 instances (not shared with the web tier) with strict security group rules — no outbound internet, no cross-worker communication. Each worker manages a pool of sandbox containers pre-warmed for each supported language runtime. Pre-warming eliminates JVM/Python interpreter startup latency from the critical path. Workers pull jobs from a queue, execute, and ACK on completion. A dead-letter queue handles timed-out or crashed executions.
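A minimal sketch of the pull-execute-ACK loop described above, assuming SQS as the queue. The queue URL, message shape, and run_in_sandbox helper are illustrative placeholders, not part of any real deployment:

```python
import json
import boto3

# Minimal judge-worker loop: pull a job, execute it, and ACK (delete) only on
# success, so crashed or timed-out executions are redelivered and eventually
# land in the dead-letter queue.
sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/judge-jobs"  # hypothetical

def run_in_sandbox(job: dict) -> dict:
    """Compile and run the submission inside nsjail/gVisor (omitted here)."""
    return {"submission_id": job["submission_id"], "status": "Accepted"}  # placeholder verdict

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        job = json.loads(msg["Body"])     # e.g. {submission_id, problem_id, language, code_s3_key}
        verdict = run_in_sandbox(job)     # a crash here means no delete, so the message is redelivered
        # write verdict to the database / push to the user over WebSocket (omitted)
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```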
Contest mode adds a real-time leaderboard component. Submission results are published to a Redis Sorted Set (keyed by contest_id, scored by penalty time per ICPC rules). A WebSocket server pushes leaderboard deltas to connected clients at 1-second intervals. During peak contests, the WebSocket tier scales horizontally with sticky sessions backed by Redis pub/sub for cross-instance fan-out.
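A sketch of the leaderboard writes and reads this implies, using redis-py Sorted Set commands. The key layout and the penalty-accumulation scheme are illustrative assumptions:

```python
import redis

r = redis.Redis()

def record_accepted(contest_id: str, user_id: str, penalty_ms: int) -> None:
    """Add ICPC-style penalty to a contestant's total; lower total penalty ranks higher."""
    key = f"contest:{contest_id}:leaderboard"          # hypothetical key layout
    r.zincrby(key, penalty_ms, user_id)                # accumulate penalty time as the score
    r.publish(f"contest:{contest_id}:updates",         # fan out the delta to WebSocket instances
              f"{user_id}:{penalty_ms}")

def top_n(contest_id: str, n: int = 50):
    """Read the top-N participants (lowest penalty first) for the leaderboard endpoint."""
    key = f"contest:{contest_id}:leaderboard"
    return r.zrange(key, 0, n - 1, withscores=True)
```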
Core Components
Sandbox Execution Engine
Each code submission is executed inside an nsjail or gVisor sandbox that enforces: no network syscalls (seccomp filter blocks socket(), connect()), memory limit (256 MB default, configurable per problem), CPU time limit (1-10 seconds), and a chroot filesystem with only the language runtime and standard library. The sandbox starts fresh for each submission — no shared state between runs. For interpreted languages (Python), a pre-forked interpreter process is reused across submissions within the same worker to amortize startup cost, with memory state reset between runs using a lightweight checkpoint/restore mechanism.
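A simplified sketch of the per-run CPU and memory caps, using only Python's resource module. This illustrates the limits, not the isolation: the no-network seccomp filter, chroot, and namespace setup are delegated to nsjail or gVisor in a real judge, and the binary and input paths are hypothetical:

```python
import resource
import subprocess

def run_limited(binary: str, stdin_path: str, cpu_seconds: int = 2, mem_mb: int = 256):
    """Run a compiled submission with CPU-time and address-space limits applied in the child."""
    def apply_limits():
        # Hard caps: SIGXCPU / allocation failure once the limits are hit.
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        resource.setrlimit(resource.RLIMIT_AS, (mem_mb * 1024 * 1024,) * 2)

    with open(stdin_path, "rb") as stdin:
        return subprocess.run(
            [binary],
            stdin=stdin,
            capture_output=True,
            timeout=cpu_seconds + 1,    # wall-clock guard on top of the CPU limit
            preexec_fn=apply_limits,    # runs in the child process before exec
        )
```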
Test Case Evaluation Engine
Test cases are stored in S3 (input/output file pairs) and cached on judge worker local NVMe storage. The evaluation engine runs the user's compiled binary against each test case, comparing stdout against the expected output with configurable comparison modes: exact match, whitespace-normalized, or custom checker (for problems with multiple valid outputs, e.g., graph problems). Runtime and peak memory are measured via Linux cgroups v2. Results are aggregated: "Accepted", "Wrong Answer" (with the failing test case index), "Time Limit Exceeded", "Memory Limit Exceeded", "Runtime Error", or "Compilation Error".
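A small sketch of the three comparison modes named above; the checker callback interface for multiple-valid-answer problems is an assumption:

```python
from typing import Callable, Optional

def outputs_match(actual: str, expected: str, mode: str = "exact",
                  checker: Optional[Callable[[str, str], bool]] = None) -> bool:
    """Compare a submission's stdout against the expected output.

    mode: "exact"   - byte-for-byte after stripping trailing newlines
          "tokens"  - whitespace-normalized comparison
          "checker" - problem-specific validator for problems with multiple valid outputs
    """
    if mode == "exact":
        return actual.rstrip("\n") == expected.rstrip("\n")
    if mode == "tokens":
        return actual.split() == expected.split()
    if mode == "checker" and checker is not None:
        return checker(actual, expected)
    raise ValueError(f"unknown comparison mode: {mode}")
```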
Plagiarism Detection Service
The plagiarism detection service runs asynchronously after each submission is accepted. It uses a token-based similarity algorithm (similar to MOSS — Measure Of Software Similarity) that normalizes code by removing variable names and formatting, then computes pairwise Jaccard similarity on sliding windows of token n-grams. Submissions above a configurable threshold (e.g., 80% similarity) are flagged for manual review. The service processes submissions in nightly batches for efficiency, storing fingerprints (MinHash signatures) in a columnar store (ClickHouse) for fast similarity queries across millions of past submissions.
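A crude sketch of the fingerprinting idea (identifier normalization, token n-grams, MinHash signatures whose slot-match rate approximates Jaccard similarity). Real MOSS-style normalization is language-aware; the regex tokenizer and signature length here are illustrative only:

```python
import hashlib
import re

NUM_HASHES = 64  # MinHash signature length (illustrative)

def normalize(code: str) -> list[str]:
    """Crude normalization: drop comments, collapse every identifier to a single token."""
    code = re.sub(r"#.*", "", code)                      # strip Python-style comments
    tokens = re.findall(r"[A-Za-z_]\w*|\S", code)
    return ["ID" if re.match(r"[A-Za-z_]", t) else t for t in tokens]

def ngrams(tokens: list[str], n: int = 5) -> set[str]:
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def minhash(grams: set[str]) -> list[int]:
    """One salted hash family per slot; keep the minimum hash value per slot."""
    return [min(int(hashlib.sha1(f"{salt}:{g}".encode()).hexdigest(), 16) for g in grams)
            for salt in range(NUM_HASHES)]

def similarity(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching slots estimates the Jaccard similarity of the n-gram sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```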
Database Design
PostgreSQL stores problems, users, and submission metadata. Key tables: problems (problem_id, title, difficulty, time_limit_ms, memory_limit_mb, test_case_count), submissions (submission_id, user_id, problem_id, language, code, status, runtime_ms, memory_kb, submitted_at), test_results (result_id, submission_id, test_case_index, status, runtime_ms), contest_participants (contest_id, user_id, score, penalty_ms). The submissions table grows fast — partition by month with pg_partman. The code column is stored as text; for large solutions consider S3 offload above 64 KB. Redis caches problem metadata, user session state, and contest leaderboards (Sorted Sets).
API Design
- POST /submissions — body: {problem_id, language, code}; enqueues submission, returns {submission_id, status: "queued"}; rate-limited to 5 submissions/minute per user
- GET /submissions/{submission_id} — returns submission status, result, runtime, memory; polled by frontend or received via WebSocket push
- POST /run — body: {language, code, stdin}; runs code against custom input in sandbox, returns stdout/stderr within 10s timeout; not persisted
- GET /problems/{problem_id} — returns problem statement, constraints, examples, tags; served from Redis cache with 1-hour TTL
- GET /contests/{contest_id}/leaderboard — returns top-N participants with scores; served from Redis Sorted Set
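A minimal sketch of the first endpoint in the list above, assuming FastAPI for the web tier and an SQS-backed judge queue. The queue URL, size cap, and rate-limit placement are illustrative assumptions:

```python
import json
import uuid

import boto3
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/judge-jobs"  # hypothetical

class Submission(BaseModel):
    problem_id: int
    language: str
    code: str

@app.post("/submissions")
def create_submission(body: Submission):
    # Rate limiting (5 submissions/minute/user) would be enforced here via a Redis counter (omitted).
    if len(body.code) > 64 * 1024:
        raise HTTPException(status_code=413, detail="solution too large")
    submission_id = str(uuid.uuid4())
    sqs.send_message(QueueUrl=QUEUE_URL,
                     MessageBody=json.dumps({"submission_id": submission_id, **body.dict()}))
    return {"submission_id": submission_id, "status": "queued"}
```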
Scaling & Bottlenecks
The judge worker fleet is the primary scaling constraint. Workers are stateful (pre-warmed language runtimes) and require dedicated CPU cores — they cannot be over-provisioned on shared VMs. During a LeetCode-scale weekly contest with 50k participants, the submission burst in the first 10 minutes can hit 5k submissions/minute. Pre-scaling the worker fleet 30 minutes before contest start (using scheduled auto-scaling) is more reliable than reactive scaling, which has a 3-5 minute warm-up lag. Spot instances reduce cost by 70% but require a fallback on-demand pool to avoid capacity gaps during AWS Spot reclamation events.
The database write path for submission results is a bottleneck at scale: 10k simultaneous executions, each producing per-test-case results, generate on the order of 10k writes/second. Batching result writes (workers accumulate 10 results before flushing to the DB) and using PostgreSQL's COPY for bulk inserts reduces lock contention. The WebSocket tier for contest leaderboards must handle 50k persistent connections — use a connection-aware load balancer (NLB at L4) with Redis pub/sub for cross-instance message delivery rather than HTTP/2 server-sent events, which struggle at this connection count.
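A sketch of the batching idea using psycopg2's copy_from against the test_results table from the schema above; the batch size and wiring are illustrative:

```python
import io
import psycopg2

BATCH_SIZE = 10  # flush after accumulating this many results, per the batching idea above

class ResultWriter:
    """Accumulate per-test-case results and bulk-load them with COPY instead of row-by-row INSERTs."""

    def __init__(self, dsn: str):
        self.conn = psycopg2.connect(dsn)
        self.buffer: list[tuple] = []

    def add(self, submission_id: int, test_case_index: int, status: str, runtime_ms: int):
        self.buffer.append((submission_id, test_case_index, status, runtime_ms))
        if len(self.buffer) >= BATCH_SIZE:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        data = io.StringIO("".join(f"{s}\t{i}\t{st}\t{r}\n" for s, i, st, r in self.buffer))
        with self.conn.cursor() as cur:
            cur.copy_from(data, "test_results",
                          columns=("submission_id", "test_case_index", "status", "runtime_ms"))
        self.conn.commit()
        self.buffer.clear()
```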
Key Trade-offs
- gVisor vs. nsjail: gVisor provides stronger syscall interception (user-space kernel) with ~15% CPU overhead per run; nsjail uses Linux namespaces with lower overhead but relies on seccomp policy correctness. For a public judge, gVisor's defense-in-depth justifies the performance cost.
- Pre-warm vs. cold-start containers: Pre-warming language runtimes reduces p95 latency from ~3s to ~200ms but consumes memory on idle workers; the trade-off favors pre-warming for the top 5 languages by submission volume.
- Synchronous vs. async result delivery: Polling with exponential backoff is simpler than WebSocket but adds latency; WebSocket push is used in contest mode where result latency directly affects leaderboard accuracy.
- Storing full code in DB vs. S3: Storing code in PostgreSQL simplifies queries (plagiarism detection, editorial comparison) but bloats the table; S3 offload with a metadata pointer is better above 10M submissions.