System Design: Code Execution Sandbox
System design of a code execution sandbox covering container isolation, seccomp system call filtering, resource limits, multi-language support, and secure execution for online coding platforms.
Requirements
Functional Requirements:
- Execute user-submitted code in 30+ programming languages (Python, JavaScript, Java, C++, Go, Rust, etc.)
- Provide stdin input and capture stdout/stderr output with exit codes
- Support multi-file projects with dependency installation (pip, npm, cargo)
- Real-time output streaming for long-running programs (live output as it's produced)
- Configurable resource limits: CPU time, wall clock time, memory, disk I/O, and network access
- Shared execution environments for collaborative coding (multiple users see the same sandbox state)
Non-Functional Requirements:
- Execute 10,000 code submissions/sec with P99 latency under 5 seconds for simple programs
- Complete isolation: user code cannot affect the host system, other users' code, or access unauthorized resources
- Prevent all known escape vectors: container breakout, privilege escalation, kernel exploits
- Support submissions up to 10MB of source code with 60-second maximum execution time
- Pre-warmed execution environments so execution starts in under a second, with no cold-start penalty
Scale Estimation
10,000 submissions/sec with an average execution time of 3 seconds = 30,000 concurrent sandboxes (Little's Law). Each sandbox requires 256MB memory (average) and 0.5 CPU core, giving ~7.5TB total RAM and 15,000 CPU cores. At 64 cores / 512GB RAM per server, CPU is the binding constraint: ~235 servers for the execution fleet. Container image storage: 30 language images × 2GB average = 60GB per node (cached locally). Compilation overhead: 40% of submissions require compilation (C++, Java, Rust), averaging 2 seconds each. Output data: 10K submissions/sec × 10KB average output = 100MB/sec.
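As a sanity check on these numbers, the sketch below (Go for illustration, all constants taken from the estimate above) applies Little's Law and shows that CPU, not RAM, determines the server count:

```go
package main

import "fmt"

func main() {
	// Little's Law: concurrency = arrival rate × average service time.
	const (
		submissionsPerSec = 10000.0
		avgExecSeconds    = 3.0
		memPerSandboxGB   = 0.25 // 256MB average
		cpuPerSandbox     = 0.5
		coresPerServer    = 64.0
		ramPerServerGB    = 512.0
	)
	concurrent := submissionsPerSec * avgExecSeconds // 30,000 sandboxes
	totalRAMGB := concurrent * memPerSandboxGB       // 7,500 GB ≈ 7.5 TB
	totalCores := concurrent * cpuPerSandbox         // 15,000 cores
	serversByCPU := totalCores / coresPerServer      // ~235 (the binding constraint)
	serversByRAM := totalRAMGB / ramPerServerGB      // ~15 (RAM is not the limit)

	fmt.Printf("concurrent=%.0f ramGB=%.0f cores=%.0f byCPU=%.0f byRAM=%.0f\n",
		concurrent, totalRAMGB, totalCores, serversByCPU, serversByRAM)
}
```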
High-Level Architecture
The system uses a multi-layer isolation architecture. The Submission API receives code, validates it (size limits, language detection), and publishes an execution request to a task queue (Redis-backed, partitioned by language for worker affinity). An Execution Orchestrator assigns the request to an available sandbox worker, considering language affinity (workers with warm containers for that language), current load, and geographic proximity (for latency-sensitive interactive coding).
The Sandbox Worker is the core component. Each worker node runs a pool of pre-warmed container instances using gVisor (a user-space kernel that intercepts system calls, providing a stronger isolation boundary than standard Linux containers). When an execution request arrives, the worker: (1) selects a warm container for the requested language, (2) copies the user's source code into the container via a tmpfs mount, (3) applies resource limits (CPU, memory, wall-clock timeout) via cgroups v2, (4) applies system call filtering via seccomp-bpf profiles (whitelisting only the syscalls needed for that language runtime), (5) executes the code as an unprivileged user inside the container, and (6) captures stdout/stderr and the exit code.
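A compressed sketch of that per-request path, assuming a Python submission and a hypothetical `runsc exec` invocation; warm-container selection, user mapping, and the seccomp/cgroup attachment are elided here and covered below:

```go
// Per-request worker path (steps 2-6 above), heavily simplified.
package sandbox

import (
	"bytes"
	"context"
	"os"
	"os/exec"
	"path/filepath"
	"time"
)

type Result struct {
	Stdout, Stderr []byte
	ExitCode       int
	TimedOut       bool
}

func Execute(containerID, codeDir string, source, stdin []byte, wallClock time.Duration) (Result, error) {
	// (2) Drop the user's source onto the container's tmpfs mount.
	if err := os.WriteFile(filepath.Join(codeDir, "main.py"), source, 0o644); err != nil {
		return Result{}, err
	}

	// (3) Enforce the wall-clock timeout; CPU and memory limits are applied
	// through cgroups v2 on the container (see Resource Limits below).
	ctx, cancel := context.WithTimeout(context.Background(), wallClock)
	defer cancel()

	// (5)+(6) Run inside the warm gVisor container and capture output.
	cmd := exec.CommandContext(ctx, "runsc", "exec", containerID, "python3", "/code/main.py")
	cmd.Stdin = bytes.NewReader(stdin)
	var stdout, stderr bytes.Buffer
	cmd.Stdout, cmd.Stderr = &stdout, &stderr

	err := cmd.Run()
	exitCode := 0
	if ee, ok := err.(*exec.ExitError); ok {
		exitCode = ee.ExitCode()
		err = nil // a non-zero exit is a user-program result, not a worker failure
	}
	return Result{
		Stdout:   stdout.Bytes(),
		Stderr:   stderr.Bytes(),
		ExitCode: exitCode,
		TimedOut: ctx.Err() == context.DeadlineExceeded,
	}, err
}
```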
The Output Streaming Layer provides real-time output for interactive use cases. The sandbox worker streams stdout/stderr over a Unix domain socket to a Streaming Service, which forwards output to the client over WebSocket. For non-interactive (batch) submissions, output is buffered and returned in the API response. A Cleanup Service recycles sandboxes after execution: it resets the container filesystem (using overlayfs — discard the upper layer and reset to the clean base image), resets cgroup counters, and returns the sandbox to the warm pool.
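A minimal sketch of the worker side of streaming, assuming a hypothetical socket path and no framing; a real implementation would multiplex stdout and stderr and tag each chunk:

```go
// Copy the child's stdout to the Streaming Service over a Unix domain
// socket as output is produced, rather than after the process exits.
package sandbox

import (
	"io"
	"net"
	"os/exec"
)

func streamOutput(cmd *exec.Cmd, socketPath string) error {
	conn, err := net.Dial("unix", socketPath)
	if err != nil {
		return err
	}
	defer conn.Close()

	stdout, err := cmd.StdoutPipe()
	if err != nil {
		return err
	}
	if err := cmd.Start(); err != nil {
		return err
	}
	// io.Copy forwards bytes as the pipe fills, so the client sees
	// output live for long-running programs.
	if _, err := io.Copy(conn, stdout); err != nil {
		return err
	}
	return cmd.Wait()
}
```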
Core Components
Container Isolation Stack
The isolation model uses defense-in-depth with four layers:
- Layer 1 — User namespace: the container runs in a separate user namespace where root inside the container maps to an unprivileged user on the host, preventing privilege escalation.
- Layer 2 — gVisor (runsc): gVisor implements a user-space Linux kernel (Sentry) that intercepts all system calls from the sandboxed process. Unlike a standard container, where syscalls go directly to the host kernel, gVisor interposes a compatibility layer that implements only a safe subset of Linux syscalls (~200 of ~400). This prevents kernel exploits from affecting the host.
- Layer 3 — seccomp-bpf: even within gVisor, a seccomp profile further restricts allowed syscalls to the minimum required for each language (e.g., Python needs read, write, mmap, and futex, but not mount, ptrace, or socket).
- Layer 4 — Network isolation: containers have no network interfaces by default (using network namespace isolation). Languages that require network access for dependency installation (pip install) use a separate build phase with a restricted network policy: access to package registries only, via an HTTP proxy with a domain whitelist.
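A sketch of how a per-language allowlist might be built with libseccomp (github.com/seccomp/libseccomp-golang); the syscall list below is a deliberately tiny, illustrative subset, not a working CPython profile. In production, profiles would be generated by tracing the runtime under test workloads, as noted in the trade-offs section.

```go
package sandbox

import (
	"syscall"

	seccomp "github.com/seccomp/libseccomp-golang"
)

// pythonSyscalls is illustrative only; a real profile is much larger.
var pythonSyscalls = []string{
	"read", "write", "mmap", "munmap", "futex",
	"brk", "close", "fstat", "exit_group",
}

func loadPythonProfile() error {
	// Anything not explicitly allowed fails with EPERM instead of killing
	// the process, which yields clearer error messages for users.
	deny := seccomp.ActErrno.SetReturnCode(int16(syscall.EPERM))
	filter, err := seccomp.NewFilter(deny)
	if err != nil {
		return err
	}
	for _, name := range pythonSyscalls {
		sc, err := seccomp.GetSyscallFromName(name)
		if err != nil {
			return err
		}
		if err := filter.AddRule(sc, seccomp.ActAllow); err != nil {
			return err
		}
	}
	// Load installs the BPF program for the calling process.
	return filter.Load()
}
```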
Resource Limits & Enforcement
Resource limits use cgroups v2 with the following defaults (configurable per submission):
- CPU: 2 seconds of CPU time (measured via cpu.stat in the cgroup, not wall clock), plus a wall-clock timeout of 60 seconds (catches infinite loops that yield the CPU).
- Memory: 256MB hard limit (cgroup memory.max); exceeding it triggers the OOM killer, which terminates the process with a clear error message.
- Disk: a tmpfs mount with a 50MB size limit for user code and output; the base filesystem (language runtime, libraries) is read-only via overlayfs.
- PIDs: maximum 100 processes/threads (pids.max in the cgroup) to prevent fork bombs.
- File descriptors: limited to 256 via setrlimit.
The sandbox monitor polls cgroup stats every 100ms and kills the process if any limit is about to be exhausted, returning a structured error ("Time Limit Exceeded", "Memory Limit Exceeded", "Output Limit Exceeded").
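A sketch of applying these limits through the cgroup v2 filesystem; the parent-cgroup path and helper names are assumptions, and production code would also move the sandboxed PID into cgroup.procs before exec and watch memory.events:

```go
package sandbox

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

const cgroupRoot = "/sys/fs/cgroup/sandboxes" // hypothetical parent cgroup

func setupCgroup(id string, memBytes, maxPids int64) (string, error) {
	dir := filepath.Join(cgroupRoot, id)
	if err := os.Mkdir(dir, 0o755); err != nil {
		return "", err
	}
	limits := map[string]string{
		"memory.max": strconv.FormatInt(memBytes, 10), // hard limit; OOM kill past this
		"pids.max":   strconv.FormatInt(maxPids, 10),  // fork-bomb protection
		"cpu.max":    "50000 100000",                  // 0.5 core: 50ms quota per 100ms period
	}
	for file, value := range limits {
		if err := os.WriteFile(filepath.Join(dir, file), []byte(value), 0o644); err != nil {
			return "", fmt.Errorf("write %s: %w", file, err)
		}
	}
	return dir, nil
}

// cpuTimeUsed reads usage_usec from cpu.stat; the monitor polls this every
// 100ms and kills the sandbox once it crosses the CPU-time budget.
func cpuTimeUsed(cgroupDir string) (int64, error) {
	data, err := os.ReadFile(filepath.Join(cgroupDir, "cpu.stat"))
	if err != nil {
		return 0, err
	}
	var usec int64
	// cpu.stat's first line is "usage_usec <n>".
	if _, err := fmt.Sscanf(string(data), "usage_usec %d", &usec); err != nil {
		return 0, err
	}
	return usec, nil
}
```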
Pre-warming & Container Pool
Cold-starting a container takes 500ms-2 seconds (image layer setup, process initialization). To achieve sub-second execution start, each worker maintains a pool of pre-warmed containers per language. A warm container has the language runtime loaded and is idle, waiting for code injection. Pool sizing is dynamic: the orchestrator tracks submission rates per language and adjusts pool sizes using an AIMD (Additive Increase, Multiplicative Decrease) algorithm. Popular languages (Python, JavaScript) maintain 50+ warm containers per worker; rare languages (Haskell, Erlang) maintain 2-3. Container recycling after execution takes 50ms (overlayfs reset + cgroup reset), making recycled containers nearly as fast as pre-warmed ones.
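The AIMD adjustment itself is small; a sketch with made-up thresholds and step sizes:

```go
package sandbox

type poolSizer struct {
	size     int // current warm-pool target for one language on one worker
	min, max int
}

// adjust is called once per control interval with counts from that window.
func (p *poolSizer) adjust(coldStarts, warmExecutions int) int {
	switch {
	case coldStarts > 0:
		// Demand spilled past the warm pool: additive increase.
		p.size += 2
	case warmExecutions < p.size/2:
		// Pool mostly idle: multiplicative decrease.
		p.size /= 2
	}
	if p.size < p.min {
		p.size = p.min
	}
	if p.size > p.max {
		p.size = p.max
	}
	return p.size
}
```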
Database Design
The system is largely stateless — sandboxes are ephemeral. Persistent storage is minimal. PostgreSQL stores submission metadata: submissions (submission_id UUID PK, user_id, language, source_code_hash, status ENUM(queued, running, completed, failed, timeout, oom), created_at, started_at, completed_at, execution_time_ms, memory_peak_bytes, exit_code). Output data (stdout/stderr) is stored in S3 for submissions that request persistence, with a 24-hour default TTL.
Worker state is tracked in Redis: worker_pool:{worker_id} (hash with fields: total_containers, warm_containers:{language}, active_executions, cpu_utilization, memory_utilization). The orchestrator uses this data for scheduling decisions. Language configuration is stored in PostgreSQL: languages (language_id, name, version, docker_image, compile_command nullable, run_command, seccomp_profile_path, default_timeout_ms, default_memory_mb, supported_extensions ARRAY). Execution metrics flow through Kafka to a monitoring system (Prometheus + Grafana) tracking: submission rate, execution time distribution, resource limit violations, and sandbox recycling efficiency.
API Design
- POST /api/v1/execute — Submit code for execution; body contains language, source_code, stdin, resource_limits (optional); returns submission_id and result (or queued status)
- GET /api/v1/submissions/{submission_id} — Fetch execution result; returns status, stdout, stderr, exit_code, execution_time_ms, memory_peak_bytes
- WS /api/v1/execute/stream — WebSocket for interactive execution with real-time output streaming; send code, receive output chunks
- GET /api/v1/languages — List supported languages with versions and default resource limits
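A hypothetical client call against POST /api/v1/execute; the host is a placeholder and the field names mirror the endpoint description above:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type ExecuteRequest struct {
	Language   string `json:"language"`
	SourceCode string `json:"source_code"`
	Stdin      string `json:"stdin,omitempty"`
}

type ExecuteResponse struct {
	SubmissionID string `json:"submission_id"`
	Status       string `json:"status"`
	Stdout       string `json:"stdout,omitempty"`
	Stderr       string `json:"stderr,omitempty"`
	ExitCode     int    `json:"exit_code"`
}

func main() {
	body, _ := json.Marshal(ExecuteRequest{
		Language:   "python",
		SourceCode: "print(input())",
		Stdin:      "hello",
	})
	resp, err := http.Post("https://api.example.com/api/v1/execute",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out ExecuteResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Println(out.Status, out.Stdout)
}
```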
Scaling & Bottlenecks
The execution fleet is the primary bottleneck. At 30,000 concurrent sandboxes, each requiring 256MB RAM and 0.5 CPU cores, the fleet needs ~235 servers. Auto-scaling based on queue depth adds capacity in 2-3 minutes (provisioning new EC2 instances and pulling container images). During flash traffic (coding competitions with 50K simultaneous users), a queuing system with backpressure prevents overloading: requests beyond capacity are queued with an estimated wait time communicated to the client. Priority scheduling ensures premium users get immediate execution while free-tier users may wait 5-10 seconds during peaks.
Compilation-heavy languages (C++, Rust) are disproportionately expensive. A Rust compilation can take 30 seconds and consume 2GB of memory. The system applies separate resource limits for compilation vs execution and uses compilation caching: if the same source code hash has been compiled before, the cached binary is reused. This cache (stored on local NVMe SSDs) achieves a 30% hit rate for competitive programming platforms where many users submit similar solutions.
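A sketch of the content-addressed cache lookup, keyed by a hash of the source plus the toolchain version so a compiler upgrade invalidates stale binaries; the paths and layout are assumptions:

```go
package sandbox

import (
	"crypto/sha256"
	"encoding/hex"
	"os"
	"path/filepath"
)

const cacheDir = "/var/cache/sandbox/binaries" // local NVMe, per the text

func cacheKey(source []byte, toolchain string) string {
	h := sha256.New()
	h.Write(source)
	h.Write([]byte(toolchain)) // same source, different compiler => different key
	return hex.EncodeToString(h.Sum(nil))
}

// lookup returns the cached binary path if this exact source/toolchain
// pair has been compiled before.
func lookup(source []byte, toolchain string) (string, bool) {
	p := filepath.Join(cacheDir, cacheKey(source, toolchain))
	if _, err := os.Stat(p); err != nil {
		return "", false
	}
	return p, true
}
```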
Key Trade-offs
- gVisor vs standard container isolation (runc): gVisor adds 5-10% execution overhead due to syscall interception but prevents kernel exploit-based container escapes — essential for a multi-tenant code execution platform where adversarial input is expected
- Pre-warmed pools vs on-demand container creation: Pre-warming wastes resources on idle containers but provides sub-second start times — the AIMD algorithm balances pool size with utilization efficiency
- Per-language seccomp profiles vs a single permissive profile: Per-language profiles minimize the attack surface (Python doesn't need socket syscalls for most use cases) but require maintenance for 30+ languages — automated profile generation via syscall tracing during test runs reduces this burden
- Overlayfs recycling vs fresh container creation: Recycling (discard upper layer, reset cgroup) takes 50ms vs 500ms for fresh creation, but risks state leakage if the reset is incomplete — comprehensive filesystem and process cleanup checklists mitigate this risk (see the reset sketch below)
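A sketch of that overlayfs reset with illustrative paths; a production reset would also kill leftover processes and clear cgroup counters before returning the sandbox to the warm pool:

```go
package sandbox

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func recycleOverlay(lower, upper, work, merged string) error {
	if err := unix.Unmount(merged, 0); err != nil {
		return err
	}
	// Discarding the upper layer throws away everything the user wrote;
	// the read-only lower layer (the base language image) is untouched.
	for _, dir := range []string{upper, work} {
		if err := os.RemoveAll(dir); err != nil {
			return err
		}
		if err := os.Mkdir(dir, 0o755); err != nil {
			return err
		}
	}
	opts := fmt.Sprintf("lowerdir=%s,upperdir=%s,workdir=%s", lower, upper, work)
	return unix.Mount("overlay", merged, "overlay", 0, opts)
}
```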