System Design: Batch Processing System

Design a distributed batch processing system capable of processing petabytes of data on a scheduled or on-demand basis using MapReduce, Spark, or similar frameworks. Covers job scheduling, resource management, fault tolerance, and cost optimization.

Requirements

Functional Requirements:

  • Execute large-scale data transformation jobs (sorting, aggregation, joins, ML feature computation)
  • Schedule jobs with cron-based and dependency-based triggers
  • Support multi-tenant job submission with resource quotas per team
  • Provide job monitoring: logs, metrics, progress tracking, and failure alerts
  • Allow job parameterization and support for templated job libraries
  • Handle job dependencies: Job B starts only after Job A succeeds

Non-Functional Requirements:

  • Process 10 PB per day across all jobs within nightly batch windows
  • Individual job failures must not affect other running jobs
  • Resource utilization target of 75% across the cluster during batch windows
  • Cost-optimized: use spot/preemptible instances for 80% of compute capacity
  • Job SLA tracking with automatic escalation if a job exceeds its time budget

Scale Estimation

A large-scale batch platform runs 5,000 jobs per day at an average of 2 TB of input per job, i.e., 10 PB/day. Peak concurrency is 500 simultaneous jobs. Each Spark job uses an average of 20 cores and 80 GB of RAM, so the cluster needs 10,000 cores and 40 TB of RAM at peak. With an 80% spot instance mix, cluster cost drops by roughly 65% versus on-demand pricing.
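
The arithmetic above is easy to sanity-check in a few lines. A minimal sketch, assuming the job counts and per-job resources stated here plus a ~81% per-instance spot discount (an assumption chosen to be consistent with the ~65% blended savings):

    # Back-of-envelope check of the capacity figures quoted above.
    jobs_per_day = 5_000
    avg_input_tb_per_job = 2
    peak_concurrent_jobs = 500
    cores_per_job, ram_gb_per_job = 20, 80

    daily_pb = jobs_per_day * avg_input_tb_per_job / 1_000        # 10.0 PB/day
    peak_cores = peak_concurrent_jobs * cores_per_job             # 10,000 cores
    peak_ram_tb = peak_concurrent_jobs * ram_gb_per_job / 1_000   # 40.0 TB

    # Assumed ~81% per-instance spot discount at an 80/20 spot mix,
    # consistent with the ~65% blended savings quoted above.
    spot_mix, spot_discount = 0.80, 0.81
    blended_savings = spot_mix * spot_discount                    # ~0.65
    print(daily_pb, peak_cores, peak_ram_tb, f"{blended_savings:.0%}")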

High-Level Architecture

The platform centers on a resource manager (YARN or Kubernetes) that allocates cluster resources across competing jobs. Jobs are submitted via a REST API or CLI to a Job Service, which validates the job definition, resolves dependencies, and enqueues it in the scheduler. The scheduler (Airflow, Oozie, or a custom priority queue) dispatches ready jobs to the resource manager when capacity is available.

Compute is provided by Apache Spark on Kubernetes (Spark Operator) or EMR on EC2. Spot instance pools are organized by instance type and AZ; a spot interruption handler gracefully checkpoints running tasks and reschedules them on on-demand capacity within 30 seconds of a spot reclamation notice. Data locality is enforced by launching Spark executors on nodes where the relevant S3 data is cached (Alluxio) to minimize network I/O.
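
A minimal sketch of the interruption handler, assuming it runs as a node-local daemon. The EC2 metadata endpoint is real (it returns 404 until AWS issues a reclamation notice, and IMDSv2 would additionally require a session token); drain_node is a hypothetical callback that checkpoints tasks and cordons the node:

    import time
    import requests

    IMDS = "http://169.254.169.254/latest/meta-data/spot/instance-action"

    def watch_for_spot_interruption(drain_node, poll_seconds=5):
        """Poll instance metadata until a spot reclamation notice appears
        (AWS gives roughly two minutes of warning)."""
        while True:
            try:
                resp = requests.get(IMDS, timeout=1)
                if resp.status_code == 200:
                    # Notice received: checkpoint running tasks and cordon
                    # the node so work moves to on-demand capacity.
                    drain_node(resp.json())  # e.g. {"action": "terminate", ...}
                    return
            except requests.RequestException:
                pass  # metadata service transiently unreachable; keep polling
            time.sleep(poll_seconds)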

A cost optimization layer continuously right-sizes cluster requests: if a job's historical executor utilization is below 50%, the system recommends (or auto-applies) reduced resource requests. Idle cluster nodes are returned to the instance pool after 5 minutes of inactivity to minimize cost.
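
A sketch of that right-sizing rule; the function name, the (cpu_frac, mem_frac) sample format, and the 20% headroom factor are illustrative assumptions, not a fixed interface:

    def rightsize(req_cores, req_mem_gb, samples, floor=0.50, headroom=1.2):
        """Recommend smaller resource requests when historical utilization
        stays below the 50% floor. `samples` is a list of (cpu_frac,
        mem_frac) utilization samples from the metrics store."""
        peak_cpu = max(cpu for cpu, _ in samples)
        peak_mem = max(mem for _, mem in samples)
        if peak_cpu >= floor or peak_mem >= floor:
            return req_cores, req_mem_gb  # at least one dimension well used
        # Size to observed peak plus 20% headroom, never below 1 core / 1 GB.
        return (max(1, round(req_cores * peak_cpu * headroom)),
                max(1, round(req_mem_gb * peak_mem * headroom)))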

Core Components

Job Scheduler

The scheduler maintains a DAG of job dependencies and a priority queue sorted by SLA urgency and team priority class. It polls the resource manager for available capacity and dispatches the highest-priority eligible job. Back-filling allows lower-priority jobs to use capacity that would otherwise sit idle. SLA tracking records expected completion times and fires PagerDuty alerts 30 minutes before a projected SLA breach.
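
In outline, the dispatch loop might look like the sketch below; the field names (priority_class, sla_deadline, expected_runtime_s, resource_request) and the has_capacity/submit callbacks are assumptions standing in for resource-manager calls:

    import time

    def priority_key(job):
        """Lower sorts first: priority class dominates, then SLA slack
        (time remaining before the job must start to make its SLA)."""
        slack = job["sla_deadline"] - time.time() - job["expected_runtime_s"]
        return (job["priority_class"], slack)

    def dispatch_ready(jobs, has_capacity, submit):
        # Walk ready jobs in urgency order. Skipping a job that doesn't fit
        # and trying the next one is the back-filling behavior described above.
        for job in sorted(jobs, key=priority_key):
            if has_capacity(job["resource_request"]):
                submit(job)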

Distributed Execution Engine

Apache Spark's driver coordinates a distributed execution plan across 10–1,000 executor JVMs. The Catalyst optimizer rewrites SQL queries into efficient physical plans using predicate pushdown, partition pruning, and join reordering. Dynamic partition pruning (DPP) reduces shuffle data by filtering partitions at runtime based on the results of a broadcast hash join. Adaptive Query Execution (AQE) re-optimizes the plan mid-execution using runtime statistics.
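
In Spark 3.x these optimizations are mostly configuration. The keys below are real Spark SQL settings; the app name and resource numbers are illustrative:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("nightly-aggregation")
             # AQE: re-plan mid-execution from runtime statistics.
             .config("spark.sql.adaptive.enabled", "true")
             .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
             .config("spark.sql.adaptive.skewJoin.enabled", "true")
             # DPP: prune partitions at runtime from the join's build side.
             .config("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
             .config("spark.executor.cores", "5")
             .config("spark.executor.memory", "20g")
             .getOrCreate())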

Checkpoint & Recovery Manager

For long-running jobs (>1 hour), the system periodically checkpoints intermediate shuffle data and stage outputs to S3. On failure, the job restarts from the last successful stage rather than re-executing the entire job. Idempotent writes (write to a temp path, atomic rename on success) prevent partial writes from corrupting target tables. A dead letter queue captures records that fail transformation after 3 retries for manual inspection.
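
The temp-path-plus-rename pattern can be sketched as follows; fs_rename stands in for whatever atomic publish step the storage layer offers (an HDFS rename, or an S3 committer / table-format commit, since raw S3 renames are not atomic):

    import uuid

    def idempotent_write(df, final_path, fs_rename):
        """Write to a unique temp path, then publish with one atomic step.
        A failed attempt leaves only a discardable temp directory, never a
        half-written target table."""
        tmp_path = f"{final_path}__tmp_{uuid.uuid4().hex}"
        df.write.mode("overwrite").parquet(tmp_path)  # may fail part-way; that's fine
        fs_rename(tmp_path, final_path)               # the single commit point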

Database Design

A job metadata store (PostgreSQL) contains: jobs (job_id, name, owner_team, resource_request, schedule_cron, sla_minutes, status), job_runs (run_id, job_id, start_time, end_time, status, input_bytes, output_bytes, cost_usd, spark_app_id), and job_dependencies (upstream_job_id, downstream_job_id). A time-series metrics store (TimescaleDB or InfluxDB) stores per-job resource utilization sampled every 30 seconds for capacity planning.
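
Illustrative PostgreSQL DDL for those three tables, executed here via psycopg2; the exact types and the connection string are assumptions:

    import psycopg2

    DDL = """
    CREATE TABLE IF NOT EXISTS jobs (
        job_id           BIGSERIAL PRIMARY KEY,
        name             TEXT NOT NULL,
        owner_team       TEXT NOT NULL,
        resource_request JSONB NOT NULL,   -- cores, memory, spot policy
        schedule_cron    TEXT,
        sla_minutes      INTEGER,
        status           TEXT NOT NULL DEFAULT 'active'
    );
    CREATE TABLE IF NOT EXISTS job_runs (
        run_id       BIGSERIAL PRIMARY KEY,
        job_id       BIGINT REFERENCES jobs(job_id),
        start_time   TIMESTAMPTZ,
        end_time     TIMESTAMPTZ,
        status       TEXT,
        input_bytes  BIGINT,
        output_bytes BIGINT,
        cost_usd     NUMERIC(12, 2),
        spark_app_id TEXT
    );
    CREATE TABLE IF NOT EXISTS job_dependencies (
        upstream_job_id   BIGINT REFERENCES jobs(job_id),
        downstream_job_id BIGINT REFERENCES jobs(job_id),
        PRIMARY KEY (upstream_job_id, downstream_job_id)
    );
    """

    with psycopg2.connect("dbname=batch_platform") as conn:
        conn.cursor().execute(DDL)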

API Design

  • POST /jobs — Register a new job definition with resource requests, schedule, dependencies, and SLA.
  • POST /jobs/{job_id}/runs — Trigger an immediate run with optional parameter overrides.
  • GET /jobs/{job_id}/runs/{run_id}/logs — Stream executor logs from the Spark History Server.
  • GET /cluster/capacity — Return current cluster utilization, spot availability, and queue depth by priority class.
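
A sketch of a client interaction against these endpoints; the base URL and payload fields are hypothetical:

    import requests

    BASE = "https://batch-platform.internal/api/v1"  # hypothetical host

    # Register a job, then trigger an ad-hoc run with a parameter override.
    job = requests.post(f"{BASE}/jobs", json={
        "name": "daily_sessionization",
        "owner_team": "growth",
        "resource_request": {"executors": 50, "cores": 4, "memory_gb": 16},
        "schedule_cron": "0 2 * * *",          # 02:00 nightly
        "sla_minutes": 180,
        "dependencies": ["raw_event_ingest"],  # upstream job must succeed first
    }).json()

    run = requests.post(f"{BASE}/jobs/{job['job_id']}/runs",
                        json={"params": {"date": "2025-01-15"}}).json()
    print(run["run_id"], run["status"])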

Scaling & Bottlenecks

The scheduler becomes a bottleneck when tracking thousands of concurrent jobs with complex dependency graphs. Moving from a centralized scheduler to a distributed priority queue (backed by Redis Sorted Sets) and a pull-based dispatch model allows horizontal scaling of scheduler workers. The resource manager's API server also becomes a bottleneck; rate limiting job submission and batching status updates reduces API server load.
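
A minimal sketch of that pull-based model using Redis's ZADD/ZPOPMIN; the score-encoding scheme and the dispatch callback are illustrative:

    import time
    import redis

    r = redis.Redis()
    QUEUE = "jobs:ready"  # sorted set: member = job_id, score = urgency

    def enqueue(job_id, priority_class, sla_deadline_ts):
        # Priority class in the high digits so class dominates; the SLA
        # deadline (unix seconds) breaks ties within a class.
        r.zadd(QUEUE, {job_id: priority_class * 1e12 + sla_deadline_ts})

    def scheduler_worker(dispatch):
        """ZPOPMIN is atomic, so many workers can pull from one queue
        without ever double-dispatching a job."""
        while True:
            popped = r.zpopmin(QUEUE, count=1)
            if not popped:
                time.sleep(0.5)  # queue drained; back off briefly
                continue
            job_id, _score = popped[0]
            dispatch(job_id.decode())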

Shuffle spill to disk is the most common cause of job slowdowns: when shuffle data exceeds executor memory, it spills to local disk, cutting throughput by roughly 10x. Mitigations include increasing executor memory, repartitioning to raise parallelism and shrink per-partition size, and using Spark's off-heap memory for shuffle buffers. An external shuffle service, decoupled from the executor lifecycle, lets executors restart without losing shuffle data, which improves spot tolerance.
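
Those mitigations map onto standard Spark configuration keys; the values below are illustrative starting points, not tuning advice:

    # Submit-time configs for the shuffle mitigations described above.
    shuffle_tuning = {
        "spark.executor.memory": "20g",           # more headroom before spilling
        "spark.sql.shuffle.partitions": "2000",   # smaller, more parallel partitions
        "spark.memory.offHeap.enabled": "true",   # off-heap shuffle buffers
        "spark.memory.offHeap.size": "8g",
        "spark.shuffle.service.enabled": "true",  # shuffle data outlives executors
    }

    # Rendered as spark-submit flags: one --conf key=value per entry.
    print(" ".join(f"--conf {k}={v}" for k, v in shuffle_tuning.items()))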

Key Trade-offs

  • Spot vs. on-demand instances: Spot reduces cost by 65% but introduces preemption risk; using spot for map stages (stateless) and on-demand for reduce/aggregation stages (stateful) balances cost and reliability.
  • YARN vs. Kubernetes: YARN is mature for Hadoop-centric workloads with fine-grained resource scheduling; Kubernetes provides container isolation, better multi-tenancy, and supports non-Spark workloads but has higher Spark scheduling overhead.
  • Coarse vs. fine-grained resource allocation: Coarse allocation (allocate full cluster per job) simplifies scheduling but wastes resources; fine-grained allocation (per-task cores) maximizes utilization but increases scheduling complexity.
  • Local shuffle vs. remote shuffle service: Local shuffle is faster but ties executor memory and disk to shuffle data; a remote shuffle service (e.g., Uber's Zeus or Apache Celeborn) decouples compute from shuffle storage, enabling independent scaling.
