
System Design: Model Training Pipeline

Design a scalable machine learning model training pipeline that handles data ingestion, feature preparation, distributed training, hyperparameter optimization, and model evaluation. Covers GPU cluster management, experiment tracking, and training reproducibility.


Requirements

Functional Requirements:

  • Ingest training datasets from the data lake and feature store with point-in-time correctness
  • Execute distributed training jobs across multi-GPU, multi-node clusters
  • Track experiments: hyperparameters, metrics per epoch, system metrics, and artifacts
  • Support hyperparameter optimization (HPO) with parallel trial execution
  • Produce model artifacts (weights, tokenizer, preprocessing pipeline) stored with full provenance
  • Automate evaluation: run held-out test set evaluation and generate comparison reports vs. baseline

Non-Functional Requirements:

  • Training infrastructure must sustain 90% GPU utilization during active jobs
  • Experiment tracking must not add more than 5% overhead to training throughput
  • Training jobs must checkpoint at least every 10 minutes so that failed runs can recover from the latest checkpoint
  • Support jobs ranging from single-GPU fine-tuning (minutes) to 1,000-GPU pre-training (weeks)
  • All training runs must be reproducible given the same data snapshot and random seed

Scale Estimation

A large ML team runs 500 training experiments per day: 400 are small (<4 GPUs, <1 hour) and 100 are large (16–512 GPUs, 1–72 hours). GPU cluster: 2,000 A100 GPUs across 250 nodes (8 GPUs/node). Peak utilization: 1,800 GPUs busy = 90%. Data pipeline: reading 10 TB of training data per large job at 10 GB/s NVMe throughput requires pre-staged data on local cluster storage.
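
A quick back-of-envelope check of these figures (a sketch; the numbers are taken from the estimate above):

```python
# Back-of-envelope check of the scale estimates above.
gpus_per_node = 8
nodes = 250
total_gpus = gpus_per_node * nodes          # 2,000 GPUs
peak_busy = 1_800
peak_utilization = peak_busy / total_gpus   # 0.90 -> 90%

# Reading 10 TB per large job at 10 GB/s of local NVMe throughput:
dataset_tb = 10
nvme_gb_per_s = 10
read_seconds = dataset_tb * 1_000 / nvme_gb_per_s   # ~1,000 s per full pass
print(f"{peak_utilization:.0%} peak utilization, "
      f"~{read_seconds / 60:.0f} min to read one epoch from NVMe")
```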

High-Level Architecture

The training pipeline has four stages: Data Preparation, Training Orchestration, Experiment Tracking, and Model Evaluation. An orchestrator (Kubeflow Pipelines, MLflow Projects, or a custom DAG runner) manages the stage DAG for each training run and handles retries, resource allocation, and artifact passing between stages.

Data Preparation reads from the offline feature store and data lake, applies train/validation/test splits, normalizes features, and writes batches to a fast local cache (NVMe SSDs or FUSE-mounted S3 with local caching). This stage runs as a CPU-only preprocessing job before GPU training begins, ensuring GPU utilization is not throttled by data loading I/O. The dataset is sharded into files of 128 MB each and shuffled at shard level to ensure data diversity per worker.
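
A minimal sketch of shard-level shuffling, assuming 1,024 shard files pre-staged in a local NVMe cache; the file names, shard count, and seed are illustrative:

```python
import random

# Each ~128 MB shard is an independent file; shuffling the shard list (rather
# than individual rows) keeps reads sequential while still varying data order.
shards = [f"/cache/train/shard-{i:05d}.parquet" for i in range(1024)]

def shards_for_worker(shards, rank, world_size, seed=42):
    """Deterministically shuffle shards, then stripe them across workers."""
    rng = random.Random(seed)          # same seed on every worker -> same global order
    shuffled = shards[:]
    rng.shuffle(shuffled)
    return shuffled[rank::world_size]  # each worker reads a disjoint slice

worker_0_shards = shards_for_worker(shards, rank=0, world_size=8)
```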

Distributed training runs across multiple nodes using PyTorch DDP, or DeepSpeed for LLM training. The training launcher (torchrun, or horovodrun for Horovod) starts one process per GPU and initializes an NCCL process group for gradient synchronization. The forward and backward passes run independently on each GPU; after each mini-batch, an all-reduce operation averages gradients across all workers. Gradient accumulation enables effective batch sizes of 65,536+ tokens per optimizer step for language models.
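
A minimal sketch of the per-process loop that torchrun launches (one process per GPU); the model, data loader, learning rate, and accumulation factor are placeholder assumptions:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train(model, dataloader, epochs=1, accum_steps=8):
    # torchrun sets LOCAL_RANK for each process it spawns (one per GPU).
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")       # NCCL group for gradient all-reduce
    torch.cuda.set_device(local_rank)

    model = DDP(model.cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    criterion = torch.nn.CrossEntropyLoss()

    for epoch in range(epochs):
        for step, (x, y) in enumerate(dataloader):
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            loss = criterion(model(x), y) / accum_steps
            loss.backward()                       # gradients are all-reduced across workers
            if (step + 1) % accum_steps == 0:     # gradient accumulation
                optimizer.step()
                optimizer.zero_grad()

    dist.destroy_process_group()
```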

Core Components

Distributed Training Coordinator

Kubeflow's PyTorchJob or MPI Operator launches the master and worker pods on the GPU cluster. The master coordinates checkpointing and epoch transitions; workers execute forward/backward passes. Elastic training (PyTorch Elastic / TorchElastic) handles spot GPU preemption by dynamically adjusting the worker count while the job continues. After preemption, the job restores from the last checkpoint and resumes with the remaining GPUs.
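
A hedged sketch of the checkpoint save/restore logic this recovery path depends on; the mount path, file names, and single-file format are assumptions:

```python
import os
import torch

CKPT_DIR = "/mnt/checkpoints"   # assumed local mount; in practice synced to S3

def save_checkpoint(model, optimizer, epoch, step):
    """Write an atomic checkpoint so a preempted worker can resume.
    With DDP, typically only rank 0 calls this."""
    state = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "epoch": epoch,
        "step": step,
    }
    tmp = os.path.join(CKPT_DIR, "ckpt.tmp")
    torch.save(state, tmp)
    os.replace(tmp, os.path.join(CKPT_DIR, "ckpt_latest.pt"))  # atomic rename

def maybe_restore(model, optimizer):
    """Resume from the last checkpoint if one exists (e.g. after preemption)."""
    path = os.path.join(CKPT_DIR, "ckpt_latest.pt")
    if not os.path.exists(path):
        return 0, 0
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"], state["step"]
```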

Experiment Tracker (MLflow)

Every training run logs to MLflow: hyperparameters at run start, training and validation loss/metrics after each epoch, system metrics (GPU utilization, memory, throughput tokens/sec) every 30 seconds, and artifact paths (checkpoints, tokenizer, eval reports) on completion. MLflow UI allows side-by-side comparison of run metrics across hyperparameter configurations. The tracking server stores metadata in PostgreSQL and artifacts in S3.
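
A minimal logging sketch against the MLflow fluent API; the tracking URI, experiment name, and metric values are illustrative, not taken from the article:

```python
import mlflow

mlflow.set_tracking_uri("http://mlflow.internal:5000")   # assumed tracking server
mlflow.set_experiment("llm-finetune")

with mlflow.start_run(run_name="lr3e-4_bs128"):
    # Hyperparameters logged once at run start.
    mlflow.log_params({"lr": 3e-4, "batch_size": 128, "warmup_steps": 500})

    # Per-epoch training/validation metrics (dummy values stand in for real losses).
    for epoch, (train_loss, val_loss) in enumerate([(2.1, 2.3), (1.7, 1.9), (1.5, 1.8)]):
        mlflow.log_metric("train_loss", train_loss, step=epoch)
        mlflow.log_metric("val_loss", val_loss, step=epoch)
        mlflow.log_metric("gpu_util", 0.92, step=epoch)   # sample system metric

    # Artifacts (checkpoints, tokenizer, eval reports) logged on completion.
    mlflow.log_artifacts("checkpoints/final")
```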

Hyperparameter Optimization

Optuna or Ray Tune manages the HPO search. A study defines the search space (learning rate log-uniform 1e-5 to 1e-2, batch size {32,64,128,256}, warmup steps integer 100-2000). The Bayesian optimization sampler (TPE) suggests the next configuration based on prior trial results. Parallel trials run simultaneously on available GPU resources. Pruning (Hyperband) terminates poorly performing trials early by comparing intermediate metrics against the best-seen curve, reducing total GPU-hours by 60%.
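
A sketch of that study in Optuna, using the search space described above; the objective is a placeholder that a real pipeline would replace with an actual training trial reporting validation loss per epoch:

```python
import optuna

def run_training_epoch(lr, batch_size, warmup, epoch):
    # Placeholder: a real implementation would train for one epoch and
    # return validation loss.
    return (lr * 100 + 1.0 / batch_size) / (epoch + 1)

def objective(trial):
    # Search space from the study definition above.
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    batch_size = trial.suggest_categorical("batch_size", [32, 64, 128, 256])
    warmup = trial.suggest_int("warmup_steps", 100, 2000)

    val_loss = float("inf")
    for epoch in range(10):
        val_loss = run_training_epoch(lr, batch_size, warmup, epoch)
        trial.report(val_loss, step=epoch)      # intermediate metric for pruning
        if trial.should_prune():                # Hyperband cuts weak trials early
            raise optuna.TrialPruned()
    return val_loss

study = optuna.create_study(
    direction="minimize",
    sampler=optuna.samplers.TPESampler(),       # Bayesian (TPE) suggestions
    pruner=optuna.pruners.HyperbandPruner(),
)
study.optimize(objective, n_trials=100, n_jobs=4)   # parallel trials
```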

Database Design

MLflow PostgreSQL schema: experiments (experiment_id, name, artifact_location, lifecycle_stage), runs (run_id, experiment_id, status, start_time, end_time, artifact_uri), params (run_id, key, value), metrics (run_id, key, value, timestamp, step), tags (run_id, key, value). Checkpoint objects are stored in S3 at path s3://models/{experiment_id}/{run_id}/checkpoints/epoch_{n}/. A separate training_datasets table records: (dataset_id, feature_store_version, split_config_hash, s3_path, row_count, created_at) for reproducibility.
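
A small sketch of how a training_datasets record and a checkpoint path might be assembled; all values, and the choice of SHA-256 over the canonical split config, are illustrative assumptions:

```python
import hashlib
import json

# Hash the split configuration so identical splits map to the same hash.
split_config = {"train": 0.8, "validation": 0.1, "test": 0.1, "seed": 42}
split_config_hash = hashlib.sha256(
    json.dumps(split_config, sort_keys=True).encode()
).hexdigest()

# Fields follow the training_datasets schema above; values are made up.
dataset_row = {
    "dataset_id": "ds_2025_01_15_001",
    "feature_store_version": "v37",
    "split_config_hash": split_config_hash,
    "s3_path": "s3://datasets/snapshots/ds_2025_01_15_001/",
    "row_count": 120_000_000,
}

# Checkpoint path convention from the schema above.
experiment_id, run_id, epoch = "exp_42", "run_9f3c", 3
checkpoint_path = f"s3://models/{experiment_id}/{run_id}/checkpoints/epoch_{epoch}/"
```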

API Design

  • POST /training-runs — Submit a training job with model config, dataset ref, hyperparameters, and resource request; returns run_id.
  • GET /training-runs/{run_id}/metrics — Stream real-time training metrics as server-sent events during training.
  • POST /training-runs/{run_id}/checkpoints/{epoch}/restore — Restore a training run from a specific checkpoint.
  • POST /hpo/studies — Create a hyperparameter optimization study with search space and optimization objective.
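
A hedged usage example of this API; the host name, payload field names, auth, and response shape are assumptions, not part of the endpoint list above:

```python
import requests

BASE = "https://ml-platform.internal/api/v1"   # assumed service host

# Submit a training job and capture the returned run_id.
resp = requests.post(f"{BASE}/training-runs", json={
    "model_config": {"architecture": "transformer", "num_layers": 24},
    "dataset_ref": "ds_2025_01_15_001",
    "hyperparameters": {"lr": 3e-4, "batch_size": 128},
    "resources": {"gpus": 16, "gpu_type": "a100"},
})
run_id = resp.json()["run_id"]

# Stream per-step metrics as server-sent events while the job runs.
with requests.get(f"{BASE}/training-runs/{run_id}/metrics", stream=True) as stream:
    for line in stream.iter_lines():
        if line:
            print(line.decode())
```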

Scaling & Bottlenecks

Data loading throughput is the most common cause of GPU underutilization. PyTorch DataLoader with num_workers=8 and pin_memory=True pre-fetches batches asynchronously; NVIDIA DALI moves data preprocessing onto the GPU pipeline. For very large datasets (>100 TB), the WebDataset format stores data as streaming tar archives read sequentially from S3, eliminating random-access overhead while preserving shuffling via shard-level randomization.
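
A sketch of the asynchronous-prefetching DataLoader settings mentioned above; the dataset here is a dummy stand-in, and the worker/prefetch values are the illustrative ones from the paragraph:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for the real sharded training data.
dataset = TensorDataset(torch.randn(10_000, 512), torch.randint(0, 10, (10_000,)))

loader = DataLoader(
    dataset,
    batch_size=128,
    shuffle=True,
    num_workers=8,            # CPU workers decode/prepare batches in parallel
    pin_memory=True,          # page-locked host memory speeds host-to-GPU copies
    prefetch_factor=4,        # batches each worker keeps ready ahead of the GPU
    persistent_workers=True,  # avoid re-spawning workers every epoch
)

for x, y in loader:
    # non_blocking=True overlaps the copy with compute, thanks to pinned memory.
    x = x.cuda(non_blocking=True)
    y = y.cuda(non_blocking=True)
    break
```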

Gradient synchronization overhead scales with model size and number of workers. For large models (10B+ parameters), ZeRO (Zero Redundancy Optimizer, DeepSpeed) shards optimizer states, gradients, and parameters across workers, enabling training of models too large for a single GPU's memory. Pipeline parallelism (dividing the model vertically across GPUs) further reduces per-GPU memory requirements for the largest models.
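
A hedged sketch of enabling ZeRO stage 3 through DeepSpeed; the config values are illustrative, and the toy model stands in for a real multi-billion-parameter network:

```python
import torch
import deepspeed

model = torch.nn.Sequential(torch.nn.Linear(1024, 4096), torch.nn.Linear(4096, 1024))

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 16,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                              # shard optimizer state, gradients, parameters
        "offload_optimizer": {"device": "cpu"},  # optionally spill optimizer state to host RAM
    },
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

# deepspeed.initialize wraps the model in an engine that handles sharding,
# gradient accumulation, and mixed precision on its behalf.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```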

Key Trade-offs

  • Synchronous DDP vs. asynchronous SGD: Synchronous DDP ensures mathematically identical results to single-GPU training but requires all workers to sync after each step, making slow stragglers drag down the entire job; asynchronous SGD allows workers to proceed independently but introduces gradient staleness.
  • Spot vs. on-demand GPU instances: Spot instances reduce cost by 70% but require fault-tolerant checkpointing; elastic training handles spot preemption gracefully but adds implementation complexity.
  • Centralized experiment tracking vs. local logging: MLflow server centralizes comparison and collaboration but becomes a throughput bottleneck if every training step logs metrics synchronously; batching metric logs and flushing them asynchronously (see the sketch after this list) maintains tracking without impacting training throughput.
  • Full checkpoint vs. incremental checkpoint: Full checkpoints (saving all model weights) are simple but large (10–100 GB for big models); incremental checkpoints (saving weight deltas) reduce checkpoint size by 90% but add restore complexity.
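
A sketch of the batched, asynchronous metric logging mentioned in the tracking trade-off above; the queue-plus-background-thread design, flush batch size, and use of MLflow as the sink are assumptions:

```python
import queue
import threading
import mlflow

class AsyncMetricLogger:
    """Buffer metrics from the training loop and flush them off the hot path."""

    def __init__(self, flush_every=100):
        self._q = queue.Queue()
        self._flush_every = flush_every
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def log(self, key, value, step):
        self._q.put((key, value, step))   # non-blocking call from the training loop

    def _drain(self):
        batch = []
        while True:
            batch.append(self._q.get())
            if len(batch) >= self._flush_every:
                for key, value, step in batch:
                    mlflow.log_metric(key, value, step=step)   # flushed in the background
                batch.clear()
```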
