System Design: Real-Time ML Inference Service
Design a production-grade real-time ML inference service that serves predictions from multiple model types with low latency, high availability, and autoscaling. Covers the serving stack, feature assembly, SLA enforcement, and observability for online ML models.
Requirements
Functional Requirements:
- Serve predictions from multiple model types: tabular (XGBoost), neural networks (TensorFlow/PyTorch), and ensemble models
- Fetch online features from the feature store and assemble them with request context for inference
- Support synchronous (request-response) and asynchronous (fire-and-forget with callback) inference modes
- Log every prediction with input features, output prediction, model version, and latency for monitoring
- Enable shadow mode: run a new model version on live traffic without returning its predictions to clients
- Handle model-specific preprocessing and postprocessing as part of the inference pipeline
Non-Functional Requirements:
- P99 latency under 100ms including feature fetch, inference, and postprocessing
- Support 100,000 inference requests per second across all deployed models
- 99.99% availability; degraded-mode fallback to a simpler rule-based prediction when the model is unavailable
- Auto-scale inference capacity within 2 minutes of a traffic spike
- Feature fetch must not add more than 10ms to inference latency
Scale Estimation
100,000 requests/second at a 100ms latency budget means up to roughly 10,000 concurrent in-flight requests (Little's law). Each request fetches 50 features from Redis (5ms), runs inference (10ms for XGBoost on CPU, 20ms for a neural network on GPU), postprocesses (1ms), and logs to Kafka (asynchronously, so effectively 0ms on the critical path). At 100,000 RPS, Redis must sustain 5 million feature reads/second (50 features per request). A Redis Cluster with 10 nodes handles this comfortably.
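A quick back-of-envelope check of those figures (a sketch; every input is an estimate from this section, not a measured value):

```python
# Back-of-envelope sizing using the estimates above.

rps = 100_000                 # target request rate
latency_budget_s = 0.100      # P99 budget per request

# Little's law: in-flight requests = arrival rate x time in system
in_flight = rps * latency_budget_s                  # -> 10,000 concurrent requests

features_per_request = 50
redis_reads_per_s = rps * features_per_request      # -> 5,000,000 reads/s

redis_nodes = 10
reads_per_node = redis_reads_per_s / redis_nodes    # -> 500,000 reads/s per node

print(f"in-flight requests: {in_flight:,.0f}")
print(f"Redis reads/s:      {redis_reads_per_s:,.0f} ({reads_per_node:,.0f} per node)")
```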
High-Level Architecture
The inference service is organized as a chain of handlers: Auth/Rate Limiter → Feature Assembler → Inference Engine → Postprocessor → Response Logger. This chain runs within a single request-handling thread pool. The Feature Assembler fetches online features from the feature store and merges them with request context (payload features passed by the caller). The Inference Engine routes to the appropriate model worker based on the model ID in the request.
Model workers are isolated processes (or containers) that host a single model type. XGBoost models run in Python worker processes via ONNX Runtime; TensorFlow models run in TF Serving; PyTorch models run in TorchServe. An inference orchestrator process dispatches requests to model workers over a local UNIX socket or gRPC, batching requests together when they arrive within a 1ms window. Shadow mode is implemented by forking the feature vector to both the production model and the shadow model, discarding the shadow result but logging it to a separate Kafka topic for offline analysis.
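A minimal sketch of the 1ms batching window in the orchestrator, assuming an asyncio dispatcher and a caller-supplied `predict_batch` coroutine (both names are illustrative; the real dispatch would cross a UNIX socket or gRPC boundary to the worker process):

```python
import asyncio

class MicroBatcher:
    """Collects requests arriving within a short window and dispatches them
    to the model worker as one batch (sketch, not production code)."""

    def __init__(self, predict_batch, window_ms: float = 1.0, max_batch: int = 64):
        self.predict_batch = predict_batch   # async fn: list of feature vectors -> list of predictions
        self.window = window_ms / 1000.0
        self.max_batch = max_batch
        self.queue: asyncio.Queue = asyncio.Queue()

    async def predict(self, features):
        # Each caller enqueues its feature vector with a future for its result.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((features, fut))
        return await fut

    async def run(self):
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]            # block for the first request
            deadline = loop.time() + self.window
            while len(batch) < self.max_batch:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            predictions = await self.predict_batch([f for f, _ in batch])
            for (_, fut), pred in zip(batch, predictions):
                fut.set_result(pred)
```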
A prediction log Kafka topic captures every inference event: (request_id, user_id, model_id, model_version, feature_vector_hash, prediction, confidence, latency_ms, timestamp). A Flink job aggregates these logs to compute: per-model request rate, average latency, error rate, and prediction distribution drift — all updated in real time and displayed on the ML operations dashboard.
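Keeping the prediction log off the critical path is what makes the "0ms" figure in the scale estimate hold. A sketch using confluent-kafka's non-blocking produce (topic name and broker address are illustrative; the event fields follow the schema above):

```python
import json
import time
import uuid

from confluent_kafka import Producer  # assumes confluent-kafka is installed

producer = Producer({"bootstrap.servers": "kafka:9092"})

def log_prediction(model_id, model_version, user_id, feature_vector_hash,
                   prediction, confidence, latency_ms):
    """Fire-and-forget produce; delivery happens in the background, so the
    request thread never waits on the broker."""
    event = {
        "request_id": str(uuid.uuid4()),
        "user_id": user_id,
        "model_id": model_id,
        "model_version": model_version,
        "feature_vector_hash": feature_vector_hash,
        "prediction": prediction,
        "confidence": confidence,
        "latency_ms": latency_ms,
        "timestamp": time.time(),
    }
    producer.produce("prediction-log", value=json.dumps(event).encode())
    producer.poll(0)   # serve delivery callbacks without blocking
```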
Core Components
Feature Assembler
The feature assembler fetches feature groups in parallel using async I/O (aioredis for Python, Lettuce for Java). Features are organized into groups by entity type (user features, item features, context features). The assembler pre-fetches common feature groups in the connection warm-up phase and caches them in a request-scoped context. Feature validation (check for null values, value range violations) runs after fetching; missing features are replaced with model-registered default values rather than failing the request.
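A sketch of the parallel feature-group fetch using redis-py's asyncio client (the successor to aioredis), with the key layout described under Database Design; the defaults handling mirrors the "replace, don't fail" rule above, and the function names are illustrative:

```python
import redis.asyncio as aioredis  # redis-py's asyncio client (successor to aioredis)

r = aioredis.Redis(host="feature-redis", decode_responses=True)

async def assemble_features(entity_ids: dict, feature_groups: list,
                            defaults: dict, request_context: dict) -> dict:
    """Fetch all feature groups in one pipelined round trip, merge with the
    request context, and fill missing values with model-registered defaults."""
    pipe = r.pipeline()
    for group in feature_groups:                       # e.g. "user", "item", "context"
        entity_id = entity_ids[group]
        pipe.hgetall(f"features:{group}:{entity_id}")  # key layout from Database Design
    results = await pipe.execute()                     # single network round trip

    features = dict(request_context)                   # caller-provided payload features
    for group_values in results:
        features.update(group_values)

    # Replace missing features with defaults instead of failing the request.
    for name, default in defaults.items():
        features.setdefault(name, default)
    return features
```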
Model Management & Routing
An in-process model registry maps model names and versions to loaded model objects. On startup, the service loads all active model versions from the external model registry (MLflow/S3) into memory. Hot model swapping: when a new version is registered, the service receives a webhook notification, loads the new version, and begins routing new requests to it while completing in-flight requests on the old version. The old version is unloaded after a 30-second drain period. No downtime, no pod restarts.
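A sketch of the in-process registry and hot swap, assuming a `load_model(name, version)` helper that pulls the artifact from MLflow/S3 (the helper and the drain mechanics are illustrative):

```python
import threading

class ModelRegistry:
    """Maps model name -> (version, model object). Swaps are atomic under a
    lock; the old version stays referenced for a drain window so in-flight
    requests that already resolved it can finish."""

    DRAIN_SECONDS = 30

    def __init__(self, load_model):
        self._load_model = load_model      # fn(name, version) -> model object
        self._models = {}                  # name -> (version, model)
        self._draining = {}                # name -> old (version, model)
        self._lock = threading.Lock()

    def get(self, name):
        with self._lock:
            return self._models[name]      # (version, model)

    def hot_swap(self, name, new_version):
        new_model = self._load_model(name, new_version)   # load before routing to it
        with self._lock:
            old = self._models.get(name)
            self._models[name] = (new_version, new_model)
            if old is not None:
                self._draining[name] = old
        if old is not None:
            # After the drain window, drop the last reference to the old version.
            threading.Timer(self.DRAIN_SECONDS, self._unload, args=(name, old)).start()

    def _unload(self, name, old):
        with self._lock:
            if self._draining.get(name) is old:
                del self._draining[name]
```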
Fallback & Degraded Mode
Each model has a fallback configuration: either a simpler version of the same model (e.g., a logistic regression model that requires only request-context features, no feature store lookup) or a hard-coded rule (return the class-prior probability). When the feature store is unavailable (Redis cluster failure), the service falls back to request-context-only features if a partial model is available, or returns the fallback prediction with an X-Prediction-Degraded: true response header. SLA monitoring tracks the degraded-mode invocation rate as a key reliability metric.
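A sketch of the fallback path, assuming the feature assembler raises a `FeatureStoreUnavailable` error on Redis failure (the exception and the fallback-config fields are illustrative):

```python
class FeatureStoreUnavailable(Exception):
    """Raised by the feature assembler when Redis is unreachable (illustrative)."""

async def predict_with_fallback(model, fallback, request_context, assemble_features):
    """Full path first; on feature-store failure, fall back to a context-only
    model or the hard-coded class prior, and mark the response as degraded."""
    try:
        features = await assemble_features(request_context)
        return {"prediction": model.predict(features), "degraded": False}
    except FeatureStoreUnavailable:
        if fallback.partial_model is not None:
            # Simpler model trained on request-context features only.
            prediction = fallback.partial_model.predict(request_context)
        else:
            prediction = fallback.class_prior          # hard-coded rule
        # The HTTP layer turns degraded=True into the X-Prediction-Degraded: true
        # header and increments the degraded-mode invocation counter.
        return {"prediction": prediction, "degraded": True}
```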
Database Design
- Model catalog (PostgreSQL): model_deployments (deployment_id, model_name, model_version, serving_endpoint, traffic_pct, feature_groups JSON, fallback_config JSON, is_shadow, deployed_at, is_active).
- Feature cache (Redis): features:{entity_type}:{entity_id} → hash of feature_name → feature_value.
- Prediction log (Kafka → ClickHouse): (request_id UUID, model_name, model_version, entity_id, features_hash, prediction FLOAT, label_class INT, latency_ms INT, is_degraded BOOL, ts TIMESTAMP).
API Design
POST /predict/{model_name} — Synchronous inference; request body contains entity IDs and optional feature overrides; returns prediction and model version.
POST /predict/{model_name}/async — Asynchronous inference; returns a callback token; result delivered via webhook or polled via GET.
GET /predict/{model_name}/health — Return model health: loaded version, feature store connectivity, last inference latency.
POST /models/{model_name}/shadow — Register a shadow model version to run alongside production without affecting responses.
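An illustrative client call against the synchronous POST /predict/{model_name} endpoint above; the host, field names, and response shape are assumptions, not a fixed contract:

```python
import requests  # illustrative client call

resp = requests.post(
    "https://inference.example.com/predict/churn-model",
    json={
        "entity_ids": {"user": "u_12345", "item": "sku_98765"},
        "feature_overrides": {"session_length_s": 412},   # optional payload features
    },
    timeout=0.2,   # client-side timeout a little above the 100ms P99 budget
)
resp.raise_for_status()
print(resp.json())
# e.g. {"prediction": 0.83, "model_version": "v42", "degraded": false,
#       "latency_ms": 27, "request_id": "..."}
```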
Scaling & Bottlenecks
CPU-bound inference (XGBoost, logistic regression) scales horizontally by adding inference service replicas. Kubernetes HPA scales based on CPU utilization (target 70%). For GPU-bound models, a separate GPU autoscaling group scales based on GPU utilization and request queue depth (using KEDA Kafka lag trigger). GPU pods have longer startup times (30–90 seconds for model loading); a warm pool of pre-initialized GPU pods reduces scale-out time to under 10 seconds by keeping 2 idle pods always available.
Feature fetch latency is the most common SLA violator. A Redis network round trip in the same AZ is 0.5–1ms; batching all feature group fetches into a single Redis pipeline cuts the round-trip overhead from O(groups) to O(1). For entities with very hot feature access patterns (the top 1% of users), a local in-process LRU cache (Caffeine, capacity 10,000 entities, TTL 5 seconds) eliminates Redis round trips entirely for those entities.
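A sketch of the hot-entity cache in front of Redis; the text names Caffeine for Java, so this uses cachetools' TTLCache as a Python stand-in, with the capacity and TTL figures above:

```python
from cachetools import TTLCache  # Python stand-in for Caffeine

# Capacity and TTL match the figures above: 10,000 entities, 5-second TTL.
hot_entity_cache = TTLCache(maxsize=10_000, ttl=5)

async def fetch_feature_group(redis_client, entity_type: str, entity_id: str) -> dict:
    """Serve hot entities from the in-process cache; fall through to a Redis
    hash read (one pipelined round trip in the real assembler)."""
    key = f"features:{entity_type}:{entity_id}"
    cached = hot_entity_cache.get(key)
    if cached is not None:
        return cached                       # no network round trip at all
    values = await redis_client.hgetall(key)
    hot_entity_cache[key] = values          # staleness bounded by the 5s TTL
    return values
```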
Key Trade-offs
- Online feature fetching vs. pre-joined request payload: Having the caller provide all features eliminates feature store dependency from the critical path but shifts complexity to the caller and risks training-serving skew; server-side feature fetching ensures consistency and simplifies the client contract.
- Synchronous vs. asynchronous inference: Synchronous inference is simpler and fits request-response APIs; async inference decouples the client from model latency, which suits long-running models or flows that wait on several models before combining their results.
- One service per model vs. multi-model service: One service per model provides complete isolation and independent scaling but increases operational overhead; a multi-model service reduces resource overhead through bin-packing but risks noisy neighbor effects.
- Eager loading vs. lazy loading of model versions: Eager loading all versions at startup adds initialization time and memory; lazy loading on first request adds latency for the first user of a model version.