System Design: Real-Time ML Inference Service
Design a production-grade real-time ML inference service that serves predictions from multiple model types with low latency, high availability, and autoscaling. Covers the serving stack, feature assembly, SLA enforcement, and observability for online ML models.
Requirements
Functional Requirements:
- Serve predictions from multiple model types: tabular (XGBoost), neural networks (TensorFlow/PyTorch), and ensemble models
- Fetch online features from the feature store and assemble them with request context for inference
- Support synchronous (request-response) and asynchronous (fire-and-forget with callback) inference modes
- Log every prediction with input features, output prediction, model version, and latency for monitoring
- Enable shadow mode: run a new model version on live traffic without returning its predictions to clients
- Handle model-specific preprocessing and postprocessing as part of the inference pipeline
Non-Functional Requirements:
- P99 latency under 100ms including feature fetch, inference, and postprocessing
- Support 100,000 inference requests per second across all deployed models
- 99.99% availability; degraded-mode fallback to a simpler rule-based prediction when the model is unavailable
- Auto-scale inference capacity within 2 minutes of a traffic spike
- Feature fetch must not add more than 10ms to inference latency
Scale Estimation
100,000 requests/second at a 100ms latency budget means up to roughly 10,000 concurrent in-flight requests (Little's law). Each request fetches 50 features from Redis (5ms), runs inference (10ms for XGBoost on CPU, 20ms for a neural network on GPU), postprocesses (1ms), and logs to Kafka (asynchronously, so effectively 0ms on the critical path). At 100,000 RPS, Redis must sustain 5 million feature reads/second (50 features per request). A Redis Cluster with 10 nodes handles this comfortably.
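A quick back-of-envelope check of those figures (a sketch; every input is an estimate from this section, not a measured value):

```python
# Back-of-envelope sizing using the estimates above.

rps = 100_000                 # target request rate
latency_budget_s = 0.100      # P99 budget per request

# Little's law: in-flight requests = arrival rate x time in system
in_flight = rps * latency_budget_s                  # -> 10,000 concurrent requests

features_per_request = 50
redis_reads_per_s = rps * features_per_request      # -> 5,000,000 reads/s

redis_nodes = 10
reads_per_node = redis_reads_per_s / redis_nodes    # -> 500,000 reads/s per node

print(f"in-flight requests: {in_flight:,.0f}")
print(f"Redis reads/s:      {redis_reads_per_s:,.0f} ({reads_per_node:,.0f} per node)")
```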
High-Level Architecture
The inference service is organized as a chain of handlers: Auth/Rate Limiter → Feature Assembler → Inference Engine → Postprocessor → Response Logger. This chain runs within a single request-handling thread pool. The Feature Assembler fetches online features from the feature store and merges them with request context (payload features passed by the caller). The Inference Engine routes to the appropriate model worker based on the model ID in the request.
Model workers are isolated processes (or containers) that host a single model type. XGBoost models run in Python worker processes via ONNX Runtime; TensorFlow models run in TF Serving; PyTorch models run in TorchServe. An inference orchestrator process dispatches requests to model workers over a local UNIX socket or gRPC, batching requests together when they arrive within a 1ms window. Shadow mode is implemented by forking the feature vector to both the production model and the shadow model, discarding the shadow result but logging it to a separate Kafka topic for offline analysis.
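A minimal sketch of the 1ms batching window in the orchestrator, assuming an asyncio dispatcher and a caller-supplied `predict_batch` coroutine (both names are illustrative; the real dispatch would cross a UNIX socket or gRPC boundary to the worker process):

```python
import asyncio

class MicroBatcher:
    """Collects requests arriving within a short window and dispatches them
    to the model worker as one batch (sketch, not production code)."""

    def __init__(self, predict_batch, window_ms: float = 1.0, max_batch: int = 64):
        self.predict_batch = predict_batch   # async fn: list of feature vectors -> list of predictions
        self.window = window_ms / 1000.0
        self.max_batch = max_batch
        self.queue: asyncio.Queue = asyncio.Queue()

    async def predict(self, features):
        # Each caller enqueues its feature vector with a future for its result.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((features, fut))
        return await fut

    async def run(self):
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]            # block for the first request
            deadline = loop.time() + self.window
            while len(batch) < self.max_batch:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            predictions = await self.predict_batch([f for f, _ in batch])
            for (_, fut), pred in zip(batch, predictions):
                fut.set_result(pred)
```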
A prediction log Kafka topic captures every inference event: (request_id, user_id, model_id, model_version, feature_vector_hash, prediction, confidence, latency_ms, timestamp). A Flink job aggregates these logs to compute: per-model request rate, average latency, error rate, and prediction distribution drift — all updated in real time and displayed on the ML operations dashboard.
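Keeping the prediction log off the critical path is what makes the "0ms" figure in the scale estimate hold. A sketch using confluent-kafka's non-blocking produce (topic name and broker address are illustrative; the event fields follow the schema above):

```python
import json
import time
import uuid

from confluent_kafka import Producer  # assumes confluent-kafka is installed

producer = Producer({"bootstrap.servers": "kafka:9092"})

def log_prediction(model_id, model_version, user_id, feature_vector_hash,
                   prediction, confidence, latency_ms):
    """Fire-and-forget produce; delivery happens in the background, so the
    request thread never waits on the broker."""
    event = {
        "request_id": str(uuid.uuid4()),
        "user_id": user_id,
        "model_id": model_id,
        "model_version": model_version,
        "feature_vector_hash": feature_vector_hash,
        "prediction": prediction,
        "confidence": confidence,
        "latency_ms": latency_ms,
        "timestamp": time.time(),
    }
    producer.produce("prediction-log", value=json.dumps(event).encode())
    producer.poll(0)   # serve delivery callbacks without blocking
```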
Core Components
Feature Assembler
The feature assembler fetches feature groups in parallel using async I/O (aioredis for Python, Lettuce for Java). Features are organized into groups by entity type (user features, item features, context features). The assembler pre-fetches common feature groups in the connection warm-up phase and caches them in a request-scoped context. Feature validation (check for null values, value range violations) runs after fetching; missing features are replaced with model-registered default values rather than failing the request.
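A sketch of the parallel feature-group fetch using redis-py's asyncio client (the successor to aioredis), with the key layout described under Database Design; the defaults handling mirrors the "replace, don't fail" rule above, and the function names are illustrative:

```python
import redis.asyncio as aioredis  # redis-py's asyncio client (successor to aioredis)

r = aioredis.Redis(host="feature-redis", decode_responses=True)

async def assemble_features(entity_ids: dict, feature_groups: list,
                            defaults: dict, request_context: dict) -> dict:
    """Fetch all feature groups in one pipelined round trip, merge with the
    request context, and fill missing values with model-registered defaults."""
    pipe = r.pipeline()
    for group in feature_groups:                       # e.g. "user", "item", "context"
        entity_id = entity_ids[group]
        pipe.hgetall(f"features:{group}:{entity_id}")  # key layout from Database Design
    results = await pipe.execute()                     # single network round trip

    features = dict(request_context)                   # caller-provided payload features
    for group_values in results:
        features.update(group_values)

    # Replace missing features with defaults instead of failing the request.
    for name, default in defaults.items():
        features.setdefault(name, default)
    return features
```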
Model Management & Routing
An in-process model registry maps model names and versions to loaded model objects. On startup, the service loads all active model versions from the external model registry (MLflow/S3) into memory. Hot model swapping: when a new version is registered, the service receives a webhook notification, loads the new version, and begins routing new requests to it while completing in-flight requests on the old version. The old version is unloaded after a 30-second drain period. No downtime, no pod restarts.
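A sketch of the in-process registry and hot swap, assuming a `load_model(name, version)` helper that pulls the artifact from MLflow/S3 (the helper and the drain mechanics are illustrative):

```python
import threading

class ModelRegistry:
    """Maps model name -> (version, model object). Swaps are atomic under a
    lock; the old version stays referenced for a drain window so in-flight
    requests that already resolved it can finish."""

    DRAIN_SECONDS = 30

    def __init__(self, load_model):
        self._load_model = load_model      # fn(name, version) -> model object
        self._models = {}                  # name -> (version, model)
        self._draining = {}                # name -> old (version, model)
        self._lock = threading.Lock()

    def get(self, name):
        with self._lock:
            return self._models[name]      # (version, model)

    def hot_swap(self, name, new_version):
        new_model = self._load_model(name, new_version)   # load before routing to it
        with self._lock:
            old = self._models.get(name)
            self._models[name] = (new_version, new_model)
            if old is not None:
                self._draining[name] = old
        if old is not None:
            # After the drain window, drop the last reference to the old version.
            threading.Timer(self.DRAIN_SECONDS, self._unload, args=(name, old)).start()

    def _unload(self, name, old):
        with self._lock:
            if self._draining.get(name) is old:
                del self._draining[name]
```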
Fallback & Degraded Mode
Each model has a fallback configuration: either a simpler version of the same model (e.g., a logistic regression model that requires only request-context features, no feature store lookup) or a hard-coded rule (return the class-prior probability). When the feature store is unavailable (Redis cluster failure), the service falls back to request-context-only features if a partial model is available, or returns the fallback prediction with an X-Prediction-Degraded: true response header. SLA monitoring tracks the degraded-mode invocation rate as a key reliability metric.
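A sketch of the fallback path, assuming the feature assembler raises a `FeatureStoreUnavailable` error on Redis failure (the exception and the fallback-config fields are illustrative):

```python
class FeatureStoreUnavailable(Exception):
    """Raised by the feature assembler when Redis is unreachable (illustrative)."""

async def predict_with_fallback(model, fallback, request_context, assemble_features):
    """Full path first; on feature-store failure, fall back to a context-only
    model or the hard-coded class prior, and mark the response as degraded."""
    try:
        features = await assemble_features(request_context)
        return {"prediction": model.predict(features), "degraded": False}
    except FeatureStoreUnavailable:
        if fallback.partial_model is not None:
            # Simpler model trained on request-context features only.
            prediction = fallback.partial_model.predict(request_context)
        else:
            prediction = fallback.class_prior          # hard-coded rule
        # The HTTP layer turns degraded=True into the X-Prediction-Degraded: true
        # header and increments the degraded-mode invocation counter.
        return {"prediction": prediction, "degraded": True}
```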
Database Design
- Model catalog (PostgreSQL): model_deployments (deployment_id, model_name, model_version, serving_endpoint, traffic_pct, feature_groups JSON, fallback_config JSON, is_shadow, deployed_at, is_active).
- Feature cache (Redis): features:{entity_type}:{entity_id} → hash of feature_name → feature_value.
- Prediction log (Kafka → ClickHouse): (request_id UUID, model_name, model_version, entity_id, features_hash, prediction FLOAT, label_class INT, latency_ms INT, is_degraded BOOL, ts TIMESTAMP).
API Design
POST /predict/{model_name} — Synchronous inference; request body contains entity IDs and optional feature overrides; returns prediction and model version.
POST /predict/{model_name}/async — Asynchronous inference; returns a callback token; result delivered via webhook or polled via GET.
GET /predict/{model_name}/health — Return model health: loaded version, feature store connectivity, last inference latency.
POST /models/{model_name}/shadow — Register a shadow model version to run alongside production without affecting responses.
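An illustrative client call against the synchronous POST /predict/{model_name} endpoint above; the host, field names, and response shape are assumptions, not a fixed contract:

```python
import requests  # illustrative client call

resp = requests.post(
    "https://inference.example.com/predict/churn-model",
    json={
        "entity_ids": {"user": "u_12345", "item": "sku_98765"},
        "feature_overrides": {"session_length_s": 412},   # optional payload features
    },
    timeout=0.2,   # client-side timeout a little above the 100ms P99 budget
)
resp.raise_for_status()
print(resp.json())
# e.g. {"prediction": 0.83, "model_version": "v42", "degraded": false,
#       "latency_ms": 27, "request_id": "..."}
```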
Scaling & Bottlenecks
CPU-bound inference (XGBoost, logistic regression) scales horizontally by adding inference service replicas. Kubernetes HPA scales based on CPU utilization (target 70%). For GPU-bound models, a separate GPU autoscaling group scales based on GPU utilization and request queue depth (using KEDA Kafka lag trigger). GPU pods have longer startup times (30–90 seconds for model loading); a warm pool of pre-initialized GPU pods reduces scale-out time to under 10 seconds by keeping 2 idle pods always available.
Feature fetch latency is the most common SLA violator. A Redis network round trip in the same AZ is 0.5–1ms; batching all feature group fetches into a single Redis pipeline cuts the round-trip overhead from O(groups) to O(1). For entities with very hot feature access patterns (the top 1% of users), a local in-process LRU cache (Caffeine, capacity 10,000 entities, TTL 5 seconds) eliminates Redis round trips entirely for those entities.
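A sketch of the hot-entity cache in front of Redis; the text names Caffeine for Java, so this uses cachetools' TTLCache as a Python stand-in, with the capacity and TTL figures above:

```python
from cachetools import TTLCache  # Python stand-in for Caffeine

# Capacity and TTL match the figures above: 10,000 entities, 5-second TTL.
hot_entity_cache = TTLCache(maxsize=10_000, ttl=5)

async def fetch_feature_group(redis_client, entity_type: str, entity_id: str) -> dict:
    """Serve hot entities from the in-process cache; fall through to a Redis
    hash read (one pipelined round trip in the real assembler)."""
    key = f"features:{entity_type}:{entity_id}"
    cached = hot_entity_cache.get(key)
    if cached is not None:
        return cached                       # no network round trip at all
    values = await redis_client.hgetall(key)
    hot_entity_cache[key] = values          # staleness bounded by the 5s TTL
    return values
```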
Key Trade-offs
- Online feature fetching vs. pre-joined request payload: Having the caller provide all features eliminates feature store dependency from the critical path but shifts complexity to the caller and risks training-serving skew; server-side feature fetching ensures consistency and simplifies the client contract.
- Synchronous vs. asynchronous inference: Synchronous inference is simpler and fits request-response APIs; async inference decouples the client from model latency, which suits long-running models or flows that wait on several models before combining their results.
- One service per model vs. multi-model service: One service per model provides complete isolation and independent scaling but increases operational overhead; a multi-model service reduces resource overhead through bin-packing but risks noisy neighbor effects.
- Eager loading vs. lazy loading of model versions: Eager loading all versions at startup adds initialization time and memory; lazy loading on first request adds latency for the first user of a model version.