
System Design: LLM Serving Infrastructure

Design a production-grade LLM serving infrastructure capable of handling thousands of concurrent text generation requests with optimized token throughput, KV cache management, and cost-efficient GPU utilization. Covers continuous batching, quantization, and multi-model routing.

16 min read · Updated Jan 15, 2025
Tags: system-design, llm, inference, gpu-serving, generative-ai

Requirements

Functional Requirements:

  • Serve multiple LLM models (7B to 70B parameters) with streaming token generation
  • Support multi-turn conversations with context window management up to 128K tokens
  • Route requests to appropriate model sizes based on query complexity and cost constraints
  • Handle function calling / tool use: parse structured outputs and invoke tool APIs
  • Support parameter-efficient fine-tuned model variants (LoRA adapters) without full model reload
  • Implement per-user rate limiting and priority queuing for enterprise vs. free-tier users

Non-Functional Requirements:

  • Time to first token (TTFT) under 500ms at the 95th percentile
  • Throughput of 1,000 output tokens/second per A100 GPU
  • Support 10,000 concurrent streaming connections
  • 99.5% availability; graceful degradation to smaller models during GPU saturation
  • Cost per 1M output tokens optimized to within 2x of theoretical GPU utilization maximum

Scale Estimation

At 1,000 concurrent generation streams each producing 200 tokens/second, aggregate throughput is 200,000 tokens/second. A 70B model in FP16 requires tensor parallelism across 4 A100s and delivers roughly 1,000 tokens/second per GPU; a 7B model fits on a single A100 and delivers roughly 4,000 tokens/second. Serving 200,000 tokens/second with a traffic mix of 70B (40%) and 7B (60%): 200,000 * 0.4 / 1,000 = 80 A100 GPUs for the 70B fleet and 200,000 * 0.6 / 4,000 = 30 A100s for the 7B fleet. Total: 110 A100 GPUs.
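The same arithmetic as a short script; all figures are the estimates from this section, not measurements:

```python
# Back-of-the-envelope GPU count, using the per-GPU throughput estimates above.
TOTAL_TOKENS_PER_SEC = 1_000 * 200           # 1,000 concurrent streams * 200 tok/s each

SHARE_70B, SHARE_7B = 0.4, 0.6               # traffic mix by output tokens
TOKS_PER_GPU_70B = 1_000                     # 70B in FP16, tensor-parallel over 4 GPUs
TOKS_PER_GPU_7B = 4_000                      # 7B on a single A100

gpus_70b = TOTAL_TOKENS_PER_SEC * SHARE_70B / TOKS_PER_GPU_70B   # 80
gpus_7b = TOTAL_TOKENS_PER_SEC * SHARE_7B / TOKS_PER_GPU_7B      # 30

print(f"70B: {gpus_70b:.0f} GPUs, 7B: {gpus_7b:.0f} GPUs, total: {gpus_70b + gpus_7b:.0f}")
```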

High-Level Architecture

The LLM serving infrastructure has three layers: the API Gateway, the Request Router, and the Inference Backend. The API Gateway handles authentication, rate limiting, request validation, and streaming response delivery. The Request Router classifies requests by model, priority, and resource requirements, then dispatches to the appropriate inference cluster. The Inference Backend runs the LLM forward pass using a high-throughput inference engine (vLLM, TensorRT-LLM, or TGI).

vLLM's PagedAttention is the key innovation enabling high GPU utilization: instead of pre-allocating maximum KV cache per request (which wastes memory), PagedAttention dynamically allocates KV cache memory in 16-token pages as tokens are generated, allowing 3x more concurrent requests per GPU. Continuous batching (also called iteration-level scheduling) processes new requests alongside ongoing generation requests at every decoding step, eliminating the GPU idle time caused by waiting for all batch members to finish.
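A minimal sketch of the page-table bookkeeping behind paged KV cache allocation; the class and method names are illustrative, not vLLM's actual internals, though the 16-token page size matches vLLM's default block size:

```python
PAGE_SIZE = 16  # tokens per KV cache page (vLLM's default block size)

class PagedKVAllocator:
    """Toy page-table bookkeeping: GPU KV memory is handed out page by page
    as sequences grow, instead of reserving max_tokens worth up front."""

    def __init__(self, num_pages: int):
        self.free_pages = list(range(num_pages))
        self.page_table = {}  # seq_id -> list of physical page ids

    def on_token(self, seq_id: str, new_len: int) -> None:
        """Called when a sequence grows to new_len tokens; allocate a page
        only when the sequence spills onto a new page."""
        pages = self.page_table.setdefault(seq_id, [])
        pages_needed = (new_len + PAGE_SIZE - 1) // PAGE_SIZE
        while len(pages) < pages_needed:
            if not self.free_pages:
                raise MemoryError("KV cache exhausted; preempt or queue the request")
            pages.append(self.free_pages.pop())

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's pages to the free pool."""
        self.free_pages.extend(self.page_table.pop(seq_id, []))
```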

Context management is critical for multi-turn conversations. Each conversation's KV cache is held in GPU memory during generation. When a conversation's context exceeds the model's context window, a context compression step (summarization or a sliding window) reduces it to fit. For idle conversations (no user message in more than 30 seconds), the KV cache is offloaded from GPU to CPU memory, freeing GPU memory for active requests at the cost of roughly 50ms of reload latency when the conversation resumes.
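A hedged sketch of that idle-session policy; the 30-second threshold comes from the text, while the class and method names are hypothetical:

```python
import time

IDLE_THRESHOLD_S = 30  # offload KV caches for sessions idle longer than this

class KVCacheResidency:
    def __init__(self):
        self.gpu_sessions = {}   # session_id -> last-active timestamp
        self.cpu_sessions = set()  # sessions whose cache was offloaded to host memory

    def touch(self, session_id: str) -> None:
        """Mark a session active; bring its cache back if it was offloaded."""
        if session_id in self.cpu_sessions:
            self.cpu_sessions.discard(session_id)  # ~50 ms reload in the real system
        self.gpu_sessions[session_id] = time.monotonic()

    def evict_idle(self) -> None:
        """Move KV caches of idle sessions from GPU to CPU memory."""
        now = time.monotonic()
        for sid, last_active in list(self.gpu_sessions.items()):
            if now - last_active > IDLE_THRESHOLD_S:
                del self.gpu_sessions[sid]
                self.cpu_sessions.add(sid)
```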

Core Components

vLLM Inference Engine

vLLM implements PagedAttention, continuous batching, and tensor parallelism. For 70B models (too large for one A100 80GB GPU), tensor parallelism splits attention heads and FFN weights across 4 or 8 GPUs, communicating via NVLink all-reduce. The scheduler processes decode steps across all in-flight sequences in a single forward pass; newly arrived prompts are inserted into the batch at the next decode step without waiting for current sequences to complete. Throughput is 2–4x higher than naive batching implementations.
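The iteration-level scheduling idea in a few lines; this is a simplified sketch with an assumed `engine.decode_step` interface, not vLLM's scheduler:

```python
from collections import deque

def serve_loop(engine, waiting: deque, max_batch: int):
    """Continuous batching: admit new prompts at every decode step instead of
    waiting for the current batch to drain."""
    running = []
    while True:
        # Admit newly arrived requests while there is batch (and KV cache) headroom.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())

        if not running:
            continue  # a real loop would block until new requests arrive

        # One forward pass advances every in-flight sequence by one token;
        # decode_step is an assumed engine API returning the finished sequences.
        finished = engine.decode_step(running)

        # Retire sequences that emitted EOS or hit max_tokens; keep the rest.
        running = [seq for seq in running if seq not in finished]
```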

LoRA Adapter Management

LoRA adapters for fine-tuned model variants (customer-specific, domain-specific) are small (50–200 MB vs. the 140 GB base 70B model). The serving engine maintains a pool of loaded LoRA adapters in GPU memory (up to 20 adapters simultaneously). When a request specifies a LoRA adapter, the serving engine applies the low-rank weight delta on-the-fly during inference with negligible overhead (<5% latency increase). New adapters are loaded asynchronously from S3 while existing requests are served; adapter eviction follows an LRU policy.
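What "applying the low-rank weight delta on the fly" means numerically, as a minimal sketch using the standard LoRA formulation (W x plus a scaled B(Ax) term); shapes and names here are illustrative:

```python
import numpy as np

def lora_forward(x, W, A, B, scaling=1.0):
    """Apply a LoRA adapter without materializing the merged weight.

    x: (d_in,) activation, W: (d_out, d_in) frozen base weight,
    A: (r, d_in), B: (d_out, r) low-rank adapter with r << d_in.
    """
    base = W @ x                     # frozen base-model path
    delta = B @ (A @ x) * scaling    # low-rank update, cheap because r is small
    return base + delta
```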

Request Router & Model Selection

A request classifier scores each incoming request on complexity heuristics: token count, task type (code generation → large model, simple QA → small model), and user tier (enterprise → large model priority). The router maintains a health and utilization view of all inference servers and dispatches to the least-loaded server hosting the appropriate model. If all servers for a requested model are saturated (queue depth > 10), the router either queues the request (for priority users) or routes it to a smaller model (for best-effort users) with a response header indicating the actual model used.
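A hedged sketch of this routing policy; the queue-depth threshold of 10 and the tier behavior follow the description above, while the specific heuristics (token-count cutoff, task labels, model names) are illustrative:

```python
SATURATION_QUEUE_DEPTH = 10

def select_model(prompt_tokens: int, task_type: str, tier: str) -> str:
    """Heuristic complexity scoring: code generation, long prompts, and
    enterprise traffic go to the large model; everything else to the small one."""
    if task_type == "code_generation" or prompt_tokens > 8_000 or tier == "enterprise":
        return "70b"
    return "7b"

def route(request, servers):
    """Pick the least-loaded healthy server for the chosen model; on saturation,
    queue priority traffic and degrade best-effort traffic to the 7B model."""
    model = select_model(request.prompt_tokens, request.task_type, request.tier)
    candidates = [s for s in servers if s.model == model and s.healthy]
    target = min(candidates, key=lambda s: s.queue_depth, default=None)

    if target is None or target.queue_depth > SATURATION_QUEUE_DEPTH:
        if request.tier == "enterprise":
            return ("queue", model, None)      # priority users wait for the big model
        model = "7b"                           # best-effort users degrade to the small model
        fallbacks = [s for s in servers if s.model == model and s.healthy]
        target = min(fallbacks, key=lambda s: s.queue_depth, default=None)

    return ("dispatch", model, target)         # response header reports the actual model used
```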

Database Design

Request metadata is logged to Kafka and archived to ClickHouse: (request_id, user_id, model_version, adapter_id, prompt_tokens, completion_tokens, ttft_ms, total_latency_ms, finish_reason, timestamp). This powers cost dashboards, per-user usage tracking, and model performance monitoring. Conversation context (for multi-turn sessions) is stored in Redis with the key ctx:{session_id} → compressed conversation history JSON, with 1-hour TTL. Active KV cache pointers (which GPU memory pages are allocated to which session) are maintained in-memory by vLLM's scheduler.
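A minimal sketch of the Redis side using redis-py; the ctx:{session_id} key format and 1-hour TTL come from the text, while zlib compression and the localhost connection are assumptions:

```python
import json
import zlib

import redis  # third-party: pip install redis

r = redis.Redis()  # assumes a local Redis instance

def save_context(session_id: str, history: list) -> None:
    """Store compressed multi-turn history under ctx:{session_id} with a 1-hour TTL."""
    payload = zlib.compress(json.dumps(history).encode())
    r.set(f"ctx:{session_id}", payload, ex=3600)

def load_context(session_id: str) -> list:
    """Return the conversation history, or an empty list if the TTL expired."""
    payload = r.get(f"ctx:{session_id}")
    return json.loads(zlib.decompress(payload)) if payload else []
```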

API Design

  • POST /v1/chat/completions — OpenAI-compatible chat completion endpoint; supports stream=true for server-sent event streaming.
  • POST /v1/completions — Single-turn text completion with temperature, top_p, and max_tokens parameters.
  • GET /v1/models — List available models and adapters with context window sizes and rate limit tiers.
  • POST /v1/tokenize — Tokenize text and return the token count without running inference (for cost estimation).
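An example client for the streaming chat endpoint, assuming OpenAI-style server-sent events ("data: {json}" lines terminated by "data: [DONE]"); the host and model name are placeholders:

```python
import json

import requests  # third-party: pip install requests

resp = requests.post(
    "https://llm.example.com/v1/chat/completions",
    json={
        "model": "llama-70b",
        "messages": [{"role": "user", "content": "Explain KV caching briefly."}],
        "stream": True,
    },
    stream=True,
)

# Each SSE line carries one chunk containing a token delta.
for line in resp.iter_lines():
    if not line or not line.startswith(b"data: "):
        continue
    data = line[len(b"data: "):]
    if data == b"[DONE]":
        break
    chunk = json.loads(data)
    print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)
```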

Scaling & Bottlenecks

KV cache memory exhaustion is the primary limiting factor: a 128K-context request for a 70B model consumes ~40 GB of KV cache (2 * num_layers * num_kv_heads * 128K * head_dim * 2 bytes). PagedAttention reduces this to actual allocated tokens, but very long contexts can still monopolize GPU memory. A context length quota per user (e.g., max 32K for free tier, 128K for enterprise) prevents long-context requests from evicting all short-context requests from GPU memory.
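The ~40 GB figure reproduced from the formula above, assuming a 70B-class model with 80 layers, 8 KV heads (grouped-query attention), head dimension 128, and FP16 (2 bytes per element):

```python
# KV cache size for one 128K-context sequence on a 70B-class model.
num_layers = 80          # assumed 70B architecture
num_kv_heads = 8         # grouped-query attention
head_dim = 128
context_tokens = 128 * 1024
bytes_per_elem = 2       # FP16

# The leading factor of 2 covers the separate K and V tensors.
kv_bytes = 2 * num_layers * num_kv_heads * context_tokens * head_dim * bytes_per_elem
print(f"{kv_bytes / 2**30:.1f} GiB")   # ~40 GiB
```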

TTFT for long prompts (the prompt processing / prefill phase) scales with prompt length. For a 128K-token prompt, prefill can take 5–30 seconds on a single server. Prompt caching (reusing the KV cache for repeated system prompts shared across users) eliminates prefill for the repeated prefix. Speculative decoding (running a small 1B draft model to propose k tokens per step, verified by the large model in parallel) speeds up decoding by 2–3x for short-response tasks.
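A simplified sketch of the speculative decoding loop (draft, then verify). The accept/reject rule shown here is greedy exact-match rather than the full rejection-sampling scheme, and the model interfaces are hypothetical:

```python
def speculative_decode(draft_model, target_model, prompt, k=4, max_new=256):
    """Draft k tokens with the small model, then verify them with one parallel
    pass of the large model.

    Hypothetical interfaces: draft_model.generate(tokens, k) returns k proposed
    token ids; target_model.verify(tokens, draft) returns the large model's
    greedy choice at each draft position, computed in a single forward pass.
    """
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        draft = draft_model.generate(tokens, k)
        verified = target_model.verify(tokens, draft)

        # Accept draft tokens while they match what the large model would emit.
        accepted = 0
        for proposed, actual in zip(draft, verified):
            if proposed != actual:
                break
            accepted += 1

        tokens += draft[:accepted]
        if accepted < k:
            tokens.append(verified[accepted])  # the large model's correction is a free extra token
    return tokens
```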

Key Trade-offs

  • Model accuracy vs. serving cost: Larger models produce better outputs but at 5–10x the cost; intelligent routing (defaulting to 7B, escalating to 70B only for complex tasks) balances quality and cost.
  • Streaming vs. batch responses: Streaming delivers the first token immediately and improves perceived responsiveness; batch responses achieve higher throughput by maximizing parallelism but introduce latency for the user.
  • KV cache eviction vs. recomputation: Evicting idle KV caches to CPU saves GPU memory but adds 50ms reload latency; recomputing from conversation history adds 1–5 seconds for long contexts.
  • Tensor parallelism vs. pipeline parallelism: Tensor parallelism (split model across GPUs on the same node, high NVLink bandwidth) is preferred for low-latency serving; pipeline parallelism (split model layers across nodes, lower bandwidth requirement) scales to larger model sizes but adds inter-node communication latency.
