LLM Serving Explained: Deploying Language Models at Scale
Learn LLM serving infrastructure — batching strategies, KV cache optimization, quantization, and choosing between self-hosted and API-based deployments.
LLM Serving
LLM serving is the infrastructure and techniques for deploying large language models in production, optimizing for throughput, latency, cost, and reliability at scale.
What It Really Means
Training an LLM gets the headlines, but serving it — running inference on user requests in real time — is where the engineering challenge lives for most teams. A 70-billion-parameter model requires ~140GB of memory in FP16 just to load the weights. Each request needs additional memory for the KV cache. Naive serving of even a single request can take seconds on high-end GPUs.
LLM serving is fundamentally a systems engineering problem. You are optimizing the interaction between a massive neural network, GPU memory, network I/O, and user expectations. The key constraints are:
- Memory: Model weights + KV cache must fit in GPU memory
- Compute: Each token generation requires a full forward pass
- Latency: Users expect time-to-first-token under 500ms and streaming output
- Throughput: Serving hundreds or thousands of concurrent users
- Cost: GPU hours are expensive — $2-8/hour per A100
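The memory and cost constraints above reduce to back-of-envelope arithmetic. A minimal sketch (all figures are illustrative assumptions, not measurements):

```python
# Rough GPU memory budget for one serving replica of a dense decoder-only model.

def memory_budget_gb(n_params_b: float, bytes_per_param: float,
                     kv_gb_per_seq: float, batch_size: int) -> float:
    """Weights plus KV cache for a batch of concurrent sequences, in GB."""
    weights = n_params_b * bytes_per_param       # e.g. 70B * 2 bytes = 140 GB
    kv_cache = kv_gb_per_seq * batch_size        # grows with concurrency
    return weights + kv_cache

# 70B model in FP16, ~2 GB of KV cache per 4K-token sequence, batch of 8:
total = memory_budget_gb(70, 2.0, 2.0, 8)       # 156.0 GB -> two 80 GB A100s
```

Note that the weights are a fixed cost, while the KV cache scales with batch size; this is why KV cache management, not weight storage, usually caps how many users one replica can serve.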
The serving stack has evolved rapidly. Frameworks like vLLM, TensorRT-LLM, and SGLang have made it possible to serve models that would have been impractical just two years ago. Understanding these systems is critical for anyone building production AI applications, whether you self-host or use API providers.
How It Works in Practice
Key Optimization Techniques
Continuous Batching: Instead of waiting for a batch to complete before starting the next one, continuous batching adds new requests to the batch as existing requests finish. This dramatically improves GPU utilization — from ~30% with static batching to ~80% with continuous batching.
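A toy scheduler can illustrate the admission policy (no model involved; `Request` and the single-token "decode step" are invented for illustration):

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    max_new_tokens: int
    generated: int = 0

def run_continuous_batching(queue: deque, max_batch: int) -> list:
    """Toy scheduler: admit waiting requests whenever a slot frees up,
    instead of waiting for the whole batch to drain (static batching)."""
    active, finished = [], []
    while queue or active:
        # Admit new requests up to the batch limit at every step.
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        # One decode step: every active request gains one token.
        for r in active:
            r.generated += 1
        # Retire finished requests immediately, freeing their slots.
        finished += [r for r in active if r.generated >= r.max_new_tokens]
        active = [r for r in active if r.generated < r.max_new_tokens]
    return finished

done = run_continuous_batching(deque([Request(2), Request(1), Request(3)]), max_batch=2)
```

Because short requests exit mid-batch and new ones take their place, the GPU rarely idles waiting for the longest sequence in a batch to finish.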
KV Cache Management: During autoregressive generation, each new token attends to all previous tokens. Without caching, the keys and values for those tokens would be recomputed on every step; the KV (key-value) cache stores them once so they are reused. PagedAttention (from vLLM) manages this cache like virtual memory pages, eliminating fragmentation and enabling larger batch sizes.
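The KV cache size follows directly from the model's shape: two tensors (K and V) per layer, per token. A sketch using Llama-2-7B-like dimensions (32 layers, 32 KV heads, head dimension 128, FP16):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_el: int = 2) -> int:
    """KV cache size: 2 tensors (K and V) per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_el * n_tokens

per_token = kv_cache_bytes(32, 32, 128, 1)      # 524288 B = 0.5 MB per token
full_ctx  = kv_cache_bytes(32, 32, 128, 4096)   # 2 GiB for one 4K-token sequence
```

Half a megabyte per token means a single long-context sequence can consume gigabytes, which is why paging the cache (rather than reserving worst-case contiguous blocks per request) matters so much for batch size.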
Quantization: Reducing model precision from FP16 (16 bits per parameter) to INT8 or INT4 reduces memory by 2-4x with minimal quality loss. A 70B model at INT4 fits on a single 80GB A100 instead of requiring two.
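The memory arithmetic behind the 70B example is simple bits-per-parameter scaling:

```python
def weights_gb(n_params_b: float, bits: int) -> float:
    """Weight memory in GB for a model with n_params_b billion parameters."""
    return n_params_b * bits / 8

sizes = {name: weights_gb(70, bits)
         for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]}
# {'FP16': 140.0, 'INT8': 70.0, 'INT4': 35.0}
# Only INT8 and INT4 fit in a single 80 GB A100, and only INT4
# leaves real headroom for the KV cache on that card.
```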
Speculative Decoding: Use a small, fast "draft" model to generate several candidate tokens, then verify them all with the large model in a single parallel forward pass. When most draft tokens are accepted, this yields a 2-3x generation speedup while (with proper rejection sampling) leaving the large model's output distribution unchanged.
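A simplified greedy-verification sketch (real implementations use rejection sampling over token probabilities; here we compare the draft tokens against the target model's argmax choices, which are passed in as plain lists):

```python
def verify_draft(draft_tokens: list, target_greedy: list) -> list:
    """Accept the longest prefix of draft tokens matching the target model's
    own greedy choices; at the first mismatch, substitute the target's token,
    so every verification step yields at least one correct token."""
    accepted = []
    for d, t in zip(draft_tokens, target_greedy):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)   # target model's correction; stop here
            break
    return accepted

all_ok  = verify_draft([5, 8, 3], [5, 8, 3])   # all 3 drafts accepted
partial = verify_draft([5, 9, 3], [5, 8, 3])   # 1 accepted + 1 correction
```

The speedup comes from the verification pass: scoring k draft tokens costs roughly one large-model forward pass instead of k sequential ones.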
Prefix Caching: When multiple requests share the same system prompt, cache the KV states for that prefix. Subsequent requests skip re-encoding the shared prefix.
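A toy version of the lookup logic (a real engine stores KV tensors keyed by token-block hashes; here the cached value is an opaque placeholder and the API is invented for illustration):

```python
import hashlib

class PrefixCache:
    """Toy prefix cache keyed by a hash of the shared prompt prefix."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(prefix: str) -> str:
        return hashlib.sha256(prefix.encode()).hexdigest()

    def get(self, prefix: str):
        """Return cached KV states for this prefix, or None on a miss."""
        return self._store.get(self._key(prefix))

    def put(self, prefix: str, kv_states) -> None:
        self._store[self._key(prefix)] = kv_states

cache = PrefixCache()
cache.put("You are a helpful assistant.", kv_states="<kv tensor placeholder>")
hit  = cache.get("You are a helpful assistant.")   # reuse: skip the prefill
miss = cache.get("A different system prompt.")     # None: must prefill
```

For chat applications where every request shares a long system prompt, a hit turns the most expensive part of prefill into a lookup.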
Deployment Options
| Option | Latency | Cost | Control | Effort |
|---|---|---|---|---|
| API providers (OpenAI, Anthropic) | 200-800ms TTFT | $$/M tokens | Low | Low |
| Managed serving (AWS Bedrock, GCP Vertex) | 300-1000ms TTFT | $$-$$$ | Medium | Medium |
| Self-hosted (vLLM on GPU VMs) | 100-500ms TTFT | $ (at scale) | High | High |
| Edge deployment (quantized, mobile) | 50-200ms TTFT | Device cost | Full | Very High |
Implementation
Serving with vLLM
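A minimal self-hosting sketch using vLLM's OpenAI-compatible server. The model name and flag values are examples, and this requires a CUDA GPU with enough memory for the chosen model:

```shell
pip install vllm

# Start the server (one GPU; raise --tensor-parallel-size to shard larger models)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90

# Query it with the standard OpenAI chat format, streaming tokens as they arrive
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct",
       "messages": [{"role": "user", "content": "Hello"}],
       "max_tokens": 64, "stream": true}'
```

Because the server speaks the OpenAI API, existing client code can often be pointed at it by changing only the base URL.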
Load Balancing Multiple GPU Workers
Trade-offs
Self-Hosted vs API
Self-host when:
- Volume exceeds ~$10K/month in API costs
- Data privacy requirements prohibit sending data externally
- You need sub-100ms time-to-first-token
- You want full control over model selection and updates
Use APIs when:
- Volume is low to moderate
- You need access to frontier models (GPT-4o, Claude Opus)
- You lack GPU infrastructure expertise
- You want to avoid managing hardware failures and scaling
Advantages of Understanding LLM Serving
- Make informed build-vs-buy decisions
- Optimize costs (often 5-10x savings at scale with self-hosting)
- Debug latency issues in production AI systems
- Design better architectures for multi-agent systems
Disadvantages of Self-Hosting
- GPU procurement and maintenance burden
- Model updates require manual deployment
- Need expertise in CUDA, GPU networking, and ML infrastructure
- On-call burden for GPU failures
Common Misconceptions
- "Bigger GPUs always mean faster inference" — Memory bandwidth, not compute, is typically the bottleneck for LLM inference. An H100 is faster than an A100 partly because of higher memory bandwidth (3.35 TB/s vs 2 TB/s), not just more FLOPS.
- "Quantization always degrades quality" — INT8 quantization typically causes <1% quality degradation on benchmarks. INT4 (GPTQ, AWQ) is model-dependent but often acceptable. The quality-cost trade-off is usually worthwhile.
- "You need the latest GPU for LLM serving" — Older GPUs like the A10G (24GB) serve quantized 7B-13B models effectively. Not every application needs a 70B model; match the GPU to the model size and requirements.
- "Streaming output is just a UX feature" — Streaming cuts perceived latency from seconds to a few hundred milliseconds. Total generation time is unchanged, but the user sees the first token in ~200ms instead of waiting 5 seconds for the complete response.
How This Appears in Interviews
LLM serving is a key topic in AI infrastructure interviews:
- "Design an LLM serving system for 10,000 concurrent users" — discuss batching, KV cache, GPU selection, auto-scaling, and load balancing. See our interview questions on AI infrastructure.
- "How would you reduce LLM inference costs by 5x?" — discuss quantization, smaller models, caching, and token budgeting.
- "Explain the difference between prefill and decode phases" — the prefill processes all input tokens in parallel, while decode generates tokens one at a time. Different optimization strategies apply to each.
Related Concepts
- Token Budgeting — Managing input/output costs
- Transformer Architecture — The model architecture being served
- Multi-Agent Systems — Workloads that stress serving infrastructure
- Fine-Tuning vs RAG — Fine-tuned models need serving infrastructure
- Attention Mechanism — KV cache optimization depends on understanding attention