LLM Serving Explained: Deploying Language Models at Scale

Learn LLM serving infrastructure — batching strategies, KV cache optimization, quantization, and choosing between self-hosted and API-based deployments.

Tags: llm-serving, inference, deployment, gpu, ai-infrastructure

LLM Serving

LLM serving is the infrastructure and techniques for deploying large language models in production, optimizing for throughput, latency, cost, and reliability at scale.

What It Really Means

Training an LLM gets the headlines, but serving it — running inference on user requests in real time — is where the engineering challenge lives for most teams. A 70-billion-parameter model requires ~140GB of memory in FP16 just to load the weights. Each request needs additional memory for the KV cache. Naive serving of even a single request can take seconds on high-end GPUs.
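
The arithmetic is worth internalizing; a quick sketch (weights only, before framework overhead and the KV cache):

```python
# Weight memory = parameter count x bytes per parameter.
params = 70e9                      # 70B parameters
fp16_gb = params * 2 / 1e9         # 2 bytes per parameter in FP16
print(f"~{fp16_gb:.0f} GB of weights")  # ~140 GB: more than one 80GB A100
```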

LLM serving is fundamentally a systems engineering problem. You are optimizing the interaction between a massive neural network, GPU memory, network I/O, and user expectations. The key constraints are:

  • Memory: Model weights + KV cache must fit in GPU memory
  • Compute: Each token generation requires a full forward pass
  • Latency: Users expect time-to-first-token under 500ms and streaming output
  • Throughput: Serving hundreds or thousands of concurrent users
  • Cost: GPU hours are expensive — $2-8/hour per A100

The serving stack has evolved rapidly. Frameworks like vLLM, TensorRT-LLM, and SGLang have made it possible to serve models that would have been impractical just two years ago. Understanding these systems is critical for anyone building production AI applications, whether you self-host or use API providers.

How It Works in Practice

Key Optimization Techniques

Continuous Batching: Instead of waiting for an entire batch to finish before starting the next one, continuous batching admits new requests at each generation step as earlier requests complete and free their slots. This dramatically improves GPU utilization, from ~30% with static batching to ~80% with continuous batching.
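
A toy scheduler loop illustrates the idea (a simplification, not vLLM's actual scheduler): finished sequences free their slots at every step, and queued requests are admitted immediately instead of waiting for the whole batch to drain.

```python
import random
from collections import deque

MAX_BATCH = 4
queue = deque(range(10))   # waiting request IDs
active = {}                # request ID -> tokens left to generate

steps = 0
while queue or active:
    # Admit new requests into any free slots (no waiting for a batch drain).
    while queue and len(active) < MAX_BATCH:
        active[queue.popleft()] = random.randint(3, 8)
    # One generation step: every active request produces one token.
    for req in list(active):
        active[req] -= 1
        if active[req] == 0:   # finished -> slot frees this same step
            del active[req]
    steps += 1

print(f"served 10 requests in {steps} steps at batch size {MAX_BATCH}")
```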

KV Cache Management: During autoregressive generation, each new token attends to all previous tokens. Without a cache, the keys and values for those tokens would be recomputed at every step; the KV (key-value) cache stores them so only the new token's projections are computed. PagedAttention (from vLLM) manages this cache like virtual memory pages, eliminating fragmentation and enabling larger batch sizes.
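
The cache is large: each token stores a key and a value vector per layer. A hedged sketch with Llama-2-70B-style dimensions (80 layers, 8 KV heads under grouped-query attention, head dimension 128, FP16):

```python
layers, kv_heads, head_dim, fp16_bytes = 80, 8, 128, 2
# Per token: 2 tensors (K and V) x layers x kv_heads x head_dim x bytes.
per_token = 2 * layers * kv_heads * head_dim * fp16_bytes
print(f"{per_token / 1024:.0f} KiB per token")                        # 320 KiB
print(f"{per_token * 4096 / 1e9:.1f} GB for one 4096-token request")  # ~1.3 GB
```

At scale, the cache can rival the weights in size, which is why paged management matters.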

Quantization: Reducing model precision from FP16 (16 bits per parameter) to INT8 or INT4 reduces memory by 2-4x with minimal quality loss. A 70B model at INT4 fits on a single 80GB A100 instead of requiring two.
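
In vLLM, running a pre-quantized checkpoint is a constructor argument; the model name below is a placeholder for any AWQ-quantized 70B checkpoint:

```python
from vllm import LLM

# ~35 GB of 4-bit weights instead of ~140 GB in FP16.
llm = LLM(model="some-org/Llama-2-70B-AWQ",  # placeholder checkpoint name
          quantization="awq")
```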

Speculative Decoding: Use a small, fast "draft" model to propose several candidate tokens, then verify them with the large model in a single parallel forward pass. Accepted tokens are kept and rejected ones are regenerated, typically yielding a 2-3x speedup without changing the large model's output distribution.
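
The accept/reject loop can be sketched with toy stand-ins for both models (greedy verification shown; production systems use rejection sampling to preserve the target distribution exactly):

```python
def draft_model(prefix, k=4):
    # Toy stand-in: cheaply propose k candidate next tokens.
    return [(len(prefix) + i) % 7 for i in range(k)]

def target_model(prefix, candidates):
    # Toy stand-in: the large model scores prefix + candidates in ONE
    # forward pass, returning its greedy choice at each position.
    return [(len(prefix) + i) % 7 if i < 3 else 0
            for i in range(len(candidates) + 1)]

tokens = [1, 2, 3]
while len(tokens) < 12:
    cand = draft_model(tokens)
    truth = target_model(tokens, cand)
    n = 0
    while n < len(cand) and cand[n] == truth[n]:
        n += 1                       # accept the longest agreeing prefix
    tokens += cand[:n] + [truth[n]]  # plus one guaranteed target token

print(tokens)
```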

Prefix Caching: When multiple requests share the same system prompt, cache the KV states for that prefix. Subsequent requests skip re-encoding the shared prefix.
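
In vLLM this is a constructor flag; requests that share a prompt prefix then reuse its cached KV blocks automatically (model name is a placeholder):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder
          enable_prefix_caching=True)

system = "You are a support assistant for ExampleCorp. "  # long shared prefix
questions = ["How do I reset my password?", "Where is my invoice?"]
outputs = llm.generate([system + q for q in questions],
                       SamplingParams(max_tokens=64))
```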

Deployment Options

Option | Latency | Cost | Control | Effort
API providers (OpenAI, Anthropic) | 200-800ms TTFT | $$ per M tokens | Low | Low
Managed serving (AWS Bedrock, GCP Vertex) | 300-1000ms TTFT | $$-$$$ | Medium | Medium
Self-hosted (vLLM on GPU VMs) | 100-500ms TTFT | $ (at scale) | High | High
Edge deployment (quantized, mobile) | 50-200ms TTFT | Device cost | Full | Very High

Implementation

Serving with vLLM

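A minimal sketch of offline batch inference with vLLM's Python API; the model name and sampling settings are illustrative:

```python
from vllm import LLM, SamplingParams

# Load once; vLLM applies continuous batching and PagedAttention internally.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

prompts = [
    "Explain continuous batching in one paragraph.",
    "What is a KV cache and why does it matter?",
]

# generate() batches the prompts internally and returns all results.
for output in llm.generate(prompts, params):
    print(output.prompt, "->", output.outputs[0].text)

# For online serving, vLLM also ships an OpenAI-compatible HTTP server;
# from a shell:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
```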

Load Balancing Multiple GPU Workers

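A minimal sketch of application-level routing, assuming each GPU worker runs vLLM's OpenAI-compatible server; the worker URLs and model name are placeholders. Production setups more often put an L7 load balancer or a dedicated router in front, but the routing logic is the same idea:

```python
import threading
import requests

WORKERS = ["http://10.0.0.1:8000", "http://10.0.0.2:8000"]  # placeholder URLs

_lock = threading.Lock()
_in_flight = {w: 0 for w in WORKERS}  # outstanding requests per worker

def pick_worker() -> str:
    # Least-outstanding-requests routing: send work to the least-loaded GPU.
    with _lock:
        worker = min(_in_flight, key=_in_flight.get)
        _in_flight[worker] += 1
        return worker

def release(worker: str) -> None:
    with _lock:
        _in_flight[worker] -= 1

def complete(prompt: str) -> str:
    worker = pick_worker()
    try:
        resp = requests.post(
            f"{worker}/v1/completions",
            json={
                "model": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder
                "prompt": prompt,
                "max_tokens": 128,
            },
            timeout=60,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["text"]
    finally:
        release(worker)

print(complete("Why does batching improve GPU utilization?"))
```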

Trade-offs

Self-Hosted vs API

Self-host when:

  • API spend exceeds roughly $10K/month
  • Data privacy requirements prohibit sending data externally
  • You need sub-100ms time-to-first-token
  • You want full control over model selection and updates

Use APIs when:

  • Volume is low to moderate
  • You need access to frontier models (GPT-4o, Claude Opus)
  • You lack GPU infrastructure expertise
  • You want to avoid managing hardware failures and scaling

Advantages of Understanding LLM Serving

  • Make informed build-vs-buy decisions
  • Optimize costs (often 5-10x savings at scale with self-hosting)
  • Debug latency issues in production AI systems
  • Design better architectures for multi-agent systems

Disadvantages of Self-Hosting

  • GPU procurement and maintenance burden
  • Model updates require manual deployment
  • Need expertise in CUDA, GPU networking, and ML infrastructure
  • On-call burden for GPU failures

Common Misconceptions

  • "Bigger GPUs always mean faster inference" — Memory bandwidth, not compute, is typically the bottleneck for LLM inference. An H100 is faster than an A100 partly because of higher memory bandwidth (3.35 TB/s vs 2 TB/s), not just more FLOPS.

  • "Quantization always degrades quality" — INT8 quantization typically causes <1% quality degradation on benchmarks. INT4 (GPTQ, AWQ) is model-dependent but often acceptable. The quality-cost trade-off is usually worthwhile.

  • "You need the latest GPU for LLM serving" — Older GPUs like A10G (24GB) serve quantized 7B-13B models effectively. Not every application needs a 70B model. Match GPU to model size and requirements.

  • "Streaming output is just a UX feature" — Streaming reduces perceived latency from seconds to milliseconds. The total generation time is the same, but the user sees the first token in ~200ms instead of waiting 5 seconds for the complete response.

How This Appears in Interviews

LLM serving is a key topic in AI infrastructure interviews:

  • "Design an LLM serving system for 10,000 concurrent users" — discuss batching, KV cache, GPU selection, auto-scaling, and load balancing. See our interview questions on AI infrastructure.
  • "How would you reduce LLM inference costs by 5x?" — discuss quantization, smaller models, caching, and token budgeting.
  • "Explain the difference between prefill and decode phases" — the prefill processes all input tokens in parallel, while decode generates tokens one at a time. Different optimization strategies apply to each.
