LLM Serving Explained: Deploying Language Models at Scale

Learn LLM serving infrastructure — batching strategies, KV cache optimization, quantization, and choosing between self-hosted and API-based deployments.

Tags: llm-serving, inference, deployment, gpu, ai-infrastructure

LLM Serving

LLM serving is the infrastructure and techniques for deploying large language models in production, optimizing for throughput, latency, cost, and reliability at scale.

What It Really Means

Training an LLM gets the headlines, but serving it — running inference on user requests in real time — is where the engineering challenge lives for most teams. A 70-billion-parameter model requires ~140GB of memory in FP16 just to load the weights. Each request needs additional memory for the KV cache. Naive serving of even a single request can take seconds on high-end GPUs.
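
The arithmetic is worth internalizing; a quick sketch (weights only, before framework overhead and the KV cache):

```python
# Weight memory = parameter count x bytes per parameter.
params = 70e9                      # 70B parameters
fp16_gb = params * 2 / 1e9         # 2 bytes per parameter in FP16
print(f"~{fp16_gb:.0f} GB of weights")  # ~140 GB: more than one 80GB A100
```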

LLM serving is fundamentally a systems engineering problem. You are optimizing the interaction between a massive neural network, GPU memory, network I/O, and user expectations. The key constraints are:

  • Memory: Model weights + KV cache must fit in GPU memory
  • Compute: Each token generation requires a full forward pass
  • Latency: Users expect time-to-first-token under 500ms and streaming output
  • Throughput: Serving hundreds or thousands of concurrent users
  • Cost: GPU hours are expensive — $2-8/hour per A100

The serving stack has evolved rapidly. Frameworks like vLLM, TensorRT-LLM, and SGLang have made it possible to serve models that would have been impractical just two years ago. Understanding these systems is critical for anyone building production AI applications, whether you self-host or use API providers.

How It Works in Practice

Key Optimization Techniques

Continuous Batching: Instead of waiting for an entire batch to finish before starting the next one, continuous batching admits new requests at each generation step as earlier requests complete and free their slots. This dramatically improves GPU utilization, from ~30% with static batching to ~80% with continuous batching.
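
A toy scheduler loop illustrates the idea (a simplification, not vLLM's actual scheduler): finished sequences free their slots at every step, and queued requests are admitted immediately instead of waiting for the whole batch to drain.

```python
import random
from collections import deque

MAX_BATCH = 4
queue = deque(range(10))   # waiting request IDs
active = {}                # request ID -> tokens left to generate

steps = 0
while queue or active:
    # Admit new requests into any free slots (no waiting for a batch drain).
    while queue and len(active) < MAX_BATCH:
        active[queue.popleft()] = random.randint(3, 8)
    # One generation step: every active request produces one token.
    for req in list(active):
        active[req] -= 1
        if active[req] == 0:   # finished -> slot frees this same step
            del active[req]
    steps += 1

print(f"served 10 requests in {steps} steps at batch size {MAX_BATCH}")
```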

KV Cache Management: During autoregressive generation, each new token attends to all previous tokens. Without a cache, the keys and values for those tokens would be recomputed at every step; the KV (key-value) cache stores them so only the new token's projections are computed. PagedAttention (from vLLM) manages this cache like virtual memory pages, eliminating fragmentation and enabling larger batch sizes.
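
The cache is large: each token stores a key and a value vector per layer. A hedged sketch with Llama-2-70B-style dimensions (80 layers, 8 KV heads under grouped-query attention, head dimension 128, FP16):

```python
layers, kv_heads, head_dim, fp16_bytes = 80, 8, 128, 2
# Per token: 2 tensors (K and V) x layers x kv_heads x head_dim x bytes.
per_token = 2 * layers * kv_heads * head_dim * fp16_bytes
print(f"{per_token / 1024:.0f} KiB per token")                        # 320 KiB
print(f"{per_token * 4096 / 1e9:.1f} GB for one 4096-token request")  # ~1.3 GB
```

At scale, the cache can rival the weights in size, which is why paged management matters.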

Quantization: Reducing model precision from FP16 (16 bits per parameter) to INT8 or INT4 reduces memory by 2-4x with minimal quality loss. A 70B model at INT4 fits on a single 80GB A100 instead of requiring two.
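
In vLLM, running a pre-quantized checkpoint is a constructor argument; the model name below is a placeholder for any AWQ-quantized 70B checkpoint:

```python
from vllm import LLM

# ~35 GB of 4-bit weights instead of ~140 GB in FP16.
llm = LLM(model="some-org/Llama-2-70B-AWQ",  # placeholder checkpoint name
          quantization="awq")
```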

Speculative Decoding: Use a small, fast "draft" model to propose several candidate tokens, then verify them with the large model in a single parallel forward pass. Accepted tokens are kept and rejected ones are regenerated, typically yielding a 2-3x speedup without changing the large model's output distribution.
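
The accept/reject loop can be sketched with toy stand-ins for both models (greedy verification shown; production systems use rejection sampling to preserve the target distribution exactly):

```python
def draft_model(prefix, k=4):
    # Toy stand-in: cheaply propose k candidate next tokens.
    return [(len(prefix) + i) % 7 for i in range(k)]

def target_model(prefix, candidates):
    # Toy stand-in: the large model scores prefix + candidates in ONE
    # forward pass, returning its greedy choice at each position.
    return [(len(prefix) + i) % 7 if i < 3 else 0
            for i in range(len(candidates) + 1)]

tokens = [1, 2, 3]
while len(tokens) < 12:
    cand = draft_model(tokens)
    truth = target_model(tokens, cand)
    n = 0
    while n < len(cand) and cand[n] == truth[n]:
        n += 1                       # accept the longest agreeing prefix
    tokens += cand[:n] + [truth[n]]  # plus one guaranteed target token

print(tokens)
```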

Prefix Caching: When multiple requests share the same system prompt, cache the KV states for that prefix. Subsequent requests skip re-encoding the shared prefix.
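
In vLLM this is a constructor flag; requests that share a prompt prefix then reuse its cached KV blocks automatically (model name is a placeholder):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder
          enable_prefix_caching=True)

system = "You are a support assistant for ExampleCorp. "  # long shared prefix
questions = ["How do I reset my password?", "Where is my invoice?"]
outputs = llm.generate([system + q for q in questions],
                       SamplingParams(max_tokens=64))
```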

Deployment Options

Option | Latency | Cost | Control | Effort
API providers (OpenAI, Anthropic) | 200-800ms TTFT | $$ per M tokens | Low | Low
Managed serving (AWS Bedrock, GCP Vertex) | 300-1000ms TTFT | $$-$$$ | Medium | Medium
Self-hosted (vLLM on GPU VMs) | 100-500ms TTFT | $ (at scale) | High | High
Edge deployment (quantized, mobile) | 50-200ms TTFT | Device cost | Full | Very High

Implementation

Serving with vLLM

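A minimal sketch of offline batch inference with vLLM's Python API; the model name and sampling settings are illustrative:

```python
from vllm import LLM, SamplingParams

# Load once; vLLM applies continuous batching and PagedAttention internally.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

prompts = [
    "Explain continuous batching in one paragraph.",
    "What is a KV cache and why does it matter?",
]

# generate() batches the prompts internally and returns all results.
for output in llm.generate(prompts, params):
    print(output.prompt, "->", output.outputs[0].text)

# For online serving, vLLM also ships an OpenAI-compatible HTTP server;
# from a shell:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
```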

Load Balancing Multiple GPU Workers

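A minimal sketch of application-level routing, assuming each GPU worker runs vLLM's OpenAI-compatible server; the worker URLs and model name are placeholders. Production setups more often put an L7 load balancer or a dedicated router in front, but the routing logic is the same idea:

```python
import threading
import requests

WORKERS = ["http://10.0.0.1:8000", "http://10.0.0.2:8000"]  # placeholder URLs

_lock = threading.Lock()
_in_flight = {w: 0 for w in WORKERS}  # outstanding requests per worker

def pick_worker() -> str:
    # Least-outstanding-requests routing: send work to the least-loaded GPU.
    with _lock:
        worker = min(_in_flight, key=_in_flight.get)
        _in_flight[worker] += 1
        return worker

def release(worker: str) -> None:
    with _lock:
        _in_flight[worker] -= 1

def complete(prompt: str) -> str:
    worker = pick_worker()
    try:
        resp = requests.post(
            f"{worker}/v1/completions",
            json={
                "model": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder
                "prompt": prompt,
                "max_tokens": 128,
            },
            timeout=60,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["text"]
    finally:
        release(worker)

print(complete("Why does batching improve GPU utilization?"))
```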

Trade-offs

Self-Hosted vs API

Self-host when:

  • API spend exceeds roughly $10K/month
  • Data privacy requirements prohibit sending data externally
  • You need sub-100ms time-to-first-token
  • You want full control over model selection and updates

Use APIs when:

  • Volume is low to moderate
  • You need access to frontier models (GPT-4o, Claude Opus)
  • You lack GPU infrastructure expertise
  • You want to avoid managing hardware failures and scaling

Advantages of Understanding LLM Serving

  • Make informed build-vs-buy decisions
  • Optimize costs (often 5-10x savings at scale with self-hosting)
  • Debug latency issues in production AI systems
  • Design better architectures for multi-agent systems

Disadvantages of Self-Hosting

  • GPU procurement and maintenance burden
  • Model updates require manual deployment
  • Need expertise in CUDA, GPU networking, and ML infrastructure
  • On-call burden for GPU failures

Common Misconceptions

  • "Bigger GPUs always mean faster inference" — Memory bandwidth, not compute, is typically the bottleneck for LLM inference. An H100 is faster than an A100 partly because of higher memory bandwidth (3.35 TB/s vs 2 TB/s), not just more FLOPS.

  • "Quantization always degrades quality" — INT8 quantization typically causes <1% quality degradation on benchmarks. INT4 (GPTQ, AWQ) is model-dependent but often acceptable. The quality-cost trade-off is usually worthwhile.

  • "You need the latest GPU for LLM serving" — Older GPUs like A10G (24GB) serve quantized 7B-13B models effectively. Not every application needs a 70B model. Match GPU to model size and requirements.

  • "Streaming output is just a UX feature" — Streaming reduces perceived latency from seconds to milliseconds. The total generation time is the same, but the user sees the first token in ~200ms instead of waiting 5 seconds for the complete response.

How This Appears in Interviews

LLM serving is a key topic in AI infrastructure interviews:

  • "Design an LLM serving system for 10,000 concurrent users" — discuss batching, KV cache, GPU selection, auto-scaling, and load balancing. See our interview questions on AI infrastructure.
  • "How would you reduce LLM inference costs by 5x?" — discuss quantization, smaller models, caching, and token budgeting.
  • "Explain the difference between prefill and decode phases" — the prefill processes all input tokens in parallel, while decode generates tokens one at a time. Different optimization strategies apply to each.
