LLM Interview Questions for Senior Engineers (2026)
15 advanced LLM interview questions with detailed answer frameworks covering transformer architecture, attention mechanisms, tokenization, context windows, fine-tuning, prompt engineering, hallucination mitigation, RLHF, and production LLM systems at top AI companies.
Why LLM Knowledge Is Essential for Senior Engineering Interviews in 2026
Large language models have reshaped the technology industry more rapidly than any innovation since the smartphone. In 2026, LLM expertise is no longer confined to AI research teams. Product engineers, platform engineers, and infrastructure engineers at companies from Google and OpenAI to mid-stage startups are expected to understand how LLMs work, how to integrate them into production systems, and how to reason about their limitations.
Interviewers ask LLM questions to assess three things: your understanding of the underlying architecture (can you reason about why models behave the way they do?), your practical experience deploying LLMs (can you build reliable products on top of a probabilistic foundation?), and your awareness of failure modes (can you mitigate hallucinations, manage cost, and ensure safety?). These questions appear in system design rounds, machine learning rounds, and even general engineering rounds where LLM integration is relevant to the role.
This guide covers 15 questions that span foundational concepts and production engineering. Each question is framed around the interviewer's true intent, followed by an answer framework that demonstrates both theoretical understanding and hands-on experience. Pair this with the System Design Interview Guide and AlgoRoq's practice platform for comprehensive preparation.
Question 1: Explain the Transformer Architecture and Why It Replaced RNNs
What the interviewer is really asking: Do you understand the fundamental architecture behind modern LLMs at a level where you can reason about performance characteristics, scaling behavior, and design decisions? They want more than a textbook recitation; they want to see that you understand the why.
Answer framework:
The transformer architecture, introduced in the 2017 "Attention Is All You Need" paper, replaced recurrent neural networks (RNNs) for sequence modeling. The key innovation was replacing sequential processing with parallel self-attention.
Why RNNs failed at scale. RNNs process tokens sequentially: the representation of token t depends on the representation of token t-1. This creates two problems. First, training cannot be parallelized across sequence positions, making it slow on modern GPU hardware. Second, long-range dependencies are difficult to learn because gradients must flow through many sequential steps, leading to vanishing or exploding gradients. LSTMs and GRUs mitigate but do not solve this.
How transformers solve this. The transformer computes attention between all pairs of tokens simultaneously. Each token can directly attend to any other token regardless of distance, eliminating the sequential bottleneck. This enables massive parallelization during training and eliminates the vanishing gradient problem for long-range dependencies.
Architecture components. Walk through the key building blocks:
- Token embeddings + positional encodings: Since transformers process all tokens in parallel, they have no inherent notion of order. Positional encodings (sinusoidal in the original paper, learned or RoPE in modern models) inject position information.
- Multi-head self-attention: Each head learns to attend to different types of relationships (syntactic, semantic, positional). Multiple heads provide representational diversity.
- Feed-forward networks: After attention aggregates information across positions, a position-wise feed-forward network processes each token's representation independently. This is where much of the model's factual knowledge is stored.
- Layer normalization and residual connections: These stabilize training of deep networks (modern LLMs have 80-120+ layers).
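A single pre-norm decoder block in PyTorch shows how these components compose. Dimensions are illustrative, and real LLMs differ in normalization placement and attention details:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One pre-norm decoder block: normalize, attend, add residual; repeat for FFN."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x, attn_mask=None):
        h = self.norm1(x)
        a, _ = self.attn(h, h, h, attn_mask=attn_mask)  # self-attention mixes positions
        x = x + a                                       # residual connection
        return x + self.ffn(self.norm2(x))              # position-wise FFN + residual

# A causal mask makes this a decoder block: token i cannot attend to j > i.
x = torch.randn(1, 16, 512)                             # (batch, seq_len, d_model)
mask = torch.triu(torch.ones(16, 16, dtype=torch.bool), diagonal=1)
print(TransformerBlock()(x, attn_mask=mask).shape)      # torch.Size([1, 16, 512])
```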
Scaling properties. The key insight for interviews is that self-attention has O(n^2) complexity in sequence length, which is why context window size is a major constraint. Modern architectures use techniques like grouped query attention (GQA), sliding window attention, and sparse attention to reduce this cost.
For deeper architecture study, see Transformer Architecture and Neural Network Fundamentals.
Question 2: How Does the Attention Mechanism Work, and What Are Its Variants?
What the interviewer is really asking: Can you explain attention at a mathematical level and discuss the practical variants used in modern production LLMs? They want to see you connect the math to engineering tradeoffs like memory usage, inference speed, and model quality.
Answer framework:
Core mechanism. Attention computes a weighted sum of value vectors, where the weights are determined by the compatibility between query and key vectors. For input sequence X:
- Q = XW_Q (what am I looking for?)
- K = XW_K (what do I contain?)
- V = XW_V (what information do I carry?)
- Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
The division by sqrt(d_k) prevents the dot products from growing too large in magnitude, which would push softmax into regions with extremely small gradients.
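A minimal NumPy sketch of single-head scaled dot-product attention (toy shapes, no masking):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # pairwise query-key compatibility
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                         # 4 tokens, d_model = 8
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape)                                    # (4, 8)
```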
Multi-head attention. Rather than computing a single attention function, split Q, K, V into h heads, compute attention independently for each head, concatenate, and project. Each head can learn different attention patterns: one head might focus on syntactic relationships (subject-verb agreement), another on semantic similarity, and another on positional proximity.
Key variants in production LLMs:
- Multi-Query Attention (MQA): All heads share the same K and V projections but have independent Q projections. This dramatically reduces the KV cache size during inference (by a factor of n_heads), improving throughput for autoregressive generation with minimal quality loss. Used in PaLM and Falcon.
- Grouped Query Attention (GQA): A middle ground between full multi-head and multi-query attention. Groups of query heads share K and V projections: for example, 32 query heads might share 8 KV groups. This provides most of MQA's efficiency benefit while retaining more of multi-head attention's quality. Used in Llama 2, Llama 3, and Mistral.
- Sliding Window Attention: Each token attends only to a fixed window of nearby tokens (e.g., 4096 tokens). Information still propagates across the full context through stacked layers: layer 1 sees 4096 tokens, layer 2 effectively sees 8192 (through the tokens layer 1 attended to), and so on. Used in Mistral and as a component of hybrid architectures.
- Flash Attention: Not a different attention pattern but an IO-aware implementation that tiles the attention computation to minimize reads and writes to GPU high-bandwidth memory (HBM). Provides a 2-4x speedup while computing exact attention (no approximation). Essential for training and serving modern LLMs.
Discuss the KV cache: during autoregressive generation, previously computed K and V vectors are cached to avoid recomputation. The KV cache size grows linearly with sequence length and is often the memory bottleneck for long-context inference. GQA reduces KV cache size proportionally to the number of KV groups.
See Attention Mechanisms and GPU Memory Management for deeper coverage.
Question 3: Explain Tokenization and Its Impact on LLM Behavior
What the interviewer is really asking: Do you understand the layer between raw text and model inputs, and can you reason about how tokenization choices affect model capabilities, cost, and failure modes? This question separates engineers who use LLMs as black boxes from those who understand the abstractions.
Answer framework:
Tokenization converts raw text into a sequence of integer token IDs that the model processes. The tokenizer's vocabulary and algorithm fundamentally shape what the model can and cannot do.
Byte-Pair Encoding (BPE). The dominant tokenization algorithm. Start with individual characters (or bytes), then iteratively merge the most frequent adjacent pair into a new token. Repeat until the vocabulary reaches a target size (typically 32K-128K tokens). BPE naturally creates tokens for common words ("the", "and") and subword units for rarer words ("unhappiness" -> "un" + "happiness" or "un" + "hap" + "piness" depending on the vocabulary).
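A toy BPE trainer makes the merge loop concrete (real tokenizers operate on bytes with many optimizations, but the algorithm is the same):

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    corpus = Counter(tuple(w) for w in words)       # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)            # most frequent adjacent pair
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():        # apply the merge everywhere
            out, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    out.append(symbols[i] + symbols[i + 1]); i += 2
                else:
                    out.append(symbols[i]); i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

print(bpe_train(["low", "lower", "lowest", "low"], 3))
# [('l', 'o'), ('lo', 'w'), ('low', 'e')] -- frequent fragments become tokens
```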
Impact on model behavior:
- Arithmetic and reasoning. Tokenizers often split numbers inconsistently: "1234" might be one token while "12345" might be two ("123" + "45"). The same digit therefore has different representations in different positions, which is one reason LLMs struggle with precise numerical computation.
- Multilingual performance. BPE vocabularies trained primarily on English text produce long token sequences for non-English text. A Chinese sentence might require 3x more tokens than its English equivalent, consuming more context window and increasing cost. Modern multilingual tokenizers address this by training on balanced multilingual corpora.
- Code understanding. Whitespace-heavy languages like Python can waste tokens on indentation. Modern tokenizers handle this by encoding common indentation patterns as single tokens.
- Cost implications. LLM API pricing is per-token, so a tokenizer that is inefficient for your domain directly increases cost. If your application processes medical terminology, for example, those terms may be split into many subword tokens, consuming more context and costing more per request.
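Token counts are easy to check empirically. A small sketch using the tiktoken library; the per-1K-token price is a placeholder, not a real rate:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def estimate_request_cost(text, usd_per_1k_tokens=0.01):  # placeholder price
    n_tokens = len(enc.encode(text))
    return n_tokens, n_tokens / 1000 * usd_per_1k_tokens

# Domain-specific vocabulary often fragments into many subword tokens:
for text in ["The cat sat on the mat.",
             "Idiopathic thrombocytopenic purpura was diagnosed."]:
    print(estimate_request_cost(text))
```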
SentencePiece vs. tiktoken. SentencePiece (used by Llama, T5) operates on raw text and handles whitespace as part of the token. Tiktoken (used by GPT models) is a fast BPE implementation that operates on UTF-8 bytes. Both produce similar results but have different edge cases with special characters and whitespace.
See NLP Fundamentals and LLM API Design for related topics.
Question 4: How Do Context Windows Work and What Are the Challenges of Long Context?
What the interviewer is really asking: Can you reason about the practical constraints of finite context windows and the engineering solutions for working within them? They want to see that you understand both the theoretical limitations and the production workarounds.
Answer framework:
The context window is the maximum number of tokens a model can process in a single forward pass. It is a fundamental constraint because the self-attention mechanism computes pairwise interactions between all tokens, with O(n^2) time and memory complexity.
Why context length matters. A larger context window enables: processing longer documents, maintaining longer conversation histories, providing more examples in few-shot prompts, and including more retrieved context in RAG systems. But larger context comes with costs.
Challenges of long context:
- Computational cost. Attention is O(n^2) in sequence length, so doubling the context window quadruples the attention computation. Even with Flash Attention and hardware optimization, this imposes real cost and latency constraints.
- Lost in the middle. Research shows that LLMs attend more to information at the beginning and end of the context window and less to information in the middle. Simply stuffing more context into the window therefore does not guarantee the model will use it effectively.
- KV cache memory. During autoregressive generation, the KV cache stores key and value tensors for all previous tokens. For a model with hidden dimension 8192, 128 layers, and 128K context, the KV cache for a single sequence can run to hundreds of gigabytes with full multi-head attention (see the back-of-envelope sketch after this list). This limits the number of concurrent requests a server can handle.
- Retrieval accuracy vs. context length. Needle-in-a-haystack evaluations show that even models trained for long context have degraded retrieval accuracy in the middle of very long sequences. Structured retrieval (RAG) often outperforms raw long context for information-seeking tasks.
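To make the KV cache bullet concrete, here is a back-of-envelope sizing sketch. The configuration mirrors the hypothetical model above (hidden dimension 8192 laid out as 64 heads of dimension 128, 128 layers, fp16 cache); real deployments vary:

```python
def kv_cache_bytes(n_layers, seq_len, n_kv_heads, head_dim, dtype_bytes=2):
    """Per-sequence KV cache: a K and a V tensor for every layer and token (fp16)."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * dtype_bytes

# Full multi-head attention: all 64 heads keep their own K/V.
full = kv_cache_bytes(n_layers=128, seq_len=128 * 1024, n_kv_heads=64, head_dim=128)
print(f"full MHA: {full / 2**30:.0f} GiB per sequence")      # ~512 GiB

# Grouped query attention with 8 KV groups shrinks the cache 8x.
gqa = kv_cache_bytes(n_layers=128, seq_len=128 * 1024, n_kv_heads=8, head_dim=128)
print(f"GQA, 8 groups: {gqa / 2**30:.0f} GiB per sequence")  # ~64 GiB
```

Numbers like these explain why GQA (Question 2) and paged KV cache management (Question 11) matter so much for long-context serving.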
Engineering solutions:
- RAG (Retrieval-Augmented Generation): Rather than fitting everything into the context, retrieve the most relevant chunks and include only those. This is effective for knowledge-intensive tasks where the relevant information is a small fraction of the total corpus.
- Sliding window with summarization: Process long documents in chunks, summarize each chunk, then process the summaries. This loses detail but preserves high-level structure.
- Hierarchical attention: Use full attention within local windows and sparse attention across windows. This reduces complexity while maintaining some long-range connectivity.
- Context caching / prompt caching: Cache the KV states for common prefixes (system prompts, few-shot examples) so they do not need to be recomputed for each request. This reduces latency and cost for applications with shared context.
See RAG Architecture and Vector Databases for the retrieval side of this problem.
Question 5: Explain Fine-Tuning Strategies for LLMs and When to Use Each
What the interviewer is really asking: Can you navigate the decision space between prompt engineering, fine-tuning, and full training? They want to see that you understand the cost-quality-speed tradeoffs and can recommend the right approach for a given use case.
Answer framework:
Present fine-tuning as a spectrum from lightweight to heavyweight:
Prompt engineering (no fine-tuning). Modify the model's behavior through instructions and examples in the prompt. Best for: rapid prototyping, tasks where the base model already has the necessary knowledge, and situations where you cannot afford the data collection and training overhead. Limitations: consumes context window, inconsistent behavior, cannot teach genuinely new capabilities.
Few-shot in-context learning. Provide examples of desired input-output pairs in the prompt. More reliable than zero-shot for structured tasks. Limitation: example quality matters enormously, and the context window limits the number of examples.
LoRA (Low-Rank Adaptation). Freeze the pre-trained weights and add small trainable low-rank matrices to the attention layers. Typically adds 0.1-1% trainable parameters. Best for: adapting a model to a specific domain or style with moderate amounts of data (1K-100K examples). LoRA adapters are small (megabytes) and can be swapped at serving time, enabling multi-tenant deployments where different customers have different fine-tuned behaviors on the same base model.
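A minimal LoRA wrapper around one linear layer shows why the trainable fraction is so small. This is a sketch with common default hyperparameters (rank 8, alpha 16), not a full implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight plus a trainable low-rank update (B @ A)."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # freeze pre-trained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # B starts at zero, so training begins exactly at the base model's behavior.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 65,536 trainable LoRA params vs ~16.8M frozen base params
```

Only the A and B matrices are saved per adapter, which is why adapters are megabytes in size and can be hot-swapped at serving time.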
QLoRA. Combines LoRA with 4-bit quantization of the base model. Enables fine-tuning a 70B parameter model on a single 48GB GPU. The quality tradeoff is small for most tasks, making this the practical default for resource-constrained teams.
Full fine-tuning. Update all model parameters. Requires the most data (100K+ examples), compute (multiple GPUs for days or weeks), and produces the highest quality results when sufficient data is available. Use for: building a model that needs to deeply internalize a new domain (medical, legal, financial) or developing a significant new capability.
Decision framework for interviews:
- Start with prompt engineering. If it works, stop.
- If prompt engineering is inconsistent or consumes too much context, try LoRA fine-tuning.
- If LoRA fine-tuning plateaus and you have abundant high-quality data, consider full fine-tuning.
- Never fine-tune when the problem is data quality. Garbage in, garbage out is amplified by fine-tuning.
See Transfer Learning and Model Training Infrastructure for related discussion.
Question 6: How Does Prompt Engineering Work and What Are Best Practices?
What the interviewer is really asking: Can you systematically optimize LLM behavior without training, and do you understand the principles behind effective prompting rather than relying on trial and error?
Answer framework:
Prompt engineering is the practice of structuring inputs to LLMs to elicit desired outputs. It is not guesswork; effective prompt engineering follows empirical principles.
Core techniques:
- Role and persona. Setting a role ("You are an expert security engineer") activates relevant knowledge and communication patterns. Be specific: "You are a senior security engineer reviewing code for SQL injection vulnerabilities" outperforms generic role instructions.
- Structured output. Request specific output formats (JSON, markdown tables, numbered lists) and provide a schema or example of the desired output. This dramatically reduces parsing errors and post-processing complexity.
- Chain of thought (CoT). Instruct the model to reason step by step before producing an answer. This improves accuracy on multi-step reasoning tasks (math, logic, code debugging), often by 20-40%. For production systems, you can request the reasoning in a structured block and parse only the final answer.
- Few-shot examples. Provide 3-5 examples of desired input-output pairs. Example selection matters: choose examples that cover edge cases and demonstrate the desired level of detail. Order matters too: place the most representative example last (recency bias).
- Constraints and guardrails. Explicitly state what the model should not do: "Do not invent information. If you are unsure, say so." Negative instructions are important for reducing hallucination.
Production prompt engineering:
- Prompt templating. Separate the static prompt structure from dynamic variables. Use a templating system that validates variable types and lengths, and store prompt versions in version control with associated evaluation results.
- Evaluation-driven iteration. Build an evaluation suite of 50-200 test cases with expected outputs. Score each prompt version against this suite using automated metrics (exact match, semantic similarity, LLM-as-judge) and compare across versions. Never change a production prompt without running the eval suite.
- Prompt injection defense. Production prompts must be robust against adversarial user inputs that attempt to override system instructions. Use input sanitization, instruction hierarchy (system > user), and output validation as defense layers.
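Question 7: How Do You Detect and Mitigate LLM Hallucinations?
What the interviewer is really asking: Do you understand why LLMs produce confident but false output, and can you design system-level defenses rather than hoping the model behaves?
Answer framework:
Hallucination is a consequence of the training objective: the model is optimized to produce plausible continuations, not verified facts. Mitigation is therefore a systems problem with several layers:
- Grounding via retrieval: Include authoritative context in the prompt and instruct the model to answer only from that context (see the RAG design in Question 9).
- Prompt-level constraints: Explicit negative instructions such as "Do not invent information. If you are unsure, say so" reduce fabrication, as discussed in Question 6.
- Citation verification: Require the model to cite a source for each factual claim, then programmatically check that every cited source exists in the provided context and supports the claim. Reject or regenerate responses with unverifiable citations.
- Output validation: For structured tasks, validate outputs against a schema and retry on failure (see Question 13).
As a concrete illustration, a code-review prompt building on the security-review persona from Question 6 can combine structured output with an explicit anti-hallucination constraint. A minimal sketch of such a template in Python (the JSON schema and field names are illustrative):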
```python
REVIEW_PROMPT_SUFFIX = """Provide your review as a JSON array of findings:
[
  {
    "severity": "critical|high|medium|low",
    "line": <line_number>,
    "issue": "<one-sentence description of the problem>"
  }
]
If you find no issues, return an empty array []. Do not invent issues."""
```
See RAG Architecture and AI Safety for comprehensive treatment of grounding and safety.
Question 8: Explain RLHF and Its Role in Modern LLM Training
What the interviewer is really asking: Do you understand the training pipeline that transforms a base language model into an assistant, and can you reason about the tradeoffs of alignment training? They want to see that you understand why RLHF exists and where it can go wrong.
Answer framework:
RLHF (Reinforcement Learning from Human Feedback) is the process that aligns a pre-trained language model with human preferences. It bridges the gap between "predict the next token" (what pre-training optimizes) and "be helpful, harmless, and honest" (what users want).
The three-stage pipeline:
Stage 1: Supervised Fine-Tuning (SFT). Take the pre-trained base model and fine-tune it on a curated dataset of high-quality conversations. This teaches the model the format and style of a helpful assistant. The SFT dataset is typically 10K-100K examples written or curated by humans.
Stage 2: Reward Model Training. Collect comparison data: given a prompt, generate two or more responses and have humans rank them by quality. Train a reward model (often a smaller LM) to predict human preference. The reward model learns to assign higher scores to responses humans prefer. Key challenge: inter-annotator disagreement and the difficulty of defining "quality" for subjective tasks.
Stage 3: RL Optimization. Use the reward model as the objective function and optimize the SFT model using a policy gradient algorithm (typically PPO - Proximal Policy Optimization). The model learns to generate responses that maximize the reward model's score while staying close to the SFT model (via a KL divergence penalty). The KL penalty prevents reward hacking, where the model finds degenerate outputs that score high with the reward model but are not actually good responses.
Alternatives and improvements:
- DPO (Direct Preference Optimization): Eliminates the need for a separate reward model by directly optimizing the policy on preference data. Simpler to implement and more stable than PPO, it has become the dominant approach for many teams (the loss is sketched after this list).
- Constitutional AI (CAI): Instead of human feedback, use AI feedback guided by a set of principles (a "constitution"). The model critiques and revises its own outputs according to these principles. Scales better than human feedback but may miss nuances that humans catch.
- RLAIF (RL from AI Feedback): Use a more capable model to provide feedback for training a smaller model. Practical for organizations that cannot afford large-scale human annotation.
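For the DPO bullet above, the core objective fits in a few lines. A minimal sketch, assuming you already have the summed log-probabilities of each response under the policy and under a frozen reference model:

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Prefer the chosen response over the rejected one, measured relative
    to the reference model; beta controls how hard the preference is pushed."""
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```

Measuring log-probabilities relative to the reference model provides the implicit KL regularization that plays the same role as PPO's KL penalty.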
Failure modes of RLHF:
- Reward hacking: The model exploits the reward model's weaknesses (e.g., being overly verbose because the reward model was trained on data where longer responses were preferred)
- Mode collapse: Over-optimization causes the model to produce repetitive, safe responses that score well on the reward model but lack diversity
- Sycophancy: The model agrees with the user rather than providing accurate information because agreement was rewarded during training
See Reinforcement Learning and AI Alignment for the theoretical foundations.
Question 9: Design a Production RAG System
What the interviewer is really asking: Can you build a retrieval-augmented generation system that is accurate, fast, and maintainable? They want to see that you understand the full pipeline from document ingestion to response generation, including the many ways RAG systems fail in practice.
Answer framework:
A production RAG system has four major components: document processing, indexing, retrieval, and generation.
Document processing. Ingest documents from various sources (PDFs, web pages, databases). Parse and chunk them into segments suitable for retrieval. Chunking strategy is critical: too small and you lose context, too large and retrieval becomes imprecise. Use semantic chunking (split at paragraph or section boundaries) rather than fixed-size chunking. Overlap chunks by 10-20% to avoid losing information at boundaries.
Indexing. Generate vector embeddings for each chunk using an embedding model. Store embeddings in a vector database (Pinecone, Weaviate, pgvector, Qdrant). Also maintain a keyword index (BM25) for hybrid search. Store chunk metadata (source document, page number, section title) alongside embeddings for citation and filtering.
Retrieval. Given a user query, retrieve relevant chunks using hybrid search: combine dense retrieval (vector similarity) with sparse retrieval (BM25 keyword matching). Use Reciprocal Rank Fusion (RRF) to merge results from both methods. Apply a re-ranking model (cross-encoder) to the top-k results for higher precision.
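Reciprocal Rank Fusion itself is a few lines; the document IDs below are placeholders:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked lists of doc IDs; k=60 is the constant from the RRF paper."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc3", "doc1", "doc7"]    # vector-similarity order
sparse = ["doc1", "doc9", "doc3"]   # BM25 order
print(reciprocal_rank_fusion([dense, sparse]))
# ['doc1', 'doc3', 'doc9', 'doc7'] -- documents ranked well by both methods win
```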
Generation. Construct a prompt with the retrieved chunks and the user query. Instruct the model to answer based only on the provided context and to cite sources. Parse the response for citations and validate them.
Common RAG failure modes:
- Retrieval misses relevant documents (fix: better chunking, hybrid search, query expansion)
- Retrieved context is relevant but the model ignores it (fix: prompt engineering, position context strategically)
- Model hallucinates beyond the retrieved context (fix: citation validation, faithfulness checking)
- Stale index does not reflect updated documents (fix: incremental indexing pipeline with change detection)
See Vector Databases, Search Systems, and the System Design Interview Guide.
Question 10: How Do You Evaluate LLM Outputs Systematically?
What the interviewer is really asking: Can you build evaluation systems for a technology whose outputs are subjective and variable? They want to see that you have moved beyond vibes-based evaluation to systematic, reproducible measurement.
Answer framework:
LLM evaluation operates at multiple levels, from automated metrics to human judgment.
Automated metrics for specific tasks:
- Exact match / F1: For extractive QA and structured outputs where there is a single correct answer
- BLEU / ROUGE: For summarization and translation, though these correlate poorly with human judgment for open-ended generation
- Code execution: For code generation, run the generated code against test cases. Pass rate is an unambiguous metric
- Semantic similarity: Compare embedding similarity between generated and reference outputs. Better than lexical metrics for paraphrased but correct answers
LLM-as-judge: Use a more capable model (or the same model with a carefully designed rubric) to evaluate outputs on specific dimensions: accuracy, completeness, clarity, safety, and relevance. This scales better than human evaluation and correlates well with human judgment when the rubric is specific.
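A minimal LLM-as-judge harness looks like the following; call_llm is an assumed provider wrapper (prompt string in, completion string out), and the rubric is deliberately simplified:

```python
JUDGE_PROMPT = """Score the RESPONSE against the REFERENCE on a 1-5 scale for
factual accuracy and completeness. Reply with only the integer score.

QUESTION: {question}
REFERENCE: {reference}
RESPONSE: {response}
"""

def judge(question, reference, response, call_llm):
    raw = call_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, response=response))
    try:
        return max(1, min(5, int(raw.strip())))  # clamp out-of-range scores
    except ValueError:
        return None  # treat unparseable judgments as missing, not as zero
```

In practice you would score each dimension separately and calibrate the judge against a sample of human ratings.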
Evaluation best practices:
- Build a golden test set of 100-500 examples covering common cases, edge cases, and adversarial inputs
- Version your eval suite alongside your prompts
- Run evals on every prompt change, model upgrade, and RAG index update
- Track metrics over time to detect regression
- Combine automated evals with periodic human evaluation to calibrate the automated metrics
See ML Evaluation and Testing Strategies for related frameworks.
Question 11: How Do You Optimize LLM Inference Cost and Latency?
What the interviewer is really asking: Can you make LLM serving economically viable for a production application? They want to see that you can reason about the cost-quality-latency tradeoff and apply the right optimizations for a given use case.
Answer framework:
LLM inference is expensive because of model size, memory bandwidth limitations, and autoregressive generation. Optimization requires attacking multiple dimensions:
Model-level optimizations:
- Quantization: Reduce model weights from 16-bit to 8-bit or 4-bit. INT8 quantization typically preserves 99%+ of quality with 2x memory reduction and significant speedup. 4-bit (GPTQ, AWQ) enables running large models on consumer hardware with modest quality loss.
- Distillation: Train a smaller model to mimic the larger model's outputs. A well-distilled 7B model can match a 70B model's performance on specific tasks. This provides the largest latency and cost reduction.
- Speculative decoding: Use a small draft model to generate candidate tokens quickly, then verify them in parallel with the large model. When the draft model's predictions are accepted (which happens frequently for predictable text), inference speed increases 2-3x.
Serving-level optimizations:
- KV cache management: Use paged attention (vLLM) to manage KV cache memory efficiently. This eliminates memory fragmentation and enables serving more concurrent requests.
- Continuous batching: Instead of waiting for all requests in a batch to finish, start new requests as soon as slots become available. This improves GPU utilization from 30-40% to 80-90%.
- Prompt caching: Cache KV states for common prefixes (system prompts, few-shot examples). For applications where 80% of the prompt is shared across requests, this reduces time-to-first-token dramatically.
Architecture-level optimizations:
- Semantic caching: Cache responses for semantically similar queries. Use embedding similarity to detect cache hits. Effective for FAQ-style applications where many users ask similar questions (a sketch follows this list).
- Routing: Use a small classifier to route queries to the cheapest model that can handle them. Simple queries go to a small model; complex queries go to a large model. This can reduce average cost by 50-70% with minimal quality impact.
- Streaming: Return tokens as they are generated rather than waiting for the complete response. This reduces perceived latency even when actual generation time is unchanged.
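A minimal in-memory version of the semantic cache, assuming an embed_fn that returns unit-normalized vectors (a production system would use a vector database and an eviction policy):

```python
import numpy as np

class SemanticCache:
    def __init__(self, embed_fn, threshold=0.92):  # threshold is tunable
        self.embed_fn = embed_fn                   # text -> unit-norm vector
        self.threshold = threshold
        self.keys, self.values = [], []

    def get(self, query):
        if not self.keys:
            return None
        q = self.embed_fn(query)
        sims = np.stack(self.keys) @ q             # cosine similarity (unit vectors)
        best = int(sims.argmax())
        return self.values[best] if sims[best] >= self.threshold else None

    def put(self, query, response):
        self.keys.append(self.embed_fn(query))
        self.values.append(response)
```

The threshold controls the precision/hit-rate tradeoff: too low and users get answers to slightly different questions; too high and the cache rarely hits.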
See Performance Optimization and Cost Optimization for broader infrastructure context.
Question 12: Explain the Difference Between Encoder, Decoder, and Encoder-Decoder Transformers
What the interviewer is really asking: Can you map the transformer architecture taxonomy to practical use cases, and do you understand why different tasks require different architectures?
Answer framework:
Encoder-only (e.g., BERT, RoBERTa). Processes the full input bidirectionally: each token attends to all other tokens. The output is a contextual representation of the input, not generated text. Best for understanding tasks: classification, named entity recognition, semantic similarity, and retrieval embeddings. Not suitable for text generation.
Decoder-only (e.g., GPT, Llama, Claude). Processes tokens left-to-right with causal masking: each token can only attend to previous tokens. Generates text autoregressively one token at a time. This is the architecture behind modern LLMs. Best for: text generation, conversation, code generation, and reasoning. Can also perform understanding tasks via prompting ("Classify this text as positive or negative: ...") but is less parameter-efficient for pure understanding tasks than encoder-only models.
Encoder-decoder (e.g., T5, BART). The encoder processes the input bidirectionally; the decoder generates output autoregressively while attending to the encoder's representations via cross-attention. Best for: translation, summarization, and tasks where the input and output have different structures. The encoder-decoder architecture provides a natural separation between understanding the input and generating the output.
Why decoder-only dominates in 2026. Decoder-only models have won the scaling competition because they are simpler (one architecture, one training objective), scale more predictably, and can perform both understanding and generation tasks. The flexibility of in-context learning means a single large decoder-only model can replace many specialized encoder-only models. However, for embedding and retrieval tasks, encoder-only models remain more efficient and effective.
Production implications:
- Use encoder-only models for search, classification, and embedding (smaller, faster, cheaper)
- Use decoder-only models for generation tasks (chatbots, content creation, code generation)
- Use encoder-decoder models for structured transformation tasks (translation, summarization) where you want the architecture to enforce the input-output separation
See NLP Architecture Comparison and Model Selection for detailed comparisons.
Question 13: How Do You Build Guardrails for Production LLM Applications?
What the interviewer is really asking: Can you build LLM applications that are safe, reliable, and compliant? They are testing whether you understand that deploying an LLM without guardrails is a liability, not a feature.
Answer framework:
Guardrails operate at three stages: input, processing, and output.
Input guardrails:
- Prompt injection detection: Classify incoming user messages for injection attempts that try to override system instructions. Use a combination of heuristic rules (detecting instruction-like patterns) and a trained classifier.
- Content filtering: Block or flag inputs that request harmful content (violence, illegal activities, PII extraction). Use a content safety classifier alongside keyword filters.
- Rate limiting and abuse detection: Detect automated abuse patterns (rapid-fire requests, systematic prompt probing) and throttle or block abusive users.
- PII detection: Scan inputs for personally identifiable information and either redact it before sending to the LLM or warn the user.
Processing guardrails:
- System prompt hardening: Design system prompts that are resistant to override attempts. Use instruction hierarchy where system-level instructions take precedence over user-level instructions.
- Tool use validation: If the LLM can call tools (APIs, databases), validate every tool call against an allowlist of permitted operations and parameter ranges before execution. Never let the LLM construct arbitrary SQL, API calls, or shell commands without validation.
- Context isolation: In multi-tenant applications, ensure that one user's conversation context cannot leak into another user's session.
Output guardrails:
- Content safety filtering: Scan generated outputs for harmful, biased, or inappropriate content before returning to the user.
- Factuality checking: For applications where accuracy matters, validate factual claims against a knowledge base or use the citation verification pipeline described in the hallucination question.
- Format validation: For structured outputs (JSON, code), validate against a schema before returning, and retry with a corrective prompt if validation fails (sketched after this list).
- Toxicity scoring: Run outputs through a toxicity classifier and block or rephrase responses that exceed the threshold.
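For the format validation bullet, a sketch of the validate-and-retry loop; call_llm is an assumed provider wrapper, and the schema check is simplified to required keys:

```python
import json

def generate_validated_json(call_llm, prompt, required_keys, max_retries=2):
    attempt_prompt = prompt
    for _ in range(max_retries + 1):
        raw = call_llm(attempt_prompt)
        try:
            data = json.loads(raw)
            missing = [k for k in required_keys if k not in data]
            if not missing:
                return data
            error = f"missing required keys: {missing}"
        except json.JSONDecodeError as exc:
            error = f"invalid JSON: {exc}"
        # Corrective retry: tell the model exactly why its output was rejected.
        attempt_prompt = (f"{prompt}\n\nYour previous answer was rejected "
                          f"({error}). Return ONLY valid JSON with keys "
                          f"{sorted(required_keys)}.")
    raise ValueError("model failed to produce valid JSON")
```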
See Application Security and AI Safety for comprehensive security treatment.
Question 14: Explain Model Distillation and When You Would Use It
What the interviewer is really asking: Can you make large models practical for production by training smaller, faster models that preserve the large model's capabilities? They want to see that you understand the cost-quality tradeoff and when distillation is the right tool.
Answer framework:
Distillation trains a smaller "student" model to mimic a larger "teacher" model's behavior. The student learns not just the correct answers but the teacher's probability distribution over all possible answers, which contains richer information ("dark knowledge") than hard labels alone.
How it works:
- Generate a dataset by running the teacher model on a large corpus of inputs and recording its outputs (either full probability distributions or generated text)
- Train the student model on this dataset, optimizing a loss that combines:
- KL divergence between student and teacher output distributions (soft label loss)
- Standard cross-entropy against ground truth labels (hard label loss)
- Optionally, intermediate layer matching (hidden state distillation)
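The combined loss from the steps above, in a minimal PyTorch sketch (temperature T and mixing weight alpha are typical defaults, not canonical values):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-label KL against the teacher's distribution plus hard-label CE."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)               # T^2 keeps gradient scale comparable across temperatures
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

The temperature softens the teacher's distribution so the student sees the relative probabilities of wrong answers, which is where the "dark knowledge" lives.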
When to use distillation:
- You need to serve a model at a latency or cost that the large model cannot meet
- You have a specific, well-defined task where a smaller model can reach sufficient quality
- You want to deploy on edge devices or mobile where the large model cannot run
- You need to reduce the number of GPUs required for serving
When NOT to use distillation:
- The task requires the large model's full breadth of knowledge (open-domain QA)
- You do not have enough compute to generate a sufficiently large training dataset from the teacher
- The quality gap between student and teacher is too large for the application's requirements
Practical considerations:
- Dataset quality matters more than dataset size. Curate the teacher's inputs to cover the distribution you expect in production, including edge cases.
- Task-specific distillation dramatically outperforms general-purpose distillation. A 7B model distilled for code review can match a 70B model on code review while being much worse at other tasks.
- Evaluate the student on your production eval suite, not just on generic benchmarks. Benchmarks can be misleading for specific use cases.
See Model Optimization and AI/ML System Design for the full system context.
Question 15: Design a System for Serving Multiple LLMs with Dynamic Routing
What the interviewer is really asking: Can you build LLM infrastructure that is cost-effective, reliable, and flexible enough to support multiple models for different use cases? This is the capstone system design question that tests both LLM knowledge and infrastructure design skills.
Answer framework:
Problem statement. An organization uses multiple LLMs (different sizes, providers, and specializations) for different tasks. Design a system that routes requests to the optimal model based on task complexity, cost budget, latency requirements, and model availability.
Architecture components:
Request classifier. A lightweight model (or rule-based system) that analyzes incoming requests and assigns a complexity tier:
- Tier 1 (simple): Factual lookups, formatting, simple classification -> route to smallest/cheapest model
- Tier 2 (moderate): Summarization, standard Q&A, code completion -> route to mid-size model
- Tier 3 (complex): Multi-step reasoning, creative writing, complex code generation -> route to largest model
Model registry. Tracks available models, their capabilities, current load, cost per token, and health status. Each model has a capability profile (what tasks it handles well) and operational metadata (latency p50/p99, throughput, error rate).
Router. Given the request classification and the model registry, select the optimal model. The routing policy considers: task-model fit (does this model handle this task type well?), current load (is the preferred model overloaded?), cost (is the user on a budget-constrained tier?), and latency requirements (does this request need streaming?).
Fallback chain. If the selected model is unavailable or returns an error, cascade to the next model in the chain. Implement circuit breakers per model to avoid hammering a failing service. The fallback chain should always terminate at a reliable default, even if it is a simpler model.
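A compressed sketch of the router and fallback chain; the registry entries, tier assignments, and call_model provider wrapper are all hypothetical:

```python
import time

class ModelEntry:
    def __init__(self, name, tiers, cost_per_1k_tokens):
        self.name = name
        self.tiers = tiers            # complexity tiers this model handles well
        self.cost = cost_per_1k_tokens
        self.failures = 0             # consecutive errors (circuit-breaker state)
        self.open_until = 0.0         # timestamp until which the circuit stays open

REGISTRY = [                          # hypothetical models, cheapest first
    ModelEntry("small-8b", {1}, 0.10),
    ModelEntry("mid-70b", {1, 2}, 0.80),
    ModelEntry("frontier", {1, 2, 3}, 5.00),
]

def route(request, tier, call_model):
    """Cheapest healthy model that handles the tier, cascading on failure.
    call_model is an assumed provider wrapper: (model, request) -> response."""
    candidates = sorted((m for m in REGISTRY if tier in m.tiers),
                        key=lambda m: m.cost)
    for model in candidates:
        if time.time() < model.open_until:
            continue                  # circuit open: skip the failing model
        try:
            response = call_model(model, request)
            model.failures = 0        # success closes the circuit
            return response
        except Exception:
            model.failures += 1
            if model.failures >= 3:   # trip after 3 consecutive errors
                model.open_until = time.time() + 30.0
    raise RuntimeError("all candidate models unavailable")
```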
Observability. Log every routing decision, model response, and downstream outcome. Track per-model quality metrics, cost, and latency. Use this data to continuously refine the routing policy.
This question ties together everything: model knowledge, system design, reliability engineering, and cost optimization. See Load Balancing, Circuit Breakers, and the System Design Interview Guide for the infrastructure foundations. Practice end-to-end system design on AlgoRoq's platform.
How to Practice
LLM interview preparation requires both theoretical understanding and hands-on experience. Here is a structured approach:
- Build something with an LLM API. Deploy a RAG system, fine-tune a small model, or build a multi-step agent. The experience of debugging context window overflows, handling API errors, and optimizing cost will give you stories to tell in interviews that no amount of reading can provide.
- Read the foundational papers. "Attention Is All You Need" (transformers), "Language Models are Few-Shot Learners" (GPT-3 / in-context learning), "Training language models to follow instructions with human feedback" (InstructGPT / RLHF), and "LoRA: Low-Rank Adaptation of Large Language Models". Understanding these papers gives you the vocabulary to discuss LLMs at the level interviewers expect.
- Study production case studies. Read engineering blogs from companies deploying LLMs at scale: how they handle prompt management, evaluation, guardrails, and cost optimization. These practical details are what separate good interview answers from great ones.
- Practice explaining concepts at multiple levels. You might need to explain attention to a systems engineer with no ML background, or discuss RLHF tradeoffs with an ML researcher. Practice adjusting your depth and vocabulary for different audiences.
- Use mock interviews. LLM questions are open-ended and can go in many directions depending on the interviewer's follow-ups. Practice with AlgoRoq's mock interview platform to build comfort with the unpredictability of live interviews.
- Cross-train in systems. Many LLM interview questions are really system design questions with an LLM component. Strengthen your foundation in distributed systems, API design, and caching strategies.
Common Mistakes to Avoid
- Treating LLMs as deterministic systems. LLMs are probabilistic: any design that assumes the model will always produce the exact same output for the same input is fragile. Build validation, retry logic, and fallback strategies into every LLM-powered feature.
- Ignoring cost. LLM inference is expensive at scale. Saying "we will use GPT-4 for everything" without discussing cost optimization (caching, routing, distillation, batching) signals that you have not operated LLMs in production. Always include a cost analysis in your system design.
- No evaluation story. If you propose using an LLM for a task, you must explain how you will measure whether it is working correctly. "We will look at the outputs and see if they are good" is not a strategy. Define metrics, build eval suites, and automate evaluation.
- Skipping the guardrails discussion. Every production LLM application needs input validation, output filtering, and fallback behavior. Omitting these in your design tells the interviewer you have not thought about failure modes.
- Over-indexing on model architecture. While understanding transformers is important, interviewers at product companies care more about how you integrate LLMs into production systems than about your knowledge of attention head pruning. Spend 30% of your preparation on architecture and 70% on system design, evaluation, and operations.
- Confusing fine-tuning with prompting. Know when each approach is appropriate. Reaching for fine-tuning when prompt engineering would suffice wastes time and resources; reaching for prompt engineering when the task requires fine-tuning leads to brittle, unreliable systems.
- Ignoring latency and user experience. LLM inference is slow compared to traditional APIs. Discuss streaming, progressive rendering, and UX patterns that make the wait tolerable. A 5-second time-to-first-token is unacceptable for interactive applications.
- Not discussing data privacy. If user data is sent to an external LLM provider, discuss the privacy implications. Mention options like self-hosted models, data processing agreements, PII redaction before API calls, and compliance requirements (GDPR, HIPAA). Senior engineers are expected to raise these concerns proactively.