AI Engineering Guide — RAG to Multi-Agent Systems
AI engineering is the discipline of building production applications powered by large language models and related AI infrastructure. It sits at the intersection of machine learning, software engineering, and systems design. Unlike ML research, which focuses on improving model capabilities, AI engineering focuses on making models useful, reliable, and cost-effective in production.
This guide covers the full AI engineering stack — from retrieval-augmented generation (RAG) pipelines to multi-agent systems, from embedding models to production LLM deployment. Each section includes practical implementation details, trade-offs, and real-world examples from companies operating AI systems at scale. Whether you are building your first RAG application or designing multi-agent architectures, this is the reference you need.
If you are preparing for AI engineering interviews at companies building LLM-powered products, this guide covers the concepts that interviewers expect. For broader system design preparation, see our system design interview guide.
Table of Contents
- RAG Pipelines: Architecture and Implementation
- Vector Databases
- Embedding Models
- Chunking Strategies
- Fine-Tuning vs. RAG: When to Use Each
- Prompt Engineering for Production
- LLM Serving Infrastructure
- Multi-Agent Systems
- Model Context Protocol (MCP)
- AI Guardrails and Safety
- Evaluation Frameworks
- Production LLM Deployment
- Cost Optimization
- How to Study This Material
- Related Resources
RAG Pipelines: Architecture and Implementation
Retrieval-Augmented Generation (RAG) is the most widely adopted pattern for giving LLMs access to external knowledge. Instead of relying solely on the model's training data, RAG retrieves relevant documents from a knowledge base and includes them in the prompt as context.
Why RAG?
LLMs have two fundamental limitations that RAG addresses:
- Knowledge cutoff: Models are trained on data up to a certain date. They do not know about events, products, or changes after that date.
- Hallucination: Models confidently generate plausible-sounding but incorrect information, especially for domain-specific or rare topics.
RAG mitigates both by grounding the model's responses in retrieved, verified documents. The model generates answers based on the provided context rather than its parametric memory.
RAG Architecture
At a high level, a RAG system has two phases: an offline indexing phase (ingest, chunk, embed, and store documents) and an online query phase (embed the query, retrieve the most relevant chunks, and generate a grounded answer). The steps below walk through both phases.
The RAG Pipeline Step by Step
Step 1: Document Ingestion
Load source documents (PDFs, web pages, Markdown files, database records, Confluence pages, Slack messages). Parse them into plain text, preserving structure (headings, tables, code blocks) where possible.
Tools: Unstructured.io, LlamaIndex document loaders, LangChain document loaders, Apache Tika.
Step 2: Chunking
Split documents into smaller pieces (chunks) that fit within the LLM's context window and are semantically coherent. Chunking strategy significantly affects retrieval quality. See the dedicated chunking section below.
Step 3: Embedding
Convert each chunk into a dense vector (embedding) that captures its semantic meaning. Similar chunks have similar vectors (high cosine similarity). Store the vectors in a vector database along with the original text.
Step 4: Query Processing
When a user submits a query, the system performs the following steps (sketched in code after the list):
- Convert the query to an embedding using the same embedding model.
- Search the vector database for the top-k chunks most similar to the query embedding.
- Optionally rerank the results using a cross-encoder model for better precision.
- Format the retrieved chunks into the LLM prompt as context.
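A minimal sketch of the embed-and-search steps, assuming the OpenAI Python SDK and a brute-force in-memory scan. In production, the vector database performs this search with an ANN index; `chunks` and `chunk_vectors` stand in for hypothetical pre-computed index data:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def retrieve(query: str, chunk_vectors: np.ndarray, chunks: list[str], k: int = 5) -> list[str]:
    """Embed the query and return the k most similar chunks by cosine similarity."""
    q = embed(query)
    q /= np.linalg.norm(q)
    normed = chunk_vectors / np.linalg.norm(chunk_vectors, axis=1, keepdims=True)
    scores = normed @ q  # dot product of unit vectors = cosine similarity
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]
```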
Step 5: Generation
The LLM generates an answer using the retrieved context. The prompt typically includes (see the assembly sketch after the list):
- A system instruction explaining the task and constraints.
- The retrieved context documents.
- The user's question.
- Instructions to cite sources and say "I don't know" if the context does not contain the answer.
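A sketch of how such a prompt might be assembled; the exact wording and tag structure are illustrative, not prescriptive:

```python
def build_rag_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a grounded generation prompt from retrieved chunks."""
    context = "\n\n".join(
        f'<doc id="{i}">\n{chunk}\n</doc>' for i, chunk in enumerate(chunks)
    )
    return (
        "Answer the question using ONLY the context below. "
        "Cite the doc ids that support each claim. "
        'If the context does not contain the answer, reply "I don\'t know."\n\n'
        f"<context>\n{context}\n</context>\n\n"
        f"Question: {question}"
    )
```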
Advanced RAG Techniques
Hybrid Search: Combine dense vector search (semantic similarity) with sparse keyword search (BM25). Dense search captures meaning ("How do I deploy?" matches "deployment guide"). Sparse search captures exact terms ("error code E-4012" matches the exact string). Fusion combines both result sets for better recall.
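A minimal fusion sketch using Reciprocal Rank Fusion (RRF), one common way to merge the two ranked result lists; the constant k=60 comes from the original RRF paper:

```python
def reciprocal_rank_fusion(dense: list[str], sparse: list[str], k: int = 60) -> list[str]:
    """Combine two ranked lists of document ids with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) across every list it appears in,
    so documents ranked well by both retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in (dense, sparse):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```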
Query Transformation: The raw user query may not be the best search query. Techniques include:
- Query rewriting: Use an LLM to rephrase the query for better retrieval ("What's the price?" → "pricing plans and subscription tiers").
- Query decomposition: Break a complex query into sub-queries, retrieve for each, and synthesize.
- HyDE (Hypothetical Document Embeddings): Generate a hypothetical answer, embed it, and use that embedding for retrieval. The hypothesis is closer to the document space than the query (a minimal sketch follows this list).
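A minimal HyDE sketch, assuming the OpenAI Python SDK; the model names are examples, and the embedding model must match the one used at indexing time:

```python
from openai import OpenAI

client = OpenAI()

def hyde_query_vector(query: str) -> list[float]:
    """Embed a hypothetical answer instead of the raw query (HyDE)."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Write a short passage that plausibly answers: {query}",
        }],
    )
    hypothetical_doc = resp.choices[0].message.content
    # Must use the same embedding model as the document index.
    emb = client.embeddings.create(
        model="text-embedding-3-small", input=hypothetical_doc
    )
    return emb.data[0].embedding
```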
Multi-step RAG (Agentic RAG): For complex questions that require synthesizing information from multiple documents, use an agent loop:
- Retrieve initial documents.
- Check if the retrieved information is sufficient.
- If not, formulate a follow-up query and retrieve more documents.
- Repeat until sufficient information is gathered.
- Generate the final answer.
Parent Document Retrieval: Embed small chunks for precise retrieval, but return the larger parent document (or surrounding context) to the LLM for generation. This gives the LLM more context to generate accurate answers while maintaining retrieval precision.
Real-World: How Notion Implements RAG
Notion's AI Q&A feature uses RAG to answer questions about a user's workspace. Key design decisions:
- Chunks are scoped to individual Notion pages. Each page is treated as a potential retrieval unit.
- Permissions are enforced at retrieval time — the system only returns chunks from pages the user has access to.
- Embeddings are pre-computed and incrementally updated when pages change.
- The system uses a combination of semantic search and keyword search (hybrid) to handle both natural language questions and exact-match queries.
Real-World: How Cursor Uses RAG for Code
Cursor (the AI code editor) uses RAG to understand a codebase and provide context-aware code suggestions:
- The codebase is indexed by chunking files into function-level and class-level segments.
- When the user asks a question or requests a code change, the system retrieves relevant code chunks based on semantic similarity to the prompt.
- Retrieved code is included in the LLM context, enabling the model to generate code that is consistent with the existing codebase's patterns, naming conventions, and architecture.
Vector Databases
Vector databases are purpose-built for storing and searching high-dimensional vectors (embeddings). They provide efficient approximate nearest neighbor (ANN) search, which finds the vectors most similar to a query vector.
How Vector Search Works
Exact nearest neighbor search compares the query vector against every vector in the database. This is O(n) and prohibitively slow for large datasets.
Approximate nearest neighbor (ANN) search uses indexing structures to find approximate results much faster, trading a small amount of accuracy for orders-of-magnitude speedup.
Common ANN algorithms:
HNSW (Hierarchical Navigable Small World):
- Builds a multi-layer graph where each node is connected to its nearest neighbors.
- Search starts at the top layer (sparse, long-range connections) and descends to lower layers (dense, short-range connections).
- Very fast query time, high recall, but high memory usage (stores the graph in memory).
- Used by: Pinecone, Weaviate, Qdrant, pgvector.
IVF (Inverted File Index):
- Clusters vectors into groups using k-means.
- At query time, only search the clusters closest to the query vector.
- Lower memory than HNSW but slower queries.
- Often combined with product quantization (IVF-PQ) for further compression.
- Used by: FAISS, Milvus.
ScaNN (Scalable Nearest Neighbors):
- Google's approach using learned quantization and asymmetric distance computation.
- Optimized for high-throughput, low-latency serving.
- Used by: Google's internal systems, available as a library.
Comparing Vector Databases
| Database | Type | Hosting | Key Strength |
|---|---|---|---|
| Pinecone | Managed cloud | SaaS only | Simplest to operate, built for production |
| Weaviate | Open source | Self-hosted or cloud | Hybrid search (vector + keyword), modules |
| Qdrant | Open source | Self-hosted or cloud | High performance, Rust-based, filtering |
| Milvus | Open source | Self-hosted or Zilliz cloud | Handles billion-scale datasets |
| Chroma | Open source | Embedded or client-server | Developer-friendly, great for prototyping |
| pgvector | PostgreSQL extension | Wherever Postgres runs | Use your existing Postgres, no new infra |
When to Use pgvector vs. a Dedicated Vector Database
Use pgvector when:
- You already use PostgreSQL and want to avoid adding new infrastructure.
- Your dataset is under 5-10 million vectors.
- You need transactional guarantees (ACID) for your vectors and metadata.
- Your query patterns combine vector search with SQL filters.
Use a dedicated vector database when:
- Your dataset exceeds 10 million vectors.
- You need sub-10ms query latency at scale.
- You need advanced features like multi-tenancy, sharding, or real-time indexing.
- Vector search is a core part of your product, not a secondary feature.
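If you do stay on pgvector, combining vector similarity with SQL filters is its sweet spot. A minimal sketch using psycopg 3; the chunks table (with an `embedding vector(...)` column) and its columns are hypothetical:

```python
import psycopg

def search_chunks(conn: psycopg.Connection, query_vec: list[float],
                  doc_type: str, k: int = 5) -> list[tuple]:
    """Vector similarity search combined with a plain SQL metadata filter.

    <=> is pgvector's cosine-distance operator (lower distance = more similar).
    """
    vec_literal = "[" + ",".join(str(x) for x in query_vec) + "]"
    return conn.execute(
        """
        SELECT id, content
        FROM chunks
        WHERE doc_type = %s
        ORDER BY embedding <=> %s::vector
        LIMIT %s
        """,
        (doc_type, vec_literal, k),
    ).fetchall()
```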
Metadata Filtering
In production RAG systems, you rarely want to search all vectors. You need to filter by metadata first:
- Only search documents the user has permission to access.
- Only search documents from a specific date range.
- Only search documents of a specific type (e.g., only API documentation, not blog posts).
Efficient metadata filtering is critical. Some vector databases handle this well (Pinecone, Qdrant, Weaviate), while others require post-filtering (search all, then filter), which is much slower.
Embedding Models
Embedding models convert text into dense numerical vectors that capture semantic meaning. Two texts with similar meanings have vectors with high cosine similarity, even if they use different words.
How Embeddings Work
Modern embedding models are transformer-based neural networks (typically BERT-family architectures) trained on large text corpora. Training uses a contrastive learning objective: the model learns to produce similar vectors for semantically similar text pairs and dissimilar vectors for unrelated pairs.
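The similarity measure itself is a one-liner; a sketch with NumPy:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """1.0 means identical direction, ~0.0 unrelated, -1.0 opposite."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```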
Choosing an Embedding Model
| Model | Dimensions | Max Tokens | Best For |
|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | 8191 | General purpose, high quality |
| OpenAI text-embedding-3-small | 1536 | 8191 | Lower cost, good quality |
| Cohere embed-v3 | 1024 | 512 | Multilingual, search optimized |
| Voyage AI voyage-3 | 1024 | 16000 | Code and technical content |
| BGE-large-en-v1.5 | 1024 | 512 | Open source, self-hosted |
| GTE-Qwen2-7B-instruct | 3584 | 32768 | Open source, long context |
| Nomic embed-text-v1.5 | 768 | 8192 | Open source, Matryoshka dimensions |
Key Considerations
Dimensionality: Higher dimensions capture more nuance but increase storage cost and query latency. OpenAI's text-embedding-3 models support dimensionality reduction — you can request 256, 512, or 1024 dimensions instead of the full 3072, trading quality for efficiency.
Domain specificity: General-purpose embeddings (OpenAI, Cohere) work well for most domains. For specialized domains (medical, legal, code), fine-tuned embeddings significantly outperform general models. See our article on fine-tuning embedding models for implementation details.
Consistency: You must use the same embedding model for indexing and querying. If you change your embedding model, you must re-embed all documents. This is a significant operational consideration — plan for model upgrades from the start.
Matryoshka embeddings: Some models (Nomic, OpenAI text-embedding-3) support "Matryoshka" representations where the first N dimensions are a useful lower-dimensional embedding. This lets you use shorter vectors for fast initial retrieval and full vectors for reranking.
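A sketch of using a Matryoshka embedding at reduced dimensionality; note that truncating vectors from a model not trained this way degrades quality badly:

```python
import numpy as np

def truncate_matryoshka(vec: np.ndarray, dims: int = 256) -> np.ndarray:
    """Keep the first `dims` dimensions and re-normalize to unit length.

    Only valid for Matryoshka-trained models
    (e.g., text-embedding-3, nomic-embed-text-v1.5).
    """
    truncated = vec[:dims]
    return truncated / np.linalg.norm(truncated)
```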
Fine-Tuning Embeddings
General-purpose embedding models may not perform well for domain-specific retrieval. Fine-tuning trains the model on your domain's data to produce better embeddings for your use case.
When to fine-tune:
- Your domain has specialized vocabulary (medical, legal, financial).
- General embeddings retrieve irrelevant results for domain-specific queries.
- You have labeled training data (query-document pairs with relevance labels).
Training approach (sketched in code after the list):
- Collect positive pairs (query, relevant document) and hard negatives (query, similar-but-irrelevant document).
- Fine-tune using a contrastive loss (e.g., Multiple Negatives Ranking Loss).
- Evaluate with retrieval metrics (Recall@k, NDCG, MRR).
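A minimal sketch of this training loop using the classic sentence-transformers fit API; the base model choice and the toy training pair are illustrative:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Hypothetical training data: (query, relevant_document) pairs.
pairs = [
    ("how do I rotate API keys?", "To rotate an API key, open Settings > Security..."),
]

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
examples = [InputExample(texts=[query, doc]) for query, doc in pairs]
loader = DataLoader(examples, shuffle=True, batch_size=32)

# In-batch negatives: every other document in the batch serves as a negative.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
model.save("bge-large-finetuned")
```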
For a step-by-step implementation, see our deep dive on fine-tuning embedding models for domain-specific retrieval.
Chunking Strategies
Chunking is how you split documents into pieces for embedding and retrieval. The chunking strategy has an outsized impact on RAG quality — bad chunking leads to irrelevant retrieval, which leads to bad answers.
Fixed-Size Chunking
Split text into chunks of a fixed number of tokens (e.g., 512 tokens) with overlap (e.g., 50 tokens).
Pros: Simple, consistent chunk sizes, predictable token usage. Cons: Chunks may split mid-sentence or mid-paragraph, breaking semantic coherence.
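A sketch of token-based fixed-size chunking with overlap, using the tiktoken tokenizer (the encoding name matches OpenAI's recent models; adjust for yours):

```python
import tiktoken

def fixed_size_chunks(text: str, chunk_tokens: int = 512,
                      overlap: int = 50) -> list[str]:
    """Split text into fixed-size token windows with overlap at the boundaries."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks, start = [], 0
    step = chunk_tokens - overlap  # assumes overlap < chunk_tokens
    while start < len(tokens):
        chunks.append(enc.decode(tokens[start : start + chunk_tokens]))
        start += step
    return chunks
```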
Semantic Chunking
Split text at natural boundaries (paragraphs, sections, headings) to preserve semantic coherence.
Pros: Preserves semantic meaning within chunks, better retrieval quality. Cons: Variable chunk sizes, some chunks may be very small or very large.
Recursive Character Splitting
The approach used by LangChain's RecursiveCharacterTextSplitter. Try splitting on the largest separator first (double newline), then fall back to smaller separators (single newline, period, space) to achieve the target chunk size.
Pros: Good balance between semantic coherence and chunk size consistency. Cons: Still syntactic, not truly semantic.
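Typical usage of LangChain's splitter looks like the sketch below; note that chunk_size is measured in characters by default, and the source file path is hypothetical:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

document_text = open("docs/guide.md").read()  # hypothetical source document

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # characters by default, not tokens
    chunk_overlap=100,    # ~10% overlap between adjacent chunks
    separators=["\n\n", "\n", ". ", " ", ""],  # try the largest boundary first
)
chunks = splitter.split_text(document_text)
```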
LLM-Based Chunking (Proposition-Level)
Use an LLM to decompose a document into self-contained propositions (facts, claims, definitions). Each proposition is embedded separately.
Example: The sentence "PostgreSQL is a relational database that supports JSONB, full-text search, and the pgvector extension for vector similarity search" might be decomposed into:
- "PostgreSQL is a relational database."
- "PostgreSQL supports JSONB."
- "PostgreSQL supports full-text search."
- "PostgreSQL supports the pgvector extension for vector similarity search."
Pros: Each chunk is a self-contained, searchable fact. Excellent retrieval precision. Cons: Expensive (requires LLM calls for chunking), loses context (each fact is isolated).
Chunking Best Practices
- Include metadata: Add the document title, section heading, and source URL to each chunk. This helps the LLM cite sources and provides context even if the chunk is small.
- Overlap: Use 10-20% overlap between adjacent chunks to avoid losing information at boundaries.
- Test empirically: There is no universally best chunking strategy. Evaluate different approaches on your specific queries and documents using retrieval metrics.
- Chunk size sweet spot: 200-500 tokens works well for most use cases. Smaller chunks increase precision but may miss context. Larger chunks provide more context but reduce retrieval precision.
Fine-Tuning vs. RAG: When to Use Each
Fine-tuning and RAG are the two primary approaches for customizing LLM behavior. They solve different problems and are often complementary.
RAG: Adding External Knowledge
RAG gives the model access to information it was not trained on. Use RAG when:
- You need the model to answer questions about your company's documents, products, or data.
- The knowledge changes frequently (product catalog, pricing, documentation).
- You need to cite sources for the model's answers.
- You need to enforce access control (only retrieve documents the user can see).
Fine-Tuning: Changing Model Behavior
Fine-tuning modifies the model's weights to change how it responds. Use fine-tuning when:
- You need the model to adopt a specific tone, style, or format consistently.
- You need the model to learn domain-specific reasoning (medical diagnosis, legal analysis).
- You need to reduce prompt length by "baking in" instructions.
- You need better performance on a specific task (classification, extraction, summarization).
Decision Framework
| Need | Solution |
|---|---|
| Access to private/recent data | RAG |
| Specific output format | Fine-tuning (or structured output APIs) |
| Domain-specific reasoning | Fine-tuning |
| Citing sources | RAG |
| Reducing hallucination on facts | RAG |
| Reducing prompt token usage | Fine-tuning |
| Consistent tone/personality | Fine-tuning |
| Data changes frequently | RAG |
Using Both Together
The most effective production systems combine RAG and fine-tuning:
- Fine-tune the model to follow your output format, tone, and reasoning patterns.
- Use RAG to provide the model with current, private, and domain-specific data.
- Use prompt engineering to bind the two together — instruct the fine-tuned model to use the RAG context.
Cost Comparison
| Approach | Upfront Cost | Ongoing Cost | Time to Implement |
|---|---|---|---|
| Prompt engineering | None | Higher per-request (longer prompts) | Hours |
| RAG | Moderate (embedding pipeline, vector DB) | Moderate (embedding + retrieval + generation) | Days-Weeks |
| Fine-tuning | Higher (training compute, data preparation) | Lower per-request (shorter prompts) | Weeks |
| RAG + Fine-tuning | Highest upfront | Lowest per-request | Weeks-Months |
Prompt Engineering for Production
Prompt engineering in production is fundamentally different from experimenting in a playground. Production prompts must be reliable, testable, versioned, and robust to edge cases.
Principles for Production Prompts
1. Be explicit about the task
Do not assume the model understands your intent. State exactly what you want, what format the output should be in, and what the model should do when it encounters edge cases.
2. Use structured output
For any output that will be parsed by code, use structured output formats (JSON, XML) and validate the output against a schema.
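A minimal validation sketch with Pydantic; the TicketTriage schema is a hypothetical example:

```python
from pydantic import BaseModel, ValidationError

class TicketTriage(BaseModel):  # hypothetical output schema
    category: str
    priority: int
    summary: str

def parse_llm_output(raw: str) -> TicketTriage | None:
    """Validate model output against the schema before any code consumes it."""
    try:
        return TicketTriage.model_validate_json(raw)
    except ValidationError:
        return None  # signal the caller to retry or fall back
```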
3. Version your prompts
Treat prompts like code. Store them in version control, tag releases, and maintain changelogs. When you change a prompt, run your evaluation suite to detect regressions.
4. Include few-shot examples
For complex tasks, include 2-3 examples of the desired input-output mapping in the prompt. Few-shot examples are more reliable than instructions for complex formats.
5. Handle failures gracefully
The model will sometimes produce invalid output. Your code should do the following (a sketch follows the list):
- Validate the output against a schema.
- Retry with the same prompt (LLM outputs are non-deterministic, so a retry may succeed).
- Fall back to a simpler prompt or a different model.
- Log failures for analysis.
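A sketch of that flow; `call_model` is a hypothetical provider-agnostic helper, and `parse_llm_output` is the validation function sketched earlier:

```python
import logging

logger = logging.getLogger(__name__)

def generate_with_retries(prompt: str, max_retries: int = 2):
    """Validate, retry, then escalate to a stronger model as a fallback."""
    for attempt in range(1, max_retries + 2):
        raw = call_model("gpt-4o-mini", prompt)  # hypothetical LLM call
        parsed = parse_llm_output(raw)
        if parsed is not None:
            return parsed
        logger.warning("invalid LLM output on attempt %d", attempt)
    # All retries failed: fall back to a stronger model before giving up.
    return parse_llm_output(call_model("gpt-4o", prompt))
```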
Prompt Caching
For prompts with a large, stable prefix (system instructions + few-shot examples + RAG context), take advantage of provider-level prompt caching:
- Anthropic: Opt-in caching — you mark cache breakpoints in the prompt with cache_control. Cached tokens are roughly 90% cheaper to read.
- OpenAI: Automatic prefix caching for prompts sharing the same prefix.
- Self-hosted: vLLM supports automatic prefix caching (built on its PagedAttention memory manager); TensorRT-LLM offers comparable KV-cache reuse.
For more on prompt caching strategies and cost savings, see our prompt caching strategies guide.
LLM Serving Infrastructure
Serving LLMs in production requires specialized infrastructure to handle the unique characteristics of LLM inference: large model sizes, variable-length inputs and outputs, and the sequential nature of autoregressive generation.
Key Metrics
- Time to First Token (TTFT): Latency from request to the first generated token. Critical for interactive applications.
- Tokens per Second (TPS): Throughput of the serving system. Determines how many concurrent users you can support.
- Inter-Token Latency (ITL): Time between consecutive generated tokens. Affects the perceived "streaming" speed.
- Cost per Million Tokens: The total cost (compute + memory) divided by the number of tokens processed.
Serving Frameworks
vLLM:
- The most widely used open-source LLM serving framework.
- Key innovation: PagedAttention, which manages the KV cache like virtual memory pages, eliminating memory waste from pre-allocation.
- Supports continuous batching, prefix caching, speculative decoding.
- Excellent throughput — typically 2-3x faster than naive HuggingFace inference.
TensorRT-LLM (NVIDIA):
- Optimized for NVIDIA GPUs with TensorRT compilation.
- Best raw performance on NVIDIA hardware.
- More complex setup than vLLM.
Ollama:
- Designed for local development and small-scale deployment.
- Simple CLI: `ollama run llama3`.
- Not designed for production-scale serving.
Cloud Provider APIs (OpenAI, Anthropic, Google):
- Simplest to use — no infrastructure to manage.
- Pay per token.
- Best for most applications unless you need to self-host for data privacy, cost (at scale), or model customization.
Batching Strategies
Static batching: Wait for a batch of requests to accumulate, process them together. Simple but inefficient — short requests wait for long requests to complete.
Continuous batching (Iteration-level batching): Process each token generation step across all active requests simultaneously. When a request finishes, immediately start processing a new request. Used by vLLM, TRT-LLM, and most modern serving frameworks.
Speculative decoding: Use a small, fast "draft" model to predict several tokens ahead, then verify with the large model in a single forward pass. If the predictions are correct (which they often are for simple text), you skip multiple forward passes. Can achieve 2-3x speedup for long-form generation.
Scaling LLM Serving
Horizontal scaling: Deploy multiple replicas behind a load balancer. Route requests based on model and expected load. This is straightforward for API-based serving.
Model parallelism: For models too large to fit on one GPU, split the model across multiple GPUs:
- Tensor parallelism: Split individual layers across GPUs. Low latency but requires high-bandwidth GPU interconnect (NVLink).
- Pipeline parallelism: Split the model into stages, each on a different GPU. Higher latency but works with lower bandwidth interconnects.
Quantization: Reduce model precision from FP16 to INT8 or INT4, reducing memory usage by 2-4x and increasing throughput. Modern quantization methods (GPTQ, AWQ, GGUF) preserve most of the model's quality.
Multi-Agent Systems
Multi-agent systems use multiple LLM-powered agents that collaborate to accomplish complex tasks. Each agent has a specific role, set of tools, and expertise. Agents communicate with each other, delegate subtasks, and combine their outputs.
Why Multi-Agent?
Single-agent systems struggle with complex tasks that require:
- Multiple distinct skills (coding, research, data analysis, writing).
- Long task chains where the output of one step feeds the next.
- Parallel work that can be done simultaneously.
- Different levels of model capability (use a cheap model for simple tasks, an expensive model for hard ones).
Multi-Agent Architectures
Orchestrator-Worker
A central orchestrator agent decomposes the task and delegates subtasks to specialized worker agents.
Advantages: Clear control flow, easy to debug, deterministic task delegation. Disadvantages: The orchestrator is a bottleneck and a single point of failure.
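A compact sketch of the pattern; `call_llm(system, user)` is a hypothetical single-completion helper, and the worker roles are illustrative:

```python
import json

WORKERS = {  # hypothetical specialist system prompts
    "research": "You are a research agent. Answer with sourced facts.",
    "code": "You are a coding agent. Output only code.",
    "writing": "You are a writing agent. Produce polished prose.",
}

def orchestrate(task: str) -> str:
    """Decompose a task, dispatch subtasks to workers, synthesize the results."""
    plan = json.loads(call_llm(
        "Decompose the task into subtasks. Respond with JSON: "
        '[{"worker": "research|code|writing", "subtask": "..."}]',
        task,
    ))
    results = [call_llm(WORKERS[step["worker"]], step["subtask"]) for step in plan]
    return call_llm(
        "Synthesize these subtask results into one coherent answer.",
        json.dumps({"task": task, "results": results}),
    )
```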
Pipeline (Sequential)
Agents are arranged in a sequence. Each agent processes the output of the previous agent.
Advantages: Simple, predictable, easy to reason about. Disadvantages: No parallelism, early-stage errors propagate through the pipeline.
Debate / Reflection
Two agents critique each other's work in alternating rounds. This is useful for tasks where quality improves through iteration (writing, code review, reasoning).
Tool Use in Multi-Agent Systems
Agents become powerful when they can use tools:
- Code execution: Run code in a sandboxed environment.
- Web search: Search the internet for current information.
- Database queries: Query internal databases.
- API calls: Interact with external services.
- File operations: Read and write files.
Real-World: Devin and AI Coding Agents
Devin (by Cognition) is a multi-agent coding system where:
- A planning agent decomposes a coding task into steps.
- A coding agent writes and modifies code.
- A testing agent runs tests and reports failures.
- A debugging agent analyzes test failures and proposes fixes.
- All agents share access to a code environment, terminal, and browser.
The key insight is that each agent has a focused context window and specialized prompt, rather than trying to handle everything in a single, massive prompt.
Challenges
Cost: Multi-agent systems multiply LLM calls. An orchestrator call plus three worker calls is 4x the cost of a single call. Use cheaper models (Haiku, GPT-4o-mini) for simple tasks.
Latency: Sequential agent calls add up. Parallelize independent tasks when possible.
Error propagation: If one agent produces bad output, downstream agents may amplify the error. Build validation and retry logic between agents.
Debugging: Tracing the flow of information across multiple agents is difficult. Implement comprehensive logging and tracing from day one.
Model Context Protocol (MCP)
The Model Context Protocol (MCP), introduced by Anthropic, is an open standard for connecting LLMs to external data sources and tools. It standardizes the interface between AI applications and the systems they interact with, similar to how HTTP standardized web communication.
Why MCP?
Before MCP, every AI application built custom integrations for each data source (databases, APIs, file systems, SaaS tools). This created an N x M integration problem: N AI applications each building M integrations.
MCP reduces this to an N + M problem: AI applications implement the MCP client interface, data sources implement the MCP server interface, and any client can connect to any server.
Architecture
An MCP host (the AI application) runs one or more MCP clients, each holding a one-to-one connection to an MCP server. Servers expose resources, tools, and prompts; clients discover and invoke them on the model's behalf.
MCP Capabilities
Resources: Data that the AI can read. A GitHub MCP server might expose repository files, pull requests, and issues as resources. The AI application can list available resources and read their contents.
Tools: Actions that the AI can perform. A Slack MCP server might expose tools like send_message, create_channel, search_messages. The AI application can call these tools as part of its workflow.
Prompts: Reusable prompt templates provided by the server. A database MCP server might provide a "query_builder" prompt template that helps the AI construct valid SQL queries.
Building an MCP Server
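A minimal server sketch based on the Python SDK's FastMCP helper; the tool and resource bodies (`run_search`, `load_page`) are hypothetical stand-ins for your own backend:

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("docs-server")

@mcp.tool()
def search_docs(query: str) -> str:
    """Search internal documentation for relevant passages."""
    return run_search(query)  # hypothetical search backend

@mcp.resource("docs://{page_id}")
def get_page(page_id: str) -> str:
    """Expose individual documentation pages as readable resources."""
    return load_page(page_id)  # hypothetical page loader

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```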
MCP in Practice
Claude Code (Anthropic's CLI) uses MCP to connect to local file systems, databases, and external services. Cursor integrates MCP servers for accessing project-specific tools and data. The ecosystem is growing rapidly, with community-built MCP servers for GitHub, Slack, Google Drive, PostgreSQL, MongoDB, and many other services.
AI Guardrails and Safety
Guardrails are mechanisms that constrain LLM behavior to prevent harmful, incorrect, or off-topic outputs. In production systems, guardrails are not optional — they are a requirement for trust and reliability.
Types of Guardrails
Input guardrails (pre-processing):
- Prompt injection detection: Detect and block attempts to override the system prompt.
- Topic filtering: Reject queries that are outside the system's intended scope.
- PII detection: Redact or block inputs containing personal information.
- Rate limiting: Prevent abuse through excessive requests.
Output guardrails (post-processing):
- Content filtering: Block outputs containing harmful, biased, or inappropriate content.
- Hallucination detection: Check generated facts against the retrieved context.
- Format validation: Ensure outputs conform to the expected schema.
- Citation verification: Ensure claimed sources actually support the generated statements.
Implementation Patterns
Layered approach:
Layer 1 catches obvious violations cheaply (under 1ms). Layer 2 uses lightweight classifier models for nuanced detection (10-50ms). Layer 3 uses an LLM to judge output quality for cases where classification is insufficient (100-500ms).
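A sketch of the layered flow; the regex blocklist is a toy layer 1, while `classifier_score` and `llm_judge` are hypothetical stand-ins for a small classifier and an LLM grader:

```python
import re

INJECTION_PATTERN = re.compile(r"ignore (all|previous) instructions", re.IGNORECASE)

def passes_guardrails(user_input: str, output: str, context: str) -> bool:
    """Run cheap checks first and escalate only when necessary."""
    # Layer 1: deterministic checks (<1ms).
    if INJECTION_PATTERN.search(user_input):
        return False
    # Layer 2: lightweight classifier (10-50ms), e.g., a toxicity probability.
    if classifier_score(output) > 0.9:
        return False
    # Layer 3: LLM-as-judge for the nuanced cases only (100-500ms).
    return llm_judge(output, context) == "pass"
```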
Prompt Injection Defense
Prompt injection is the most critical security threat to LLM applications. An attacker crafts input that overrides the system prompt, causing the LLM to ignore its instructions.
Defense strategies:
- Input/output separation: Clearly delimit user input with XML tags or special tokens so the model can distinguish instructions from data.
- Instruction hierarchy: Use Anthropic's or OpenAI's system prompt features that give system instructions higher priority than user input.
- Output validation: Even if the model is tricked, validate the output before acting on it. Never trust LLM output for security-critical decisions.
- Canary tokens: Include a secret token in the system prompt. If the model's output contains the token, injection was likely attempted.
Evaluation Frameworks
Evaluation is the foundation of iterative improvement for LLM applications. Without systematic evaluation, you are making changes blindly and have no way to detect regressions.
Evaluation Dimensions
Retrieval quality (for RAG):
- Recall@k: What fraction of relevant documents appear in the top-k retrieved results?
- Precision@k: What fraction of the top-k retrieved results are relevant?
- NDCG (Normalized Discounted Cumulative Gain): Are the most relevant documents ranked highest?
- MRR (Mean Reciprocal Rank): How high is the first relevant document ranked?
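Recall@k and reciprocal rank are straightforward to compute; a minimal sketch:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant) if relevant else 0.0

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1/rank of the first relevant document; 0.0 if none was retrieved.
    MRR is the mean of this value over a set of test queries."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0
```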
Generation quality:
- Faithfulness: Does the answer accurately reflect the retrieved context? (No hallucination)
- Relevance: Does the answer address the user's question?
- Completeness: Does the answer cover all aspects of the question?
- Coherence: Is the answer well-structured and readable?
Evaluation Methods
1. Deterministic metrics: Exact match, F1, BLEU, ROUGE. Fast and reproducible but limited to cases where there is a known correct answer.
2. LLM-as-judge: Use a strong LLM (GPT-4, Claude) to evaluate another LLM's output. Provide a rubric and examples. More nuanced than deterministic metrics but introduces variability.
3. Human evaluation: The gold standard but expensive and slow. Use for calibrating LLM-as-judge and for weekly quality reviews on a sample of outputs.
Building an Evaluation Pipeline
- Create a test set: 100-500 question-answer pairs covering representative queries, edge cases, and known failure modes.
- Automate evaluation: Run the test set through your pipeline, evaluate each output, and produce a score.
- Integrate with CI/CD: Run evaluations on every prompt change, model change, or RAG configuration change. Block deployment if scores drop below thresholds.
- Track metrics over time: Plot evaluation scores on a dashboard. Catch gradual degradation.
For a complete implementation guide, see our article on building reliable LLM evaluation pipelines.
Production LLM Deployment
Deploying LLMs to production requires handling concerns that do not exist in prototyping: reliability, latency, cost, observability, and graceful degradation.
Architecture for Production
A typical production setup places an API gateway in front of a model router, with fallback models, a response cache, and observability instrumentation wrapped around every LLM call. The concerns below map onto those components.
Key Production Concerns
Model routing: Route requests to different models based on complexity, cost, and latency requirements. Use a cheap model (GPT-4o-mini, Claude Haiku) for simple tasks and an expensive model (GPT-4o, Claude Opus) for complex tasks.
Fallback chains: If the primary model is unavailable or returns an error, fall back to an alternative model. Example: Claude Opus → GPT-4o → Claude Sonnet → cached response.
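A sketch of a fallback chain ending in the cache; `call_model` is a hypothetical provider-agnostic helper, and the model identifiers are placeholders:

```python
FALLBACK_CHAIN = ["claude-opus", "gpt-4o", "claude-sonnet"]  # placeholder ids

def generate_with_fallback(prompt: str, cache: dict) -> str:
    """Try each model in order; serve a cached response as the last resort."""
    for model in FALLBACK_CHAIN:
        try:
            return call_model(model, prompt, timeout=10)  # hypothetical helper
        except Exception:
            continue  # provider error or timeout: try the next model
    if prompt in cache:
        return cache[prompt]
    raise RuntimeError("all models unavailable and no cached response")
```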
Streaming: Stream responses token-by-token to reduce perceived latency. The user sees the first token within 100-500ms rather than waiting 3-10 seconds for the full response.
Caching: Cache responses for identical or semantically similar prompts. Reduces cost and latency for repeated queries. See our article on prompt caching strategies.
Observability: Log every request and response (or a sample for high-volume systems). Track latency, token usage, error rates, and model distribution. Tools: LangSmith, Helicone, Braintrust, OpenTelemetry.
Graceful degradation: When the LLM is slow or unavailable, provide a degraded but useful experience. Options:
- Show cached results.
- Fall back to a keyword search.
- Display a "please try again" message with an estimated wait time.
- Queue the request and notify the user when the response is ready.
Cost Optimization
LLM costs can escalate rapidly. At scale, the difference between a well-optimized and poorly optimized system can be 10-50x in cost.
Cost Levers
1. Model selection: The biggest cost lever. GPT-4o-mini is roughly 17x cheaper than GPT-4o for input tokens ($0.15 vs. $2.50 per million, per the table below). Use the cheapest model that produces acceptable quality for each task.
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4o-mini | $0.15 | $0.60 |
| Claude 3.5 Sonnet | $3.00 | $15.00 |
| Claude 3.5 Haiku | $0.80 | $4.00 |
| Llama 3.1 70B (self-hosted) | ~$0.50 | ~$0.50 |
2. Prompt optimization: Shorter prompts cost less. Techniques:
- Remove redundant instructions.
- Use few-shot examples only when needed (test with zero-shot first).
- Compress RAG context — only include the most relevant chunks, not all retrieved chunks.
- Use fine-tuning to "bake in" instructions and reduce prompt length.
3. Caching: Cache responses for repeated queries. Even a 30% cache hit rate significantly reduces costs.
4. Batching: For non-real-time use cases (batch processing, nightly report generation), use batch APIs. OpenAI's Batch API offers 50% cost reduction for requests that can tolerate 24-hour latency.
5. Self-hosting: At very high volume (millions of requests per day), self-hosting an open-source model (Llama 3, Mistral) on your own GPUs can be 3-5x cheaper than API pricing. But factor in the engineering cost of operating the infrastructure.
Cost Monitoring
Implement per-feature cost tracking from day one. Know how much each feature costs per user per month. Set up alerts for cost anomalies — a prompt injection that triggers a long output or a retry loop can generate unexpected costs.
How to Study This Material
AI engineering is evolving rapidly. The tools and best practices change every few months. Focus on understanding the principles and patterns rather than memorizing specific tool configurations.
Phase 1: Build a RAG Application (1-2 weeks)
- Pick a document corpus (your company's docs, a textbook, Wikipedia subset).
- Build a basic RAG pipeline: chunk documents, embed with OpenAI, store in Chroma or pgvector, query with a simple prompt.
- Evaluate retrieval quality manually: are the right documents being retrieved?
- Iterate on chunking strategy and prompt design.
Phase 2: Production Hardening (1-2 weeks)
- Add hybrid search (BM25 + vector search).
- Implement a reranker (Cohere or a cross-encoder model).
- Add guardrails (input validation, output validation, citation checking).
- Build an evaluation pipeline with a test set of 50+ question-answer pairs.
- Add caching and cost tracking.
Phase 3: Advanced Patterns (2-3 weeks)
- Build an agentic RAG system that iteratively retrieves and reasons.
- Experiment with multi-agent architectures (orchestrator-worker, pipeline).
- Implement MCP for connecting to external data sources.
- Fine-tune an embedding model on your domain data.
- Set up LLM observability (LangSmith, Helicone, or custom logging).
Stay Current
- Follow the Anthropic, OpenAI, and Google AI engineering blogs.
- Read Latent Space, The Batch, and AI Engineering newsletters.
- Try new tools and models monthly — the landscape changes quickly.
Related Resources
Algoroq Concept Deep Dives
- Building Reliable LLM Evaluation Pipelines — automated metrics, LLM-as-judge, and CI/CD integration
- Prompt Caching Strategies — reduce LLM costs by 50%+ with intelligent caching
- Fine-Tuning Embedding Models — improve RAG retrieval with domain-specific embeddings
System Design
- System Design Interview Questions — AI system design is increasingly common in interviews
- How to Design a URL Shortener — foundational system design concepts
Architecture
- Software Architecture Patterns — event-driven architecture and microservices for AI systems
- Distributed Systems Fundamentals — infrastructure foundations for AI serving
Technology Comparisons
- Kafka vs RabbitMQ — message queues for event-driven AI pipelines
Company Case Studies
- How Netflix Scales — large-scale ML infrastructure and personalization
- Google System Design Interview — AI system design in Google interviews
Career
- Senior to Staff Engineer — AI engineering leadership and career growth
Learning Paths
- Algoroq Live Cohort — 12-week program covering AI engineering with hands-on projects
- Self-Paced Learning — study AI engineering concepts at your own pace
- Best System Design Courses Compared — find the right learning resource for your goals