RAG Interview Questions for Senior Engineers (2026)
15 advanced Retrieval-Augmented Generation interview questions with detailed answer frameworks covering chunking strategies, embedding models, vector stores, retrieval pipelines, reranking, evaluation metrics, and hallucination reduction techniques used at top AI companies.
Why RAG Matters in Senior AI Engineering Interviews
Retrieval-Augmented Generation has become the dominant pattern for building production LLM applications that require factual accuracy and domain-specific knowledge. Unlike fine-tuning, RAG allows organizations to ground LLM responses in their own data without the cost and complexity of model training. Every major technology company building AI products, from Google to startups, now expects senior engineers to design, optimize, and evaluate RAG systems.
Interviewers asking RAG questions are evaluating whether you understand the full pipeline from document ingestion to response generation. They want to see that you can reason about chunking trade-offs, select appropriate embedding models, design efficient retrieval strategies, and measure system quality with rigorous evaluation metrics. A strong candidate demonstrates practical experience with the failure modes of RAG systems: irrelevant retrieval, context window overflow, hallucination despite retrieval, and latency bottlenecks.
At companies like OpenAI, Anthropic, Google DeepMind, and AI-native startups, the RAG design round often determines whether a candidate can ship production AI systems versus only prototype them. For a structured preparation plan, see our system design interview guide and explore learning paths tailored to AI engineering roles.
1. What is Retrieval-Augmented Generation and why is it preferred over fine-tuning for most enterprise applications?
What the interviewer is really asking: Do you understand the fundamental trade-offs between RAG and fine-tuning, and can you articulate when each approach is appropriate?
Answer framework:
RAG is an architecture pattern where an LLM generates responses by first retrieving relevant documents from an external knowledge base and including them in the prompt context. The LLM then synthesizes an answer grounded in the retrieved evidence rather than relying solely on its parametric knowledge.
RAG is preferred over fine-tuning in most enterprise settings for several reasons. First, data freshness: RAG can incorporate new documents immediately without retraining, while fine-tuning requires a new training run every time knowledge changes. For a company with policies, product catalogs, or documentation that update weekly, RAG is the only practical choice. Second, attribution and auditability: RAG can cite specific source documents, which is critical for compliance-sensitive industries like healthcare, legal, and finance. Fine-tuned models cannot easily explain which training data influenced a specific response. Third, cost: fine-tuning large models costs thousands of dollars per run and requires ML infrastructure expertise. RAG requires only an embedding model, a vector database, and retrieval logic.
Fine-tuning is still valuable when you need to change the model's style, tone, or reasoning patterns rather than inject factual knowledge. A hybrid approach works well: fine-tune a base model for domain-specific language understanding, then use RAG for factual grounding. For deeper comparison, see our guide on LLM deployment strategies.
Common mistakes: claiming fine-tuning is always inferior, ignoring the latency cost of retrieval, not mentioning that RAG quality depends entirely on retrieval quality.
2. How would you design a chunking strategy for a large heterogeneous document corpus?
What the interviewer is really asking: Can you reason about the fundamental unit of retrieval and understand how chunk size, overlap, and structure affect downstream performance?
Answer framework:
Chunking is the process of splitting documents into segments that become the atomic units of retrieval. The chunking strategy directly determines retrieval precision and recall, so it deserves careful design.
Start with the key trade-off: smaller chunks increase precision (each chunk is tightly focused) but hurt recall and lose context. Larger chunks preserve context but may dilute relevance scores when only a portion is relevant. A typical starting point is 256 to 512 tokens with 10 to 20 percent overlap, but this must be tuned per use case.
For heterogeneous corpora, use document-type-aware chunking. Structured documents like API references should be chunked by section or endpoint. Long-form prose like blog posts or reports should use recursive character splitting with semantic awareness, splitting first by heading, then by paragraph, then by sentence. Tabular data should keep tables intact as single chunks with descriptive context prepended. Code should be chunked by function or class with docstrings attached.
Semantic chunking is an advanced approach: use an embedding model to detect topic boundaries within a document. Compute cosine similarity between consecutive sentence embeddings. When similarity drops below a threshold, introduce a chunk boundary. This produces chunks that are semantically coherent rather than arbitrarily split.
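A minimal sketch of this semantic chunking idea, assuming the sentence-transformers library and an already sentence-split document; the model name and similarity threshold are illustrative and should be tuned:

```python
# Semantic-chunking sketch: start a new chunk where consecutive sentence
# embeddings diverge. Model name and threshold are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunks(sentences: list[str], threshold: float = 0.75) -> list[str]:
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        sim = float(np.dot(emb[i - 1], emb[i]))  # cosine similarity (vectors are normalized)
        if sim < threshold:                      # topic shift detected: close the chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```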
Always attach rich metadata to chunks: source document, section heading, page number, document type, and creation date. This metadata enables filtering during retrieval and improves result quality.
Common mistakes: using a single chunk size for all document types, ignoring overlap which causes context loss at boundaries, chunking code the same way as prose.
3. How do you select an embedding model for a RAG system, and what trade-offs do you consider?
What the interviewer is really asking: Do you understand the embedding model landscape beyond just using whatever OpenAI offers, and can you evaluate models rigorously?
Answer framework:
Embedding model selection is a critical architectural decision that affects retrieval quality, latency, cost, and infrastructure requirements. Evaluate models across five dimensions.
First, retrieval quality on your domain. Benchmarks like MTEB provide general rankings, but what matters is performance on your specific data. Create a golden evaluation set of 200 or more query-document pairs and measure recall@k and NDCG. Models fine-tuned on your domain (or a similar domain) often outperform larger general-purpose models.
Second, dimensionality and storage cost. OpenAI text-embedding-3-large produces 3072-dimensional vectors. Cohere embed-v3 produces 1024-dimensional vectors. Higher dimensions capture more nuance but increase storage and search latency linearly. For a corpus of 10 million chunks at 1536 dimensions with float32 precision, vector storage alone requires approximately 60 GB. Consider whether dimensionality reduction via Matryoshka representation learning or PCA is acceptable.
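To make the storage figure concrete, a quick back-of-the-envelope calculation (raw vectors only, before any index overhead):

```python
# 10M chunks x 1536 dims x 4 bytes (float32) of raw vector data
chunks, dims, bytes_per_float = 10_000_000, 1536, 4
total_gb = chunks * dims * bytes_per_float / 1e9
print(f"{total_gb:.1f} GB")  # ~61.4 GB before HNSW/IVF index overhead
```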
Third, latency and throughput. Self-hosted models like BGE or E5 on a GPU can embed a batch of 256 texts in under 100ms. API-based models add network latency of 50 to 200ms per call. For real-time applications, local models or cached embeddings are essential.
Fourth, multilingual support. If your corpus includes multiple languages, models like multilingual-e5-large or Cohere embed-v3 handle cross-lingual retrieval natively.
Fifth, licensing and deployment constraints. Open-source models like BGE, E5, and GTE can be self-hosted with full control. Proprietary API models like OpenAI and Cohere are easier to start with but create vendor lock-in and require sending data externally.
For most production systems, start with a strong open-source model like BGE-large-en-v1.5 or E5-mistral-7b-instruct, evaluate on your domain, and only move to a proprietary model if retrieval quality is insufficient. See our embedding model comparison for detailed benchmarks.
4. Explain the retrieval pipeline in a production RAG system. What happens between the user query and the LLM prompt?
What the interviewer is really asking: Can you describe the full retrieval pipeline with all the stages that distinguish a production system from a tutorial demo?
Answer framework:
A production RAG retrieval pipeline has six stages, each adding quality and reliability.
Stage 1: Query understanding. Parse the user query to extract intent and entities. Apply query rewriting to transform conversational queries into retrieval-friendly forms. For multi-turn conversations, use the LLM to resolve coreferences and incorporate chat history into a standalone query. For example, if the user asks "What about its pricing?" after discussing Kafka, rewrite to "What is the pricing model for Apache Kafka?"
Stage 2: Query expansion. Generate multiple query variations to improve recall. Use techniques like HyDE (Hypothetical Document Embeddings) where the LLM generates a hypothetical answer and you embed that instead of the raw query. This bridges the vocabulary gap between questions and documents.
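A minimal HyDE sketch, assuming generic `llm_complete` and `embed` helpers (placeholders, not a specific framework API):

```python
# HyDE sketch: embed a hypothetical answer instead of the raw query.
# `llm_complete` and `embed` are placeholder callables.
def hyde_query_embedding(query: str, llm_complete, embed):
    hypothetical = llm_complete(
        f"Write a short passage that would answer this question:\n{query}"
    )
    # Embedding the hypothetical passage usually lands closer to real answer
    # documents than embedding the question, bridging the vocabulary gap.
    return embed(hypothetical)
```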
Stage 3: Retrieval. Execute vector similarity search against the vector database using the query embedding. Use hybrid search combining dense vector retrieval with sparse keyword matching (BM25) for better coverage. Apply metadata filters to scope results by document type, date range, or access permissions.
Stage 4: Reranking. The initial retrieval returns candidates ranked by embedding similarity, which is a coarse signal. Apply a cross-encoder reranker (like Cohere Rerank or a fine-tuned cross-encoder) that scores each query-document pair jointly. Cross-encoders are too slow for initial retrieval but dramatically improve precision on 20 to 50 candidates.
Stage 5: Context assembly. Select the top-k reranked results and assemble them into the LLM prompt. Order matters: place the most relevant documents first for models with position bias, or use the "lost in the middle" mitigation by placing important documents at the beginning and end. Deduplicate near-identical chunks. Ensure total context fits within the model's context window with room for the system prompt and expected response.
Stage 6: Response generation. Send the assembled prompt to the LLM with instructions to answer based only on the provided context. Include source attribution requirements in the system prompt so the model cites which documents support each claim.
Common mistakes: skipping query rewriting for multi-turn conversations, using only vector search without hybrid retrieval, not implementing reranking.
5. What is reranking and why is it critical for RAG quality?
What the interviewer is really asking: Do you understand the difference between bi-encoder and cross-encoder retrieval, and why a two-stage pipeline outperforms single-stage?
Answer framework:
Reranking is a second-stage scoring process that re-evaluates the relevance of retrieved documents using a more powerful model. It exists because the initial retrieval stage makes a fundamental trade-off: bi-encoder models embed queries and documents independently, enabling fast approximate nearest neighbor search across millions of documents, but this independent encoding misses fine-grained query-document interactions.
A cross-encoder reranker processes the query and each candidate document together as a single input, allowing full attention between query tokens and document tokens. This captures nuanced relevance signals like negation, qualification, and semantic relationships that bi-encoders miss. The accuracy improvement is typically 5 to 15 percent in recall@5 on benchmark datasets.
The trade-off is latency: a cross-encoder must score each query-document pair independently, making it O(n) with n candidates. This is why reranking is applied only to the top 20 to 100 candidates from initial retrieval, not the full corpus.
In practice, implement reranking with a model like Cohere Rerank, BGE-reranker-v2, or a fine-tuned cross-encoder.
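A minimal sketch using the sentence-transformers CrossEncoder interface; the checkpoint name is illustrative and any cross-encoder works the same way:

```python
# Two-stage retrieval sketch: bi-encoder candidates in, cross-encoder rerank out.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    # The cross-encoder scores each (query, document) pair jointly,
    # with full attention between query tokens and document tokens.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```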
For production systems, consider Reciprocal Rank Fusion (RRF) as a lightweight alternative or complement. RRF merges ranked lists from multiple retrieval sources (vector search, BM25, metadata filters) without a neural model. It works by assigning each document a score of 1/(k + rank) from each list and summing scores. RRF is fast and surprisingly effective as a first reranking step.
Advanced approach: fine-tune a reranker on your domain data using hard negatives, documents that are topically related but do not answer the specific query. This dramatically improves precision on domain-specific queries compared to general-purpose rerankers.
Common mistakes: not implementing any reranking at all, reranking too many candidates which adds latency, using the same bi-encoder for both retrieval and reranking.
6. How do you evaluate a RAG system? What metrics do you use and how do you build evaluation datasets?
What the interviewer is really asking: Can you move beyond vibes-based evaluation and implement rigorous, reproducible quality measurement?
Answer framework:
RAG evaluation must measure quality at two levels: retrieval quality and generation quality. Each requires different metrics and datasets.
For retrieval evaluation, use information retrieval metrics. Recall@k measures what fraction of relevant documents appear in the top k results. This is the most important retrieval metric since you cannot generate a correct answer from irrelevant context. Precision@k measures what fraction of returned results are relevant. NDCG@k (Normalized Discounted Cumulative Gain) measures whether relevant documents are ranked higher. Mean Reciprocal Rank (MRR) measures the average position of the first relevant result.
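A minimal sketch of two of these metrics, assuming a golden set where each query maps to a set of relevant document IDs and a ranked list of retrieved IDs:

```python
# Retrieval-metric sketch: recall@k and MRR over (relevant, retrieved) pairs.
def recall_at_k(relevant: set[str], retrieved: list[str], k: int) -> float:
    hits = len(relevant & set(retrieved[:k]))
    return hits / len(relevant) if relevant else 0.0

def mrr(relevant: set[str], retrieved: list[str]) -> float:
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank   # reciprocal rank of the first relevant hit
    return 0.0
```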
For generation evaluation, measure faithfulness (does the answer only contain claims supported by the retrieved context), answer relevance (does the answer address the user's question), and completeness (does the answer cover all aspects of the question). Use LLM-as-judge frameworks like RAGAS, which automate these evaluations by prompting a strong LLM to score each dimension on a scale.
For building evaluation datasets, start with 200 to 500 question-answer pairs covering diverse query types: factual lookup, multi-hop reasoning, comparison, and negation queries. Source these from real user queries if available, domain expert creation, and LLM-generated questions validated by humans. Include adversarial examples where the answer is not in the corpus to test the system's ability to abstain.
Run evaluations on every pipeline change: chunking strategy, embedding model, retrieval parameters, reranking model, and prompt template. Track metrics over time in a dashboard to detect regressions.
Common mistakes: evaluating only end-to-end without measuring retrieval and generation separately, using too few evaluation examples, not including adversarial queries where the system should say "I don't know."
7. How do you reduce hallucinations in a RAG system?
What the interviewer is really asking: Do you understand why RAG systems still hallucinate despite having retrieved context, and what practical techniques mitigate this?
Answer framework:
Hallucination in RAG systems occurs for several distinct reasons, each requiring a different mitigation strategy.
First, irrelevant retrieval. The retrieved documents do not contain the answer, but the LLM generates a plausible-sounding response from its parametric knowledge. Mitigation: improve retrieval quality through better chunking, embedding models, and reranking. Add a relevance threshold and return "I don't have enough information" when no document scores above it.
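A minimal sketch of that abstention check, with `retrieve` and `generate` as placeholder callables and an illustrative threshold that should be calibrated on your evaluation set:

```python
# Abstain when no retrieved chunk clears a relevance threshold.
ABSTAIN_MESSAGE = "I don't have enough information to answer that."

def answer_or_abstain(query, retrieve, generate, min_score: float = 0.3):
    results = retrieve(query)                          # list of (chunk, score) pairs
    relevant = [(c, s) for c, s in results if s >= min_score]
    if not relevant:
        return ABSTAIN_MESSAGE                          # refuse rather than hallucinate
    return generate(query, [c for c, _ in relevant])
```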
Second, context window noise. Too many retrieved documents dilute the relevant signal with irrelevant information. The LLM may attend to irrelevant passages and weave them into the answer incorrectly. Mitigation: retrieve fewer but higher-quality chunks. Use aggressive reranking. Implement context compression to extract only the relevant sentences from each chunk.
Third, parametric knowledge override. The LLM's pre-training knowledge conflicts with the retrieved context, and the model follows its prior instead of the provided evidence. Mitigation: use explicit instructions in the system prompt like "Answer ONLY based on the provided context. If the context does not contain the answer, say so." Use models with stronger instruction following. Consider retrieval-focused fine-tuning.
Fourth, reasoning errors. The LLM misinterprets or incorrectly synthesizes information from multiple retrieved chunks. Mitigation: use chain-of-thought prompting where the model first extracts relevant quotes, then reasons over them, then generates the answer. This makes errors detectable.
Additional techniques: implement a post-generation fact-checking step where a separate LLM call verifies each claim against the source documents. Use self-consistency where you generate multiple answers and return only claims that appear consistently. Log and review all cases where the system generates an answer despite low retrieval scores.
Common mistakes: assuming RAG eliminates hallucination entirely, not implementing a confidence threshold for abstention, using overly long contexts that exceed the model's effective attention span.
8. Explain chunking strategies for different document types: PDFs, HTML, code, and structured data.
What the interviewer is really asking: Can you adapt your RAG pipeline to handle real-world messy data, not just clean text files?
Answer framework:
Each document type presents unique challenges for chunking and requires a specialized approach.
PDFs are the most problematic. Raw text extraction loses formatting, tables become garbled, and multi-column layouts interleave content. Use layout-aware parsers like Unstructured, PyMuPDF, or Amazon Textract that preserve document structure. Extract tables separately and convert them to markdown or JSON. For academic papers, chunk by section (abstract, introduction, methods, results) using heading detection. For scanned PDFs, use OCR with confidence scoring and flag low-confidence regions.
HTML documents require cleaning before chunking. Strip navigation, footers, ads, and boilerplate using libraries like trafilatura or readability. Preserve semantic structure: headings define section boundaries, lists should remain intact, and code blocks should not be split. Use heading-aware chunking that creates one chunk per section with the heading hierarchy prepended for context.
Code requires fundamentally different chunking than prose. Split by function or class, not by character count. Include the full function signature, docstring, and body as a single chunk. Prepend the file path and import context. For large functions, split by logical blocks but always include the signature.
Structured data like JSON, CSV, or database records should be serialized into natural language descriptions. A product record {name: "Widget", price: 29.99, category: "Tools"} becomes "Widget is a product in the Tools category priced at $29.99." This allows the embedding model to capture semantic meaning rather than encoding raw field names.
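A minimal sketch of that serialization step; the field names and sentence template are illustrative:

```python
# Serialize a structured record into prose before embedding it.
def record_to_text(record: dict) -> str:
    return (
        f"{record['name']} is a product in the {record['category']} "
        f"category priced at ${record['price']:.2f}."
    )

record_to_text({"name": "Widget", "price": 29.99, "category": "Tools"})
# -> "Widget is a product in the Tools category priced at $29.99."
```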
For all types, attach rich metadata including source file, page number, section title, document type, and last modified date. This metadata enables filtered retrieval and source attribution. For more on handling diverse data sources, see our data engineering concepts.
Common mistakes: using the same chunking strategy for all document types, not preserving table structure, splitting code mid-function.
9. How would you implement hybrid search combining dense and sparse retrieval?
What the interviewer is really asking: Do you understand why pure vector search has limitations and how keyword-based retrieval complements it?
Answer framework:
Dense retrieval (vector similarity search) excels at semantic matching, where the query and relevant documents may use completely different words. However, it struggles with exact keyword matching, rare terms, and entity names. If a user searches for "error code ERR_CONNECTION_REFUSED," a dense retriever may return documents about generic connection errors rather than the specific error code.
Sparse retrieval (BM25 or TF-IDF) excels at exact term matching and handles rare terms well because they receive high IDF weights. However, it completely misses semantic similarity when different words describe the same concept.
Hybrid search combines both to leverage their complementary strengths. The implementation has three approaches.
First, parallel retrieval with fusion. Run dense and sparse searches independently, each returning top-k results. Merge results using Reciprocal Rank Fusion (RRF) which assigns each document a score of 1/(k + rank) from each list and sums them. This requires no tuning and works well as a baseline.
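A minimal RRF sketch over any number of ranked lists; k = 60 is the constant commonly used in practice:

```python
# Reciprocal Rank Fusion: merge ranked lists without any trained model.
from collections import defaultdict

def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    scores = defaultdict(float)
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: rrf([dense_results, bm25_results]) returns the fused ranking.
```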
Second, native hybrid support. Some vector databases like Weaviate and Qdrant support hybrid search natively, combining BM25 and vector search in a single query with a configurable alpha parameter.
Third, learned sparse representations. Models like SPLADE produce learned sparse vectors that combine the benefits of both approaches. SPLADE assigns term weights using a learned model rather than raw frequency statistics, producing sparse vectors that can be searched with inverted indexes but capture semantic meaning.
The alpha parameter (weight between dense and sparse) should be tuned on your evaluation set. Typical values range from 0.5 to 0.8 favoring dense retrieval. Query-dependent alpha routing is an advanced technique: route exact-match queries (error codes, IDs) toward sparse retrieval and natural language queries toward dense retrieval.
Common mistakes: using only dense search for everything, not tuning the fusion weights, ignoring that some vector databases offer native hybrid search.
10. How do you handle multi-hop questions in a RAG system where the answer requires synthesizing information from multiple documents?
What the interviewer is really asking: Can you design beyond simple single-step retrieval to handle complex reasoning over distributed knowledge?
Answer framework:
Multi-hop questions like "What is the revenue of the company whose CEO wrote the book on innovation?" require chaining multiple retrieval steps because no single document contains the full answer. Standard single-step RAG fails on these queries because the initial retrieval cannot find the final answer document without first identifying the intermediate entity.
Approach 1: Iterative retrieval. Decompose the complex question into sub-questions, retrieve for each, and chain the results. The LLM first generates sub-questions ("Who wrote the book on innovation?" then "What is the revenue of [answer]?"), retrieves for each, and synthesizes the final answer.
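A minimal sketch of this iterative loop, assuming placeholder `llm_complete` and `retrieve` helpers and a hop cap to bound latency and error accumulation:

```python
# Iterative multi-hop retrieval sketch: retrieve, ask for the next query, repeat.
def multi_hop_answer(question: str, llm_complete, retrieve, max_hops: int = 3) -> str:
    evidence, query = [], question
    for _ in range(max_hops):                    # cap hops to bound latency and drift
        evidence.extend(retrieve(query))         # retrieve returns a list of text chunks
        followup = llm_complete(
            "Question: " + question + "\n"
            "Evidence so far:\n" + "\n".join(evidence) + "\n"
            "If more evidence is needed, write the next search query. Otherwise reply DONE."
        )
        if followup.strip().upper() == "DONE":
            break
        query = followup
    return llm_complete(
        "Answer the question using only the evidence.\n"
        "Question: " + question + "\nEvidence:\n" + "\n".join(evidence)
    )
```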
Approach 2: Graph-based retrieval. Build a knowledge graph from your documents where entities are nodes and relationships are edges. For multi-hop queries, traverse the graph to find connected information. This is deterministic and faster than iterative LLM calls but requires upfront graph construction.
Approach 3: Agentic RAG. Use an LLM agent with retrieval as a tool. The agent decides when to search, what to search for, and when it has enough information to answer. Frameworks like LangGraph or LlamaIndex agents implement this pattern. The agent can also decide to use different retrieval strategies (keyword search, semantic search, SQL query) based on the sub-question type.
For evaluation, specifically test multi-hop scenarios in your evaluation dataset. Measure not just final answer correctness but also the quality of intermediate retrieval steps and whether the system correctly decomposes complex queries.
Common mistakes: not recognizing multi-hop queries and attempting single-step retrieval, generating too many hops which increases latency and error accumulation, not capping the maximum number of retrieval iterations.
11. How do you handle context window limitations when retrieved documents exceed the model's context length?
What the interviewer is really asking: Can you design practical strategies for fitting relevant information into a finite context window without losing critical details?
Answer framework:
Context window overflow is a critical production challenge. Even with 128K token models, retrieving 50 chunks of 512 tokens each uses 25K tokens before the system prompt and response budget. Blindly stuffing the context window degrades quality because models attend less effectively to middle content (the "lost in the middle" phenomenon).
Strategy 1: Aggressive retrieval filtering. Retrieve more candidates (50 to 100) but aggressively rerank and select only the top 5 to 8. Quality over quantity. A few highly relevant chunks outperform many marginally relevant ones.
Strategy 2: Context compression. Extract only the relevant sentences or passages from each chunk rather than including full chunks. Use an extractive compression model or an LLM to summarize each chunk relative to the query. This can reduce context size by 60 to 80 percent while preserving answer-relevant information.
Strategy 3: Hierarchical summarization. For long documents, create multi-level summaries: document-level, section-level, and paragraph-level. Retrieve at the summary level first to identify relevant documents, then drill into the relevant sections for detail.
Strategy 4: Map-reduce over chunks. Process each chunk independently with the query, extract relevant information, then combine the extracted information into a final prompt. This works well for questions that aggregate information across many sources ("Summarize all customer complaints about feature X").
Strategy 5: Position-aware context ordering. Place the most relevant chunks at the beginning and end of the context (not the middle) based on research showing models attend more to these positions. Interleave high-relevance and supporting chunks.
For production systems, implement dynamic context budgeting: allocate a maximum token budget for context based on the model's window and expected response length, then select and compress chunks to fit within that budget.
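A minimal sketch of that budgeting step, with `count_tokens` as a placeholder for your tokenizer and illustrative budget numbers:

```python
# Dynamic context budgeting: fit reranked chunks into the remaining token budget.
def select_chunks(chunks: list[str], count_tokens, window: int = 128_000,
                  system_tokens: int = 1_000, response_reserve: int = 2_000) -> list[str]:
    budget = window - system_tokens - response_reserve
    selected, used = [], 0
    for chunk in chunks:                  # chunks assumed ordered by reranker score
        cost = count_tokens(chunk)
        if used + cost > budget:
            break
        selected.append(chunk)
        used += cost
    return selected
```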
Common mistakes: blindly stuffing the full context window, not accounting for the lost-in-the-middle effect, not leaving enough tokens for the model's response.
12. How would you design a RAG system for a multi-tenant SaaS application with data isolation requirements?
What the interviewer is really asking: Can you handle the infrastructure and security complexity of serving multiple customers with separate knowledge bases from a shared system?
Answer framework:
Multi-tenant RAG requires data isolation at every layer: document storage, vector index, and retrieval. A tenant's data must never leak into another tenant's results.
Approach 1: Namespace isolation. Use a single vector database instance with per-tenant namespaces or collections. Each tenant's embeddings are stored in their own namespace. At query time, filter retrieval to the requesting tenant's namespace. This is the most cost-effective approach and works well for up to hundreds of tenants. Pinecone namespaces and Qdrant collections support this natively.
Approach 2: Metadata filtering. Store all tenants in a single collection with a tenant_id metadata field. Apply a tenant_id filter on every retrieval query. This is simpler to manage but requires careful implementation to prevent filter bypass and may have performance implications for large collections.
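A minimal sketch of a defense-in-depth wrapper for the metadata-filtering approach, with `vector_search` as a placeholder for your vector database client:

```python
# Every retrieval call must pass through this wrapper, so no code path can
# reach the shared index without a tenant_id filter.
def tenant_search(vector_search, tenant_id: str, query_vector, top_k: int = 10):
    if not tenant_id:
        raise ValueError("tenant_id is required for every retrieval call")
    return vector_search(
        vector=query_vector,
        filter={"tenant_id": tenant_id},   # server-side metadata filter
        top_k=top_k,
    )
```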
Approach 3: Dedicated instances. For tenants with strict compliance requirements (healthcare, government), provision dedicated vector database instances per tenant. This provides the strongest isolation but is the most expensive. Use this for enterprise-tier customers.
Beyond isolation, consider per-tenant customization: different chunking strategies based on their document types, per-tenant embedding model fine-tuning for large customers, custom prompt templates per tenant, and tenant-specific evaluation metrics.
For scalability, implement a tenant routing layer that directs requests to the appropriate storage tier. Cache hot tenant indexes in memory for low-latency retrieval. Monitor per-tenant usage for capacity planning and cost allocation.
Address access control within a tenant: row-level security where different users within a tenant can access different documents. Apply document-level permission filters during retrieval based on the user's role.
Common mistakes: relying solely on metadata filters without defense-in-depth, using a single embedding model when tenants have very different domains, not monitoring for cross-tenant data leakage.
13. What are the key differences between naive RAG, advanced RAG, and modular RAG architectures?
What the interviewer is really asking: Do you understand the evolution of RAG architectures and can you design beyond the basic retrieve-and-generate pattern?
Answer framework:
Naive RAG is the basic three-step pipeline: index documents, retrieve top-k by vector similarity, generate an answer. It works for prototypes and simple use cases but has well-known limitations: poor retrieval for complex queries, no handling of irrelevant results, and vulnerability to the quality of the initial chunking.
Advanced RAG adds pre-retrieval and post-retrieval optimization stages. Pre-retrieval enhancements include query rewriting, query expansion via HyDE, query routing to select the appropriate index or retrieval strategy, and step-back prompting to generalize overly specific queries. Post-retrieval enhancements include reranking with cross-encoders, context compression, and relevance filtering with confidence thresholds. Advanced RAG is what most production systems should target.
Modular RAG decomposes the pipeline into independently configurable and swappable modules: a routing module that directs queries to different retrieval backends based on query type (SQL for structured questions, vector search for semantic questions, web search for recent events), an active retrieval module that decides whether retrieval is even needed (some questions can be answered from the model's knowledge alone), a memory module for multi-turn conversations, and evaluation modules that assess retrieval and generation quality in real-time and trigger fallback strategies.
The progression reflects increasing architectural maturity:
| Aspect | Naive RAG | Advanced RAG | Modular RAG |
|---|---|---|---|
| Query processing | Raw query | Rewritten, expanded | Classified, routed |
| Retrieval | Single vector search | Hybrid search | Multi-source, adaptive |
| Post-processing | None | Reranking, compression | Dynamic, quality-gated |
| Failure handling | None | Confidence threshold | Fallback chains |
| Architecture | Monolithic | Enhanced pipeline | Pluggable modules |
For a senior engineer building a production system, target advanced RAG as the minimum and adopt modular patterns for complex multi-domain applications. See our RAG architecture guide for implementation details.
Common mistakes: deploying naive RAG in production and wondering why quality is poor, over-engineering with modular RAG when advanced RAG suffices, not measuring the incremental value of each enhancement.
14. How do you handle real-time data updates in a RAG system where the knowledge base changes frequently?
What the interviewer is really asking: Can you design an ingestion pipeline that keeps the knowledge base current without causing retrieval quality degradation or excessive costs?
Answer framework:
The challenge is that most RAG tutorials assume a static corpus, but production systems need to handle continuous document additions, updates, and deletions without downtime or quality degradation.
For the ingestion pipeline, implement a change detection layer that monitors data sources for new and modified documents. Use webhooks, file system watchers, database CDC (Change Data Capture), or scheduled polling depending on the source. When a change is detected, process only the affected documents rather than re-indexing everything.
For document updates, you must handle both the embedding and metadata layers. When a document is updated, re-chunk the new version, generate new embeddings, and atomically replace the old chunks with new ones in the vector database. Use document version IDs to ensure consistency.
For high-frequency updates, use a two-tier index architecture. A main index contains the bulk of the corpus and is updated in hourly or daily batch jobs. A real-time index contains recently changed documents and is updated within seconds. At query time, search both indexes and merge results. Periodically compact the real-time index into the main index.
Address embedding model updates: when you switch to a new embedding model, you must re-embed the entire corpus because embeddings from different models are not compatible. Plan for this by maintaining the ability to do a full re-index without downtime. Use blue-green deployment for the vector index.
For time-sensitive applications, add a recency bias to the retrieval scoring. Boost documents updated recently, or implement time-decay functions that gradually reduce older documents' relevance scores.
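A minimal sketch of such a time-decay blend; the half-life and blend weight are illustrative and should be tuned, and `updated_at` is assumed to be a Unix timestamp:

```python
# Blend vector similarity with an exponential recency decay.
import time

def recency_score(similarity: float, updated_at: float,
                  half_life_days: float = 90.0, weight: float = 0.2) -> float:
    age_days = (time.time() - updated_at) / 86_400
    decay = 0.5 ** (age_days / half_life_days)   # halves every `half_life_days`
    return (1 - weight) * similarity + weight * decay
```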
Common mistakes: re-indexing the entire corpus on every change, not handling document deletions which leaves stale chunks in the index, not planning for embedding model migration.
15. You are tasked with building a RAG system for a company's internal knowledge base. Walk through your design from requirements gathering to production deployment.
What the interviewer is really asking: Can you translate RAG concepts into a concrete, end-to-end system design that addresses real-world concerns like cost, latency, observability, and iteration?
Answer framework:
Phase 1: Requirements and data audit. Interview stakeholders to understand the use case: what questions will users ask, what documents contain the answers, what accuracy is acceptable, and what latency is required. Audit the data sources: catalog document types (Confluence, Google Docs, Slack, PDFs), estimate corpus size, assess content quality, and identify access control requirements.
Phase 2: MVP pipeline design. Build the simplest pipeline that could work. Use a hosted embedding model (OpenAI or Cohere) and a managed vector database (Pinecone or Weaviate Cloud). Implement basic recursive chunking with 512-token chunks and 64-token overlap. Use a simple prompt template with the top 5 retrieved chunks. Deploy behind an API with basic logging.
Phase 3: Evaluation infrastructure. Before optimizing anything, build the evaluation pipeline. Create a golden test set of 200 questions with ground-truth answers and relevant document IDs. Implement automated retrieval and generation metrics. Run every pipeline change through this evaluation before deploying.
Phase 4: Iterative optimization. Based on evaluation results, optimize the weakest component. If retrieval recall is low, try different chunking strategies, embedding models, or hybrid search. If generation quality is low despite good retrieval, optimize prompts, add reranking, or try a different LLM. Track cost per query and latency percentiles alongside quality metrics.
Phase 5: Production hardening. Add observability: log every query, retrieved documents, and generated answer. Implement tracing with tools like LangSmith or Phoenix to debug quality issues. Add rate limiting and cost controls. Implement caching for repeated queries. Set up alerting on quality metric degradation.
Phase 6: Feedback loop. Implement user feedback mechanisms: thumbs up/down on answers, ability to flag incorrect responses. Use this feedback to continuously improve the evaluation dataset and fine-tune retrieval and generation. The feedback loop is what separates good RAG systems from great ones.
Estimate costs: for a 1 million document corpus with 10 million chunks, embedding costs approximately $50 to $200 as a one-time expense, vector storage is $50 to $200 per month, and per-query costs include embedding ($0.0001), retrieval (minimal), and LLM generation ($0.01 to $0.10 depending on model). For a detailed cost breakdown, check our pricing guide.
Common mistakes: over-engineering the MVP before validating the use case, optimizing without evaluation infrastructure, not building a feedback loop.
How to Practice RAG Interview Questions
Build at least two RAG systems from scratch: one simple pipeline with a small corpus and one production-grade system with a realistic document set. Implement retrieval evaluation with real metrics, not just eyeballing results. Experiment with different chunking strategies, embedding models, and reranking approaches to develop intuition for what works.
Study the MTEB leaderboard and understand why different embedding models perform differently on different tasks. Read the original RAG paper by Lewis et al. and follow up with recent papers on advanced RAG patterns.
Practice articulating trade-offs: when would you choose smaller chunks over larger ones, when is hybrid search worth the complexity, and when should you use a cross-encoder reranker versus RRF. Be prepared to sketch the full pipeline on a whiteboard and discuss failure modes at each stage.
For structured preparation, explore our system design interview guide and the AI engineering learning path. Practice with real-world datasets from Hugging Face and evaluate your systems with the RAGAS framework.
Common Mistakes to Avoid
- Building RAG without evaluation infrastructure. If you cannot measure retrieval recall and generation faithfulness, you cannot improve your system. Build evaluation before optimization.
- Using the same chunking strategy for all document types. PDFs, code, tables, and prose require different chunking approaches. One size does not fit all.
- Ignoring hybrid search. Pure vector search misses exact keyword matches. Combining dense and sparse retrieval is a simple change with significant impact.
- Skipping reranking. The difference between bi-encoder retrieval and cross-encoder reranked results is substantial. Always rerank your top candidates.
- Not handling the "I don't know" case. When retrieved documents are irrelevant, the system should abstain rather than hallucinate. Implement confidence thresholds.
- Over-stuffing the context window. More context does not mean better answers. A few highly relevant chunks outperform many loosely related ones.
- Treating RAG as a one-time setup. Production RAG systems require continuous monitoring, evaluation, and iteration based on user feedback.