Data Modeling Interview Questions for Senior Engineers (2026)
15 advanced data modeling interview questions with detailed answer frameworks covering ER diagrams, normalization, denormalization, dimensional modeling, star and snowflake schemas, slowly changing dimensions, and production schema design patterns used at top tech companies.
Why Data Modeling Is a Core Interview Topic for Senior Engineers
Data modeling is the foundation of every software system. Before you write a single line of application code, you are making decisions about how data is structured, related, and accessed. In senior engineering interviews, data modeling questions reveal whether you can think beyond the immediate feature request and design schemas that serve the system's long-term needs for performance, flexibility, and correctness.
At companies like Google and Amazon, poor data modeling decisions made early in a product's lifecycle become deeply embedded technical debt. A denormalization choice that speeds up reads today can make a critical migration impossible two years from now. A missing index on a junction table can cause a query that works fine at 10,000 rows to bring down production at 10 million. Senior engineers are expected to anticipate these trajectories, and interviewers use data modeling questions to test that ability.
The questions in this guide cover the full spectrum of data modeling, from foundational concepts like entity-relationship diagrams and normalization through to advanced topics like dimensional modeling for analytics, slowly changing dimensions in data warehouses, and the pragmatic trade-offs of denormalization in distributed systems. Each question includes the real intent behind what the interviewer is asking, a structured answer framework that demonstrates depth, and follow-up considerations that push into staff-level territory.
Whether you are preparing for a system design interview or a deep-dive technical round, mastering data modeling will set you apart from candidates who can only reason about code in isolation. For foundational context, review our guide on how databases work and our comparison of SQL vs NoSQL databases.
Question 1: Walk me through how you would design an ER diagram for a multi-tenant SaaS application.
What the interviewer is really asking: Can you decompose a complex business domain into entities, attributes, and relationships while accounting for multi-tenancy concerns like data isolation, query performance, and schema evolution? They want to see if you can move beyond a single-user mental model and reason about shared infrastructure.
Answer framework:
Start by identifying the core entities: Tenant, User, Subscription, and the domain-specific entities for the SaaS application (for example, Project, Task, Comment in a project management tool). Establish the cardinality of each relationship explicitly: a Tenant has many Users (one-to-many), a User belongs to one Tenant, a Project belongs to one Tenant and has many Tasks.
Discuss the three multi-tenancy strategies and their ER implications:
- Shared database, shared schema with a tenant_id foreign key on every table. This is the most common approach. Every query must include a tenant filter, and you need composite indexes that lead with tenant_id.
- Shared database, separate schemas. Each tenant gets its own PostgreSQL schema. The ER diagram is identical per tenant, but you avoid the tenant_id column. Trade-off: schema migrations must be applied to every tenant schema.
- Separate databases per tenant. Complete isolation but significant operational overhead.
For the shared-schema approach, your ER diagram should show:
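A minimal DDL sketch of this shape, assuming PostgreSQL; table and column names are illustrative, not prescriptive:

```sql
CREATE TABLE tenants (
    id   BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name TEXT NOT NULL
);

CREATE TABLE users (
    id        BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    tenant_id BIGINT NOT NULL REFERENCES tenants(id),
    email     TEXT NOT NULL
);

CREATE TABLE projects (
    id        BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    tenant_id BIGINT NOT NULL REFERENCES tenants(id),
    name      TEXT NOT NULL
);

CREATE TABLE tasks (
    id          BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    tenant_id   BIGINT NOT NULL REFERENCES tenants(id),  -- deliberately repeated (see below)
    project_id  BIGINT NOT NULL REFERENCES projects(id),
    assignee_id BIGINT REFERENCES users(id),
    title       TEXT NOT NULL
);

CREATE TABLE comments (
    id        BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    tenant_id BIGINT NOT NULL REFERENCES tenants(id),    -- deliberately repeated (see below)
    task_id   BIGINT NOT NULL REFERENCES tasks(id),
    author_id BIGINT NOT NULL REFERENCES users(id),
    body      TEXT NOT NULL
);

-- Composite indexes that lead with tenant_id keep tenant-scoped queries on index seeks
CREATE INDEX idx_tasks_tenant_project  ON tasks (tenant_id, project_id);
CREATE INDEX idx_tasks_tenant_assignee ON tasks (tenant_id, assignee_id);
```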
Note the deliberate denormalization: tenant_id appears on Task and Comment even though it could be derived through the Project relationship. This is intentional because every query in a multi-tenant system must efficiently filter by tenant, and joining through multiple tables to resolve tenant ownership is unacceptable at scale.
Discuss indexing strategy: composite indexes like (tenant_id, project_id) and (tenant_id, assignee_id) ensure that tenant-scoped queries use index seeks rather than scans. Mention row-level security in PostgreSQL as an enforcement mechanism.
Follow-up considerations: How would you handle cross-tenant features like shared templates? How does your ER model change if you need to support tenant hierarchies (parent organizations with sub-accounts)?
Question 2: Explain normalization through Third Normal Form. When would you stop at 2NF or go beyond 3NF?
What the interviewer is really asking: Do you understand normalization as a set of principles for eliminating data anomalies, not just a textbook exercise? They want to hear you reason about when strict normalization serves the system and when it becomes counterproductive.
Answer framework:
Walk through each normal form with a concrete example. Start with an unnormalized orders table:
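A sketch of what that unnormalized starting point might look like; column names are illustrative:

```sql
-- Unnormalized: repeating product columns and customer data on every order row
CREATE TABLE orders_unnormalized (
    order_id       BIGINT,
    customer_name  TEXT,
    customer_email TEXT,
    product_1_name TEXT,   -- repeating group
    product_1_qty  INT,
    product_2_name TEXT,
    product_2_qty  INT
);
```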
First Normal Form (1NF): Eliminate repeating groups. Each row should contain atomic values, and there should be no repeating columns like product_1, product_2. Extract products into a separate table with a junction table order_items.
Second Normal Form (2NF): Eliminate partial dependencies. If a composite key is (order_id, product_id), then customer_name depends only on order_id, not on the full composite key. Move customer data to the orders table or a customers table.
Third Normal Form (3NF): Eliminate transitive dependencies. If customer_email determines customer_name, and order_id determines customer_email, then customer_name transitively depends on order_id through customer_email. Extract customer information into a dedicated customers table.
When to stop at 2NF: In high-throughput event logging systems where write performance is critical and you rarely query by the partially dependent attributes. For example, a clickstream table where you store user_id, session_id, page_url, user_country even though user_country depends only on user_id.
When to go beyond 3NF (BCNF, 4NF, 5NF): When dealing with complex key constraints in scheduling systems, academic enrollment databases, or any domain where multi-valued dependencies can cause subtle data anomalies. In practice, most production systems target 3NF for transactional data and deliberately denormalize for read-heavy analytics.
The key insight to convey: normalization is about protecting data integrity. Every denormalization is a decision to trade integrity guarantees for performance, and senior engineers make that trade explicitly, not accidentally.
Question 3: You have a read-heavy application where 95% of operations are reads. How do you decide what to denormalize?
What the interviewer is really asking: Can you make principled denormalization decisions based on access patterns rather than guessing? They want to see a methodical approach, not a blanket "denormalize everything for reads."
Answer framework:
Denormalization should be driven by measured query patterns, not intuition. Present a structured decision process:
- Identify the top 10 queries by frequency and latency. Use query logs, slow query reports, and APM data to find the actual bottlenecks.
- Analyze join costs. For each expensive query, determine whether the cost comes from joining across tables, aggregating large datasets, or scanning without indexes. Denormalization only helps with join costs.
- Calculate the write amplification. Every piece of denormalized data must be updated in multiple places. If a customer's name appears in both the customers table and a denormalized orders_with_customer table, every name change requires updating both. Quantify how often the denormalized fields change.
- Choose the denormalization strategy:
  - Precomputed columns: Add a total_amount column to orders instead of computing SUM(order_items.unit_price * quantity) on every read.
  - Materialized views: Create a materialized view that joins frequently accessed tables. Refresh on a schedule or via triggers.
  - Embedded documents: In document stores, embed related data directly. Store the author's name and avatar URL inside each blog post document.
  - Cache tables: Create dedicated read-optimized tables populated by background jobs or change data capture.
- Implement consistency mechanisms. Use database triggers, application-level event handlers, or CDC pipelines to keep denormalized data in sync.
Emphasize that denormalization is a spectrum. You can start with materialized views (easy to maintain, eventual consistency) before committing to physically denormalized tables (harder to maintain, but faster reads).
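A minimal sketch of the materialized-view option, assuming PostgreSQL and hypothetical orders and customers tables:

```sql
-- Read-optimized join, refreshed periodically rather than maintained per write
CREATE MATERIALIZED VIEW orders_with_customer AS
SELECT o.id        AS order_id,
       o.created_at,
       o.total_amount,
       c.id        AS customer_id,
       c.name      AS customer_name
FROM orders o
JOIN customers c ON c.id = o.customer_id;

-- CONCURRENTLY avoids blocking readers during refresh; it requires a unique index
CREATE UNIQUE INDEX ON orders_with_customer (order_id);
REFRESH MATERIALIZED VIEW CONCURRENTLY orders_with_customer;
```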
Question 4: Design a schema for a social media platform where users can follow each other, create posts, and like posts.
What the interviewer is really asking: Can you model graph-like relationships (follows) alongside content entities (posts) and interaction entities (likes) while considering fan-out patterns and query efficiency? This tests your ability to handle asymmetric relationships and high-cardinality associations.
Answer framework:
Start with the core entities and relationships:
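A minimal schema sketch, assuming PostgreSQL; names and types are illustrative:

```sql
CREATE TABLE users (
    id       BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    username TEXT NOT NULL UNIQUE
);

CREATE TABLE follows (
    follower_id  BIGINT NOT NULL REFERENCES users(id),
    following_id BIGINT NOT NULL REFERENCES users(id),
    created_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (follower_id, following_id)           -- prevents duplicate follows
);
CREATE INDEX idx_follows_reverse ON follows (following_id, follower_id);  -- "who follows X?"

CREATE TABLE posts (
    id         BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    author_id  BIGINT NOT NULL REFERENCES users(id),
    body       TEXT NOT NULL,
    like_count INT NOT NULL DEFAULT 0,                -- denormalized counter (see below)
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_posts_author_created ON posts (author_id, created_at DESC);

CREATE TABLE likes (
    user_id    BIGINT NOT NULL REFERENCES users(id),
    post_id    BIGINT NOT NULL REFERENCES posts(id),
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (user_id, post_id)
);
```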
Discuss the critical design decisions:
Follows table: The follows table models a directed graph. The composite primary key (follower_id, following_id) prevents duplicate follows and enables efficient lookups in both directions. Add a reverse index on (following_id, follower_id) to efficiently answer "who follows user X?"
Fan-out problem for feeds: The most expensive query is generating a user's timeline: "show me recent posts from everyone I follow." A naive approach joins follows and posts:
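A sketch of that naive timeline query; :current_user_id is a bind-parameter placeholder:

```sql
-- Fan-out on read: join follows to posts at request time
SELECT p.*
FROM posts p
JOIN follows f ON f.following_id = p.author_id
WHERE f.follower_id = :current_user_id
ORDER BY p.created_at DESC
LIMIT 50;
```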
This query becomes expensive when users follow thousands of accounts. Discuss two strategies: fan-out on write (precompute timelines into a feed table when a post is created) and fan-out on read (compute the timeline at read time with aggressive caching). Most production systems use a hybrid: fan-out on write for regular users, fan-out on read for celebrity accounts with millions of followers.
Likes count denormalization: Add a like_count column to posts and increment it via a trigger or application logic when a like is inserted. This avoids COUNT(*) queries on the likes table for every post render.
For deeper context on modeling these relationships at scale, see our guide on designing a social media feed and the underlying concepts of graph data modeling.
Question 5: What is dimensional modeling and when would you use it instead of traditional ER modeling?
What the interviewer is really asking: Do you understand that OLTP and OLAP workloads have fundamentally different data modeling requirements? Can you explain why the same normalization principles that protect transactional data become liabilities in analytical workloads?
Answer framework:
Dimensional modeling is a schema design technique optimized for analytical queries, developed by Ralph Kimball. It organizes data into fact tables (measurements and metrics) and dimension tables (descriptive context for those measurements). The key insight is that analytical queries follow a predictable pattern: filter by dimensions, aggregate facts.
When to use ER modeling (3NF):
- OLTP systems where writes are frequent and must be consistent
- Applications where the data access patterns are varied and unpredictable
- Systems where storage efficiency matters and data redundancy must be minimized
When to use dimensional modeling:
- Data warehouses and analytical databases
- Reporting systems where query patterns are known in advance
- Business intelligence platforms where query performance on aggregations is critical
- When business users need to write their own queries (dimensional models are more intuitive)
The fundamental difference: ER modeling minimizes redundancy to protect data integrity during writes. Dimensional modeling embraces redundancy to maximize query performance during reads. A fully normalized ER model for a retail system might have 15 tables with complex joins. The equivalent dimensional model might have one fact table and four dimension tables, with every query requiring at most one join per dimension.
The dimensional model is dramatically simpler to query. A question like "What was total revenue by product category and store region last quarter?" requires two joins in the dimensional model versus five or more joins in the normalized model.
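A sketch of that question against a star schema; fact and dimension names are assumptions:

```sql
-- Revenue by product category and store region for a quarter: two dimension joins
SELECT p.category,
       s.region,
       SUM(f.revenue) AS total_revenue
FROM fact_sales f
JOIN dim_product p ON p.product_key = f.product_key
JOIN dim_store   s ON s.store_key   = f.store_key
WHERE f.sale_date >= DATE '2025-10-01'
  AND f.sale_date <  DATE '2026-01-01'
GROUP BY p.category, s.region;
```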
For more on when to choose different database paradigms, see our comparison of OLTP vs OLAP databases.
Question 6: Explain the star schema and snowflake schema. When would you choose one over the other?
What the interviewer is really asking: Can you articulate the trade-offs between query simplicity and storage efficiency in dimensional models? Do you understand how the choice impacts query performance, ETL complexity, and end-user experience?
Answer framework:
Star schema: The fact table sits at the center, directly connected to denormalized dimension tables. Each dimension table contains all descriptive attributes, even if that means data redundancy within the dimension.
Snowflake schema: Dimension tables are normalized into sub-dimension tables. The product dimension splits into dim_product, dim_category, dim_subcategory, and dim_supplier.
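A side-by-side sketch of the same product dimension in both styles; column names are illustrative:

```sql
-- Star: one denormalized dimension table
CREATE TABLE dim_product (
    product_key   BIGINT PRIMARY KEY,
    product_name  TEXT,
    category      TEXT,        -- repeated for every product in the category
    subcategory   TEXT,
    supplier_name TEXT
);

-- Snowflake: the dimension normalized into sub-dimensions
CREATE TABLE dim_category (category_key BIGINT PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_supplier (supplier_key BIGINT PRIMARY KEY, supplier_name TEXT);
CREATE TABLE dim_product_snowflaked (
    product_key  BIGINT PRIMARY KEY,
    product_name TEXT,
    category_key BIGINT REFERENCES dim_category(category_key),
    supplier_key BIGINT REFERENCES dim_supplier(supplier_key)
);
```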
When to choose star schema:
- When query performance is the top priority and the query engine benefits from fewer joins
- When business analysts write their own queries (simpler to understand)
- When dimension tables are small enough that redundancy has negligible storage cost
- Most modern columnar data warehouses (Snowflake, BigQuery, Redshift) are optimized for star schemas
When to choose snowflake schema:
- When dimension data changes frequently and you want to update it in one place
- When dimensions are very large and redundancy causes meaningful storage overhead
- When you need to enforce referential integrity within dimension hierarchies
- When ETL processes benefit from loading normalized dimension data incrementally
In practice, the star schema dominates modern data warehousing. The storage savings of snowflaking are negligible with modern columnar compression, and the query simplicity of star schemas is a significant productivity advantage for analytics teams.
Question 7: What are Slowly Changing Dimensions (SCDs) and how do you implement Type 1, Type 2, and Type 3?
What the interviewer is really asking: Do you understand the temporal dimension of data modeling? When a customer changes their address or a product changes its category, how do you preserve historical accuracy in reports while keeping the schema maintainable?
Answer framework:
Slowly Changing Dimensions address the fundamental problem that dimension attributes change over time, but historical facts should be reported against the dimension values that were true when the fact occurred.
Type 1: Overwrite. Simply update the dimension record. Historical values are lost.
Use Type 1 when: historical tracking is not needed (e.g., fixing a typo in a product name), or for attributes where the current value is always correct (e.g., a customer's current loyalty tier for segmentation).
Type 2: Add a new row. Create a new dimension record with an effective date range. The old record is marked as inactive.
Use Type 2 when: you need accurate historical reporting (e.g., "What were sales by customer region last year?" should reflect where customers actually lived last year).
Type 3: Add a column. Store both the current and previous value in the same row.
Use Type 3 when: you only need to track one level of history, and the business requirement is to compare "current vs. previous" rather than a full history.
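A compact sketch of all three update strategies against a hypothetical dim_customer table, where customer_id is the business key and region is the tracked attribute:

```sql
-- Type 1: overwrite in place, history is lost
UPDATE dim_customer SET region = 'EMEA' WHERE customer_id = 42;

-- Type 2: close out the current row, then insert a new versioned row
UPDATE dim_customer
SET valid_to = CURRENT_DATE, is_current = FALSE
WHERE customer_id = 42 AND is_current;

INSERT INTO dim_customer (customer_id, region, valid_from, valid_to, is_current)
VALUES (42, 'EMEA', CURRENT_DATE, NULL, TRUE);

-- Type 3: keep exactly one level of history in a dedicated column
ALTER TABLE dim_customer ADD COLUMN previous_region TEXT;
UPDATE dim_customer
SET previous_region = region, region = 'EMEA'
WHERE customer_id = 42;
```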
Hybrid approaches: In practice, many systems use Type 1 for non-critical attributes (like a display name) and Type 2 for business-critical attributes (like region or segment) within the same dimension table. The key is to align the SCD strategy with the business reporting requirements.
For related patterns in event-driven architectures, see our guide on event sourcing.
Question 8: How would you model a product catalog that supports dynamic attributes across different product categories?
What the interviewer is really asking: Can you handle the tension between relational rigidity and business flexibility? Products in different categories have different attributes (a laptop has RAM and screen size; a shirt has size and color), and the business wants to add new attributes without schema changes.
Answer framework:
This is one of the classic schema design challenges. Present multiple approaches with their trade-offs:
Approach 1: Entity-Attribute-Value (EAV)
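A minimal EAV sketch; table and column names are illustrative:

```sql
CREATE TABLE products (
    id          BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name        TEXT NOT NULL,
    category_id BIGINT
);

CREATE TABLE product_attributes (
    product_id      BIGINT NOT NULL REFERENCES products(id),
    attribute_name  TEXT   NOT NULL,   -- e.g. 'ram_gb', 'color'
    attribute_value TEXT   NOT NULL,   -- everything stored as text
    PRIMARY KEY (product_id, attribute_name)
);
```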
Pros: Completely flexible, no schema changes needed. Cons: Queries are painful (pivot queries to reconstruct a product), no type safety, indexing is difficult, and reporting becomes complex. EAV is widely considered an anti-pattern in relational databases for good reason.
Approach 2: JSON/JSONB column
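A minimal JSONB sketch, assuming PostgreSQL:

```sql
CREATE TABLE products (
    id         BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name       TEXT NOT NULL,
    category   TEXT NOT NULL,
    attributes JSONB NOT NULL DEFAULT '{}'::jsonb
);

-- GIN index supports containment filters such as: attributes @> '{"color": "red"}'
CREATE INDEX idx_products_attributes ON products USING GIN (attributes);
```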
Pros: Flexible, good query support in PostgreSQL and MySQL 8+, can be indexed. Cons: No schema enforcement within the JSON (can be mitigated with JSON Schema validation or CHECK constraints), harder to do cross-product analytics on specific attributes.
Approach 3: Category-specific tables (table-per-type inheritance)
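A sketch of table-per-type inheritance: a shared base table plus one attribute table per category (names illustrative):

```sql
CREATE TABLE products (
    id       BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name     TEXT NOT NULL,
    category TEXT NOT NULL
);

CREATE TABLE laptop_attributes (
    product_id    BIGINT PRIMARY KEY REFERENCES products(id),
    ram_gb        INT NOT NULL,
    screen_inches NUMERIC(4,1) NOT NULL
);

CREATE TABLE shirt_attributes (
    product_id BIGINT PRIMARY KEY REFERENCES products(id),
    size       TEXT NOT NULL,
    color      TEXT NOT NULL
);
```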
Pros: Full type safety, efficient queries within a category, clean indexing. Cons: Adding a new category requires a schema change, cross-category queries require UNION ALL, application code must know which attribute table to join.
Recommended approach for most systems: Use the JSONB column approach with validation. It provides the best balance of flexibility and queryability. Add computed/generated columns for the most frequently filtered attributes to enable standard indexing:
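A sketch of that technique, building on the JSONB table above and assuming PostgreSQL 12+; the brand attribute is an illustrative choice:

```sql
-- Extract a frequently filtered attribute into a typed, indexable generated column
ALTER TABLE products
    ADD COLUMN brand TEXT GENERATED ALWAYS AS (attributes ->> 'brand') STORED;

CREATE INDEX idx_products_brand ON products (brand);
```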
This gives you the flexibility of schema-less attributes with the query performance of typed columns for the attributes that matter most.
Question 9: How do you model hierarchical data in a relational database? Compare at least three approaches.
What the interviewer is really asking: Hierarchical data (org charts, category trees, file systems, comment threads) is notoriously difficult to model in relational databases. Can you reason about the performance characteristics of different approaches for reads, writes, and tree mutations?
Answer framework:
Approach 1: Adjacency List
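A minimal sketch, assuming PostgreSQL; a categories tree is the illustrative example:

```sql
CREATE TABLE categories (
    id        BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    parent_id BIGINT REFERENCES categories(id),  -- NULL for the root
    name      TEXT NOT NULL
);

-- Fetching a whole subtree requires a recursive CTE
WITH RECURSIVE subtree AS (
    SELECT * FROM categories WHERE id = 5
    UNION ALL
    SELECT c.* FROM categories c JOIN subtree s ON c.parent_id = s.id
)
SELECT * FROM subtree;
```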
Pros: Simple schema, easy inserts and moves (just update parent_id). Cons: Fetching an entire subtree requires recursive queries (CTEs), which can be slow for deep trees. Fetching ancestors requires walking up the tree.
Approach 2: Materialized Path (Path Enumeration)
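A minimal sketch; the path format is illustrative:

```sql
CREATE TABLE categories (
    id   BIGINT PRIMARY KEY,
    path TEXT NOT NULL,          -- e.g. '/1/5/12/'
    name TEXT NOT NULL
);

-- All descendants of node 5 under node 1: a simple prefix query
SELECT * FROM categories WHERE path LIKE '/1/5/%';
```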
Pros: Finding all descendants is a simple LIKE query (WHERE path LIKE '/1/5/%'), finding ancestors is trivial (parse the path). Cons: Moving a subtree requires updating the path of every descendant. Path length is bounded by column size.
Approach 3: Nested Sets
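A minimal sketch of the left/right numbering scheme:

```sql
CREATE TABLE categories (
    id   BIGINT PRIMARY KEY,
    lft  INT NOT NULL,
    rgt  INT NOT NULL,
    name TEXT NOT NULL
);

-- All descendants of a node: a single range condition, no recursion
SELECT child.*
FROM categories parent
JOIN categories child ON child.lft BETWEEN parent.lft AND parent.rgt
WHERE parent.id = 5;
```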
Pros: Finding all descendants is a single range query with no recursion, extremely fast reads. Cons: Insertions and moves require renumbering large portions of the tree, making writes very expensive. Best for read-heavy, rarely-modified hierarchies.
Approach 4: Closure Table
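A minimal sketch; table names are illustrative:

```sql
CREATE TABLE categories (
    id   BIGINT PRIMARY KEY,
    name TEXT NOT NULL
);

-- One row per (ancestor, descendant) pair, including each node paired with itself at depth 0
CREATE TABLE category_closure (
    ancestor_id   BIGINT NOT NULL REFERENCES categories(id),
    descendant_id BIGINT NOT NULL REFERENCES categories(id),
    depth         INT    NOT NULL,
    PRIMARY KEY (ancestor_id, descendant_id)
);

-- All descendants of node 5
SELECT c.*
FROM category_closure cc
JOIN categories c ON c.id = cc.descendant_id
WHERE cc.ancestor_id = 5 AND cc.depth > 0;
```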
Pros: Fast reads for any tree operation (ancestors, descendants, depth), supports multiple hierarchies. Cons: The closure table can grow large (O(n^2) in the worst case for a flat tree), and inserts require adding multiple rows.
Recommendation by use case:
- Comment threads with shallow nesting: Adjacency List (simple, CTEs handle 3-5 levels easily)
- File system or URL paths: Materialized Path (natural fit, fast prefix queries)
- Category taxonomies that rarely change: Nested Sets (blazing read performance)
- Complex organizational hierarchies with frequent restructuring: Closure Table (best balance of read and write performance)
Question 10: You are designing a schema for a booking system (hotel, restaurant, or appointment). What are the critical modeling decisions?
What the interviewer is really asking: Booking systems are deceptively complex because they involve temporal data, concurrency, state machines, and the interplay between availability and reservations. Can you identify the non-obvious modeling challenges?
Answer framework:
The critical entities are: Resource (room, table, time slot), Booking, Customer, and potentially BlockedTime (maintenance, holidays).
Critical modeling decisions:
- Preventing double bookings. This is the core challenge. Use an exclusion constraint in PostgreSQL (see the sketch after this list). Such a database-level constraint guarantees that no two active bookings for the same resource overlap in time, regardless of application-level race conditions.
- Booking status as a state machine. Model the allowed transitions: pending -> confirmed -> checked_in -> completed, plus pending -> cancelled and confirmed -> cancelled. Enforce these transitions in application logic or database triggers, not just in the schema.
- Recurring bookings. Do you store each occurrence as a separate row (materialized) or store a recurrence rule and generate occurrences at query time (virtual)? Materialized is simpler to query and enforce constraints on; virtual saves storage for long-running recurring bookings.
- Timezone handling. Always store timestamps in UTC. Store the resource's timezone separately. A hotel room in Tokyo and one in New York both have bookings stored in UTC, but "check-in at 3 PM" means different absolute times.
- Buffer time. Hotel rooms need cleaning between guests; restaurant tables need turnover time. Model this as a configurable buffer per resource type and include it in the overlap constraint.
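A sketch of the exclusion constraint, assuming PostgreSQL; table and column names are illustrative:

```sql
-- btree_gist lets the GiST exclusion constraint mix equality (resource_id) with range overlap
CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE TABLE bookings (
    id          BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    resource_id BIGINT NOT NULL,
    customer_id BIGINT NOT NULL,
    period      TSTZRANGE NOT NULL,          -- [check_in, check_out)
    status      TEXT NOT NULL DEFAULT 'pending',
    EXCLUDE USING gist (
        resource_id WITH =,
        period      WITH &&
    ) WHERE (status <> 'cancelled')          -- cancelled bookings do not block new ones
);
```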
This pattern connects to broader system design concepts around concurrency control and temporal data management.
Question 11: How do you design a schema that supports soft deletes without compromising query performance or data integrity?
What the interviewer is really asking: Soft deletes are ubiquitous in production systems, but they introduce subtle problems with unique constraints, foreign keys, and query complexity. Do you understand these pitfalls and know how to mitigate them?
Answer framework:
The naive approach is adding a deleted_at timestamp column and filtering every query with WHERE deleted_at IS NULL. This works initially but creates several problems at scale:
Problem 1: Unique constraints break. If a user deletes their account with email jane@example.com and a new user tries to register with the same email, the unique constraint on email blocks them because the soft-deleted row still exists.
Solution: Use a partial unique index:
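A sketch, assuming PostgreSQL and a users table with email and deleted_at columns:

```sql
-- Uniqueness is enforced only among rows that have not been soft-deleted
CREATE UNIQUE INDEX idx_users_email_active
ON users (email)
WHERE deleted_at IS NULL;
```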
Problem 2: Every query must remember the filter. Developers will inevitably forget to add WHERE deleted_at IS NULL, causing bugs where deleted data appears in the application.
Solution: Create a view for active records and use it as the default access path:
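A minimal sketch of the view approach under the same assumptions:

```sql
-- Application code queries the view; the base table stays behind it
CREATE VIEW active_users AS
SELECT * FROM users WHERE deleted_at IS NULL;
```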
Or use row-level security policies in PostgreSQL to automatically filter deleted rows for application roles.
Problem 3: Table bloat. Soft-deleted records accumulate and slow down queries even with the deleted_at filter, because the database still scans past them (or maintains them in indexes).
Solution: Partition the table by deletion status, or implement a periodic archival process that moves soft-deleted records older than a retention period to an archive table.
Problem 4: Cascading soft deletes. When a user is soft-deleted, should their posts also be soft-deleted? The database's foreign key ON DELETE actions don't trigger for soft deletes; you must implement cascading logic in the application.
The mature pattern: use an is_active boolean with a partial index for the common case, a deleted_at timestamp for auditing, and a scheduled archival job for long-term data management.
Question 12: What are the trade-offs between using UUIDs and auto-incrementing integers as primary keys?
What the interviewer is really asking: This seemingly simple question probes your understanding of index performance, distributed system coordination, data migration, and security. Senior engineers should have a nuanced view beyond "UUIDs are random so they are bad for indexes."
Answer framework:
Auto-incrementing integers:
- Pros: Compact (4-8 bytes), sequential insertion is B-tree friendly (no page splits), human-readable for debugging, natural ordering by creation time.
- Cons: Require centralized coordination (a single sequence or auto-increment counter), expose ordering information (user_id 1001 was created before 1002), cannot be generated client-side, merge conflicts during database migrations or multi-region setups.
UUIDs (v4 random):
- Pros: Can be generated anywhere without coordination (client, API server, microservice), no information leakage about ordering or volume, globally unique across systems.
- Cons: Large (16 bytes, 36 characters as string), random insertion causes B-tree page splits and index fragmentation, not human-readable.
UUIDv7 (time-ordered): The modern answer. UUIDv7 (RFC 9562) embeds a timestamp in the most significant bits, making them time-ordered while preserving global uniqueness. This gives you the coordination-free generation of UUIDs with the sequential insertion performance of integers.
Recommended strategy for most systems:
- Use UUIDv7 as the primary key for new systems that will operate across multiple services or regions.
- Use auto-incrementing integers for internal tables that are never exposed to external systems and where single-database simplicity is sufficient.
- Never expose auto-incrementing IDs in URLs or APIs. Use a separate public-facing identifier (slug, UUID, or encoded ID) even if the internal PK is an integer.
- If using UUID as PK, store it as a native UUID type (16 bytes binary), never as VARCHAR(36).
For deeper coverage of primary key strategies in distributed systems, see our interview questions on distributed databases and concepts around database sharding.
Question 13: How do you model many-to-many relationships that have their own attributes? Give an example with a real-world use case.
What the interviewer is really asking: Can you move beyond textbook many-to-many junction tables and model associative entities that carry business-critical data? This tests whether you think about relationships as first-class domain concepts.
Answer framework:
A simple many-to-many (students and courses) becomes an associative entity (enrollment) when the relationship itself has attributes: enrollment_date, grade, status, section_number.
Key design decisions in associative entities:
- Give it its own primary key. An enrollment is a first-class entity that other tables may need to reference (e.g., assignment_submissions references enrollment_id, not the composite of student and course).
- Capture temporality. Students can take the same course in different semesters. The unique constraint should include the temporal dimension.
- Model the lifecycle. An enrollment has states (enrolled, waitlisted, dropped, completed) and transitions. Store timestamps for each state change for auditing.
Real-world example: E-commerce order items. The relationship between orders and products is many-to-many, but the order_items table carries critical attributes:
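A sketch of that associative table, assuming existing orders and products tables; column names are illustrative:

```sql
CREATE TABLE order_items (
    id         BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    order_id   BIGINT NOT NULL REFERENCES orders(id),
    product_id BIGINT NOT NULL REFERENCES products(id),
    quantity   INT NOT NULL CHECK (quantity > 0),
    unit_price NUMERIC(10,2) NOT NULL,   -- captured at purchase time, not joined from products
    discount   NUMERIC(10,2) NOT NULL DEFAULT 0,
    UNIQUE (order_id, product_id)
);
```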
Note that unit_price is stored on the order item, not looked up from the product. This is deliberate: the price at the time of purchase must be preserved regardless of future product price changes. This is a form of point-in-time snapshotting that connects to the Slowly Changing Dimension concepts discussed in Question 7.
Question 14: How do you approach schema evolution in a production system with zero downtime?
What the interviewer is really asking: Schema changes in production are one of the riskiest operations a senior engineer performs. Can you execute DDL changes without locking tables, breaking application compatibility, or losing data?
Answer framework:
The core principle is expand and contract: first expand the schema to support both the old and new format, deploy application changes that use the new format, then contract the schema by removing the old format.
Example: Renaming a column from full_name to display_name.
Phase 1 (Expand): Add the new column.
Phase 2 (Dual-write): Deploy application code that writes to both columns.
Phase 3 (Switchover): Deploy application code that reads from the new column.
Phase 4 (Contract): Drop the old column after confirming no code reads from it.
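A condensed sketch of the DDL and backfill involved, assuming PostgreSQL and the users example above (the dual-write and read-switchover phases live in application code):

```sql
-- Phase 1 (expand): add the new column; no default avoids a long rewrite under lock
ALTER TABLE users ADD COLUMN display_name TEXT;

-- Between phases 2 and 3: backfill in batches; repeat until zero rows are updated
UPDATE users
SET display_name = full_name
WHERE id IN (
    SELECT id FROM users WHERE display_name IS NULL LIMIT 10000
);

-- Phase 4 (contract): drop the old column once no code path reads it
ALTER TABLE users DROP COLUMN full_name;
```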
Critical operational considerations:
- Lock-free DDL. In PostgreSQL, ALTER TABLE ADD COLUMN with a default value rewrote the entire table under an ACCESS EXCLUSIVE lock in versions before 11. Use ADD COLUMN without a default, then backfill. In MySQL, use pt-online-schema-change or gh-ost for lock-free alterations.
- Backfill in batches. Never run UPDATE users SET display_name = full_name on a large table in a single transaction. Process in batches of 1,000-10,000 rows with small pauses to avoid overwhelming the database.
- Backward-compatible migrations only. Every migration must be safe to run while the previous version of the application is still serving traffic. This means: never drop columns or tables in the same deploy that removes the code using them. Never add NOT NULL constraints without a default until all rows have been backfilled.
- Feature flags for data model changes. Gate the application's use of new columns behind feature flags so you can roll back the application independently of the database migration.
For more on zero-downtime deployments, see our guide on deployment strategies and database migration patterns.
Question 15: You need to design a schema for a permissions system that supports roles, resource-level permissions, and permission inheritance. Walk me through your approach.
What the interviewer is really asking: Authorization systems are a superset of data modeling challenges: hierarchical data (role inheritance), many-to-many relationships with attributes (permissions on resources), and query performance requirements (every API request checks permissions). Can you design a schema that is correct, performant, and maintainable?
Answer framework:
Start with the core authorization model: Role-Based Access Control (RBAC) with resource-level granularity.
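A minimal sketch of the tables involved, assuming PostgreSQL; names are illustrative and the foreign key to a users table is omitted for brevity:

```sql
CREATE TABLE roles (
    id   BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name TEXT NOT NULL UNIQUE              -- e.g. 'admin', 'editor', 'viewer'
);

CREATE TABLE permissions (
    id   BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name TEXT NOT NULL UNIQUE              -- e.g. 'project.delete'
);

CREATE TABLE role_permissions (
    role_id       BIGINT NOT NULL REFERENCES roles(id),
    permission_id BIGINT NOT NULL REFERENCES permissions(id),
    PRIMARY KEY (role_id, permission_id)
);

-- Role assignments scoped to a specific resource
CREATE TABLE user_roles (
    user_id       BIGINT NOT NULL,
    role_id       BIGINT NOT NULL REFERENCES roles(id),
    resource_type TEXT   NOT NULL,         -- e.g. 'project'
    resource_id   BIGINT NOT NULL,
    PRIMARY KEY (user_id, role_id, resource_type, resource_id)
);

-- Transitive closure of the role hierarchy, including each role paired with itself
CREATE TABLE role_closure (
    role_id           BIGINT NOT NULL REFERENCES roles(id),
    inherited_role_id BIGINT NOT NULL REFERENCES roles(id),
    PRIMARY KEY (role_id, inherited_role_id)
);
```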
Key design decisions:
- Scoped role assignments. A user might be an admin of Project A but only a viewer of Project B. The user_roles table includes resource_type and resource_id to scope the assignment.
- Role hierarchy with closure table. For efficient "does user X have permission Y?" checks, precompute the transitive closure of the role hierarchy. If admin inherits from editor which inherits from viewer, the closure table stores all three relationships for admin.
- Permission check query. The permission check must be fast because it runs on every API request; a sketch of the query follows this list.
- Caching strategy. Materialize each user's effective permissions into a cache (Redis or application memory) and invalidate when role assignments or role definitions change. The permission check in the hot path should be a cache lookup, not a database query.
- Audit trail. Every permission grant and revocation should be logged in an append-only audit table with the acting user, timestamp, and the before/after state.
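A sketch of the hot-path check under the schema assumed above; the :placeholders are bind parameters:

```sql
-- Does :user_id hold :permission_name on the given resource?
SELECT EXISTS (
    SELECT 1
    FROM user_roles ur
    JOIN role_closure rc     ON rc.role_id = ur.role_id
    JOIN role_permissions rp ON rp.role_id = rc.inherited_role_id
    JOIN permissions p       ON p.id = rp.permission_id
    WHERE ur.user_id       = :user_id
      AND ur.resource_type = :resource_type
      AND ur.resource_id   = :resource_id
      AND p.name           = :permission_name
) AS has_permission;
```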
For more on designing authorization systems at scale, see our system design interview guide and interview questions on access control.
How to Practice
Data modeling skills develop through deliberate practice on realistic problems. Here is a structured approach:
- Model real systems. Take applications you use daily (Spotify, Uber, Slack) and design their core schemas from scratch. Focus on the three hardest tables, not the entire database.
- Practice the trade-off conversation. For every schema decision, articulate what you are gaining and what you are giving up. Write it down: "I am denormalizing the customer name onto the order because read frequency is 100x write frequency and the join adds 15ms to the p99."
- Use a real database. Write actual CREATE TABLE statements and run queries against them. Loading even a small amount of test data will reveal problems that diagramming alone misses: missing indexes, awkward join paths, constraint violations.
- Study production schemas. Open-source projects like GitLab, Discourse, and Mastodon publish their database schemas. Read them to see how experienced teams model complex domains.
- Time-box your design. In interviews, you typically have 20-30 minutes for a schema design. Practice completing a full design (entities, relationships, key indexes, one complex query) within that window.
For structured interview preparation with feedback, check out our interview preparation plans and system design interview guide.
Common Mistakes to Avoid
- Over-normalizing from the start. Do not reflexively normalize to 3NF without considering the actual query patterns. Sometimes the right answer is a moderately denormalized schema with clear consistency rules.
- Ignoring the write path. Many candidates design schemas optimized for reads and forget to discuss how writes work. Every denormalization creates a write amplification cost. Quantify it.
- Forgetting about NULL semantics. NULLs in SQL have surprising behavior in comparisons, aggregations, and unique constraints. Design your schema to minimize columns that can be NULL, and when NULLs are necessary, understand their implications on queries.
- Using generic column names. Columns named type, status, or value without context make schemas unreadable. Use payment_status, account_type, metric_value. Your schema is documentation.
- Skipping indexes in the design. The schema is not complete without its index strategy. For every query you describe in the interview, identify which index serves it. If you cannot name the index, the query will table-scan.
- Ignoring data volume. A schema that works for 10,000 rows can fail at 10 million. Always state your assumptions about data volume and growth rate, and design accordingly.
- Not discussing constraints. Foreign keys, CHECK constraints, UNIQUE constraints, and exclusion constraints are part of the data model. They prevent bugs that application logic alone cannot catch. Mention them explicitly.
- Designing in isolation. Data models exist to serve applications. Always connect your schema design back to the access patterns, consistency requirements, and operational characteristics of the system it supports.