How to Learn Database Internals

A structured guide to learning database internals — covering storage engines, indexing, query execution, transactions, and how databases actually work under the hood.

databases, internals, storage-engines, indexing, learning-path

Most software engineers use databases every day without understanding how they work internally. They write SQL queries, choose between PostgreSQL and MongoDB, and trust that the database handles storage, indexing, and transactions correctly. This works fine until it does not — until a query is mysteriously slow, a transaction produces unexpected results, or you need to choose between databases for a system design decision.

Understanding database internals transforms you from a database user into a database-literate engineer who can make informed architectural decisions, debug complex performance issues, and design data-intensive systems with confidence.

Why Learn Database Internals

Better engineering decisions: When you understand how B-trees, LSM trees, and hash indexes work, you can choose the right database for your workload instead of defaulting to PostgreSQL for everything. You can design schemas and queries that work with the storage engine instead of against it.

Performance debugging: Production database performance issues are among the most impactful problems engineers face. Understanding query execution plans, buffer pool management, lock contention, and I/O patterns lets you diagnose and fix problems that stump engineers who only know SQL.

System design excellence: Database selection and data modeling are critical components of system design interviews. Interviewers at the senior level expect you to explain why you chose a particular database and how it works under the hood. See our system design for senior interviews guide.

Career differentiation: Deep database knowledge is relatively rare among application developers. Engineers who understand database internals are sought after for senior and staff roles, particularly in data-intensive domains. See our Staff Engineer salary guide.

Prerequisites

  • SQL proficiency: You should be comfortable writing complex SQL queries including joins, subqueries, aggregations, and window functions.
  • Basic data structures: Arrays, linked lists, hash tables, binary trees, and their time complexities. Database internals build heavily on these foundations.
  • Operating systems basics: File systems, memory management, disk I/O. Databases are fundamentally programs that manage data on disk and in memory efficiently.
  • Algorithms: Sorting algorithms, search algorithms, basic complexity analysis. Many database internals are optimized versions of algorithms you already know.

Learning Path

Week 1-2: Storage Engines

Goal: Understand how databases physically store and retrieve data on disk.

This is the foundation of database internals. Two storage engine architectures dominate:

B-tree storage engines (PostgreSQL, MySQL InnoDB, SQL Server):

  • B-tree structure: how data is organized in pages/blocks for efficient random reads
  • Page layout: how rows are stored within pages, slot arrays, overflow pages
  • Buffer pool: how the database caches pages in memory to avoid disk I/O
  • Write-ahead log (WAL): how changes are durably recorded before modifying data pages
  • Compaction and vacuuming: how B-tree databases reclaim space from deleted/updated rows
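The log-before-pages ordering behind WAL can be sketched in a few lines. This is a toy, not how any real engine structures its log: `MiniWAL` and its JSON record format are invented for illustration, and real systems batch fsyncs, checksum records, and replay only from the last checkpoint.

```python
import json
import os
import tempfile

class MiniWAL:
    """Toy write-ahead log: every change is appended (and fsynced) to the
    log *before* the in-memory "data page" is modified, so a crash after
    the append can be repaired by replaying the log."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.pages = {}  # stand-in for the buffer pool / data pages

    def put(self, key, value):
        # 1. Durably record the intent first...
        with open(self.log_path, "a") as log:
            log.write(json.dumps({"op": "put", "key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. ...only then modify the data page.
        self.pages[key] = value

    def recover(self):
        # Rebuild page state by replaying the log from the beginning.
        self.pages = {}
        with open(self.log_path) as log:
            for line in log:
                rec = json.loads(line)
                if rec["op"] == "put":
                    self.pages[rec["key"]] = rec["value"]

path = os.path.join(tempfile.mkdtemp(), "wal.log")
db = MiniWAL(path)
db.put("a", 1)
db.put("a", 2)
crashed = MiniWAL(path)  # simulate a restart with an empty page cache
crashed.recover()
print(crashed.pages)     # {'a': 2}
```

The key invariant to notice: at no point does a page change exist that the log does not already describe.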

LSM-tree storage engines (RocksDB, LevelDB, Cassandra, HBase):

  • Memtable and SSTable: how writes go to memory first, then flush to sorted files on disk
  • Compaction strategies: size-tiered vs leveled compaction and their trade-offs
  • Bloom filters: probabilistic data structures that avoid unnecessary disk reads
  • Write amplification vs read amplification: the fundamental trade-off between B-trees and LSM-trees
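To make the memtable/SSTable/Bloom-filter interplay concrete, here is a toy LSM store (all class names are invented; a real engine writes SSTables to disk and compacts them, which this in-memory sketch skips):

```python
import bisect
import hashlib

class BloomFilter:
    """Tiny Bloom filter: may answer "maybe present", never misses a real key."""
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = 0  # integer used as a bit array

    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.array |= 1 << p

    def might_contain(self, key):
        return all(self.array & (1 << p) for p in self._positions(key))

class MiniLSM:
    """Writes land in an in-memory memtable; when full, it is frozen into an
    immutable sorted run (an "SSTable"). Reads check the memtable first, then
    SSTables newest-to-oldest, skipping runs via their Bloom filters."""
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.memtable_limit = memtable_limit
        self.sstables = []  # list of (sorted_keys, data, bloom), newest last

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        bloom = BloomFilter()
        for k in self.memtable:
            bloom.add(k)
        self.sstables.append((sorted(self.memtable), dict(self.memtable), bloom))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for keys, data, bloom in reversed(self.sstables):  # newest version wins
            if not bloom.might_contain(key):
                continue  # Bloom filter saves a (simulated) disk read
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return data[key]
        return None

db = MiniLSM(memtable_limit=2)
for i in range(5):
    db.put(f"k{i}", i)
print(db.get("k0"), db.get("k4"), db.get("missing"))  # 0 4 None
```

Note what is missing: compaction. Without it, reads degrade as SSTables accumulate and deleted data is never reclaimed, which is exactly why the compaction strategies above matter.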

Study the trade-offs: B-trees optimize for reads (efficient random lookups), LSM-trees optimize for writes (sequential I/O). Understanding this trade-off is essential for choosing the right database. Review SQL vs NoSQL through the lens of storage engines.

Week 3-4: Indexing Deep Dive

Goal: Understand how indexes work and how to use them effectively.

  • B-tree indexes: The most common index type. Understand how lookups, range scans, and ordered retrieval work. Understand why an index on (a, b) is useful for queries on (a) and (a, b) but not (b) alone.
  • Hash indexes: Constant-time lookups but no range queries. When they make sense.
  • Composite indexes: Column order matters. The leftmost prefix rule. How the optimizer uses composite indexes.
  • Covering indexes: When the index contains all columns needed by the query, avoiding table lookups entirely.
  • Full-text indexes: Inverted indexes for text search. How tokenization, stemming, and ranking work.
  • Index anti-patterns: Over-indexing (slowing writes for marginal read improvement), missing indexes on foreign keys, indexing low-cardinality columns.
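The leftmost-prefix rule is easy to see if you model a composite index on (a, b) as a sorted list of (a, b, rowid) tuples. The data below is hypothetical, purely for illustration:

```python
import bisect

# A composite index on (a, b) is, conceptually, the rows' (a, b, rowid)
# tuples kept in sorted order.
rows = [(3, 9), (1, 5), (3, 2), (1, 7), (2, 5)]
index = sorted((a, b, rowid) for rowid, (a, b) in enumerate(rows))

def lookup_a(a):
    """Prefix query on (a): binary-search to the first entry, scan the run."""
    i = bisect.bisect_left(index, (a,))
    out = []
    while i < len(index) and index[i][0] == a:
        out.append(index[i][2])
        i += 1
    return out

def lookup_b(b):
    """Query on (b) alone: the sort order doesn't help, so every index
    entry must be examined -- the leftmost-prefix rule in action."""
    return [rowid for a, bb, rowid in index if bb == b]

print(lookup_a(1))  # O(log n + matches)
print(lookup_b(5))  # O(n): full index scan
```

The same picture explains covering indexes: if the query only needs columns already present in the index entries, the table itself is never touched.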

Week 5-6: Query Execution and Optimization

Goal: Understand how the database transforms SQL into an execution plan.

The query execution pipeline:

  1. Parsing: SQL text to abstract syntax tree
  2. Planning/Optimization: The query optimizer considers different execution strategies (join algorithms, index usage, scan order) and estimates costs using table statistics
  3. Execution: The chosen plan is executed, with operators (scan, filter, join, sort, aggregate) organized in a tree structure
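A back-of-envelope cost model shows why the planner flips between scan types as selectivity changes. The constants below are invented, though PostgreSQL's real `seq_page_cost` and `random_page_cost` settings play the same role:

```python
def choose_scan(table_rows, selectivity, seq_page_cost=1.0,
                random_page_cost=4.0, rows_per_page=100):
    """Toy cost comparison: read every page sequentially, or do one
    random page read per matching row via an index (worst case)."""
    pages = table_rows / rows_per_page
    seq_cost = pages * seq_page_cost
    matching_rows = table_rows * selectivity
    index_cost = matching_rows * random_page_cost
    return "index scan" if index_cost < seq_cost else "sequential scan"

print(choose_scan(1_000_000, 0.0001))  # few matches -> index scan
print(choose_scan(1_000_000, 0.5))     # half the table -> sequential scan
```

This is also why stale statistics hurt: if the optimizer's selectivity estimate is wrong, it plugs the wrong number into comparisons like this one.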

Key topics:

  • Join algorithms: Nested loop joins, hash joins, merge joins. When the optimizer chooses each one and why.
  • Query plan analysis: Reading EXPLAIN output in PostgreSQL or MySQL. Identifying sequential scans, index scans, bitmap scans, and costly operations.
  • Statistics and cost estimation: How the optimizer estimates row counts and costs. Why stale statistics lead to bad query plans.
  • Common performance pitfalls: Implicit type conversions that prevent index usage, functions in WHERE clauses, correlated subqueries, N+1 query patterns.
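The two most common join algorithms can be contrasted in a few lines (toy tables with invented data):

```python
from collections import defaultdict

users  = [(1, "ada"), (2, "bob"), (3, "eve")]   # (id, name)
orders = [(10, 1), (11, 3), (12, 1)]            # (order_id, user_id)

def nested_loop_join(left, right):
    """O(n * m): fine for tiny inputs, or when an index on the inner
    side turns the inner loop into a cheap lookup."""
    return [(oid, name) for oid, uid in left
                        for id_, name in right if uid == id_]

def hash_join(left, right):
    """Build a hash table on one side, probe with the other: O(n + m),
    the optimizer's usual pick for large equi-joins."""
    table = defaultdict(list)
    for id_, name in right:                      # build phase
        table[id_].append(name)
    return [(oid, name) for oid, uid in left     # probe phase
                        for name in table[uid]]

assert nested_loop_join(orders, users) == hash_join(orders, users)
print(hash_join(orders, users))  # [(10, 'ada'), (11, 'eve'), (12, 'ada')]
```

Merge joins are the third option: if both inputs are already sorted on the join key (say, by an index), the two sides can be zipped together in a single pass.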

Practice: take slow queries from a real application, run EXPLAIN ANALYZE, and rewrite them for better performance. This is the most practical skill in this entire guide.
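Since SQLite ships in Python's standard library, its EXPLAIN QUERY PLAN is a zero-setup way to start practicing before moving to PostgreSQL's EXPLAIN ANALYZE. A small sketch (exact plan wording varies across SQLite versions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO users (email) VALUES (?)",
                 [(f"u{i}@example.com",) for i in range(1000)])

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail).
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM users WHERE email = 'u5@example.com'"
before = plan(query)   # full table scan, e.g. "SCAN users"
conn.execute("CREATE INDEX idx_users_email ON users (email)")
after = plan(query)    # index search mentioning idx_users_email
print(before)
print(after)
```

The same before/after habit, adding an index and re-reading the plan, is what makes the EXPLAIN ANALYZE exercise above stick.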

Week 7-8: Transactions and Concurrency Control

Goal: Understand how databases provide ACID guarantees and handle concurrent access.

  • ACID properties: Atomicity (all or nothing), Consistency (invariants preserved), Isolation (concurrent transactions do not interfere), Durability (committed data survives crashes).
  • Isolation levels: Read Uncommitted, Read Committed, Repeatable Read, Serializable. What anomalies each level permits (dirty reads, non-repeatable reads, phantom reads) and the performance cost of stricter isolation.
  • MVCC (Multi-Version Concurrency Control): How PostgreSQL and MySQL allow readers and writers to proceed concurrently without blocking each other. How versions are created, tracked, and garbage-collected.
  • Locking: Row-level locks, table-level locks, advisory locks, deadlock detection. How lock contention manifests as performance degradation.
  • Write-ahead logging: How WAL ensures durability and enables crash recovery. Checkpoints and log truncation.
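A stripped-down MVCC sketch makes the mechanism concrete. It implements only snapshot reads and a first-committer-wins conflict check; real systems also track transaction status and garbage-collect old versions. Class and method names are invented:

```python
import itertools

class MiniMVCC:
    """Each write creates a new version stamped with a logical timestamp.
    A reader's snapshot is "the latest version with stamp <= my snapshot",
    so readers never block writers and vice versa."""
    def __init__(self):
        self.versions = {}             # key -> append-only list of (stamp, value)
        self.clock = itertools.count(1)

    def begin(self):
        return next(self.clock)        # snapshot = current logical time

    def read(self, key, snapshot):
        visible = [(t, v) for t, v in self.versions.get(key, []) if t <= snapshot]
        return max(visible)[1] if visible else None

    def write(self, key, value, snapshot):
        history = self.versions.setdefault(key, [])
        # First-committer-wins: refuse if someone wrote after our snapshot.
        if history and history[-1][0] > snapshot:
            raise RuntimeError("write conflict: key changed since snapshot")
        history.append((next(self.clock), value))

db = MiniMVCC()
t0 = db.begin()
db.write("x", "old", t0)
t1 = db.begin()                  # t1's snapshot is taken here
db.write("x", "new", db.begin()) # a later transaction overwrites x
print(db.read("x", t1))          # old -- t1 still sees its snapshot
print(db.read("x", db.begin()))  # new -- fresh snapshots see the update
```

The append-only version lists are also why vacuuming/purging exists: without garbage collection, dead versions accumulate forever.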

Connect this to distributed systems: understand how single-node transaction concepts extend (and break down) in distributed databases. Study distributed transactions and consensus algorithms for the distributed case.

Key Resources

Books:

  • Designing Data-Intensive Applications by Martin Kleppmann — chapters 3 (Storage), 5-7 (Replication, Partitioning, Transactions) are essential
  • Database Internals by Alex Petrov — the most focused treatment of storage engines and distributed database internals
  • High Performance MySQL by Baron Schwartz et al. — practical MySQL performance
  • The Internals of PostgreSQL by Hironobu Suzuki — detailed PostgreSQL-specific internals, freely available online

Courses:

  • CMU 15-445: Database Systems (Andy Pavlo) — excellent lectures freely available on YouTube
  • CMU 15-721: Advanced Database Systems — for deeper study

Blogs:

  • Use The Index, Luke (use-the-index-luke.com) — indexing explained brilliantly
  • Brandur's blog on PostgreSQL internals
  • CockroachDB blog on distributed database design

Practice Projects

  1. Build a simple key-value store: Implement a basic storage engine with an on-disk hash map or B-tree. Support GET, PUT, and DELETE operations. This is the most instructive project for understanding storage engines.

  2. Build an LSM-tree storage engine: Implement a memtable (sorted in memory), SSTable flushing (writing sorted runs to disk), and basic compaction. Add a Bloom filter for efficient negative lookups.

  3. Write a query optimizer for a subset of SQL: Build a simple SQL parser and query planner that can choose between sequential scan and index scan based on estimated selectivity. This demystifies the optimizer.

  4. Implement MVCC: Build a simple key-value store with multi-version concurrency control. Support snapshot reads and detect write conflicts. This teaches you how PostgreSQL and MySQL handle concurrent access.

  5. Performance audit a real database: Take a production database (or a sample dataset), identify the slowest queries, analyze their execution plans, add or modify indexes, and measure the improvement. Document every change and its impact.
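As a starting point for project 1, here is a minimal Bitcask-style store: an append-only log on disk plus an in-memory hash index from each key to its latest offset. A sketch only: no compaction, no checksums, no concurrency control.

```python
import os
import tempfile

class LogKV:
    """Append-only log + in-memory hash index. Every PUT appends a
    "key\\tvalue" record and remembers its byte offset; GET seeks
    straight to the latest record for the key."""
    TOMBSTONE = "\x00"  # sentinel value marking a deleted key

    def __init__(self, path):
        self.path = path
        self.index = {}  # key -> byte offset of latest record
        open(path, "a").close()
        self._rebuild()

    def _rebuild(self):
        # On startup, scan the whole log to rebuild the index.
        with open(self.path, "rb") as f:
            offset = 0
            for line in f:
                key, _, _ = line.decode().partition("\t")
                self.index[key] = offset
                offset += len(line)

    def put(self, key, value):
        with open(self.path, "ab") as f:  # "ab" positions at end of file
            self.index[key] = f.tell()
            f.write(f"{key}\t{value}\n".encode())

    def delete(self, key):
        self.put(key, self.TOMBSTONE)

    def get(self, key):
        if key not in self.index:
            return None
        with open(self.path, "rb") as f:
            f.seek(self.index[key])
            _, _, value = f.readline().decode().rstrip("\n").partition("\t")
            return None if value == self.TOMBSTONE else value

db = LogKV(os.path.join(tempfile.mkdtemp(), "data.log"))
db.put("name", "ada")
db.put("name", "lovelace")   # old record stays in the log; index moves on
db.put("tmp", "x")
db.delete("tmp")
print(db.get("name"))        # lovelace
reopened = LogKV(db.path)    # index rebuilt by scanning the log
print(reopened.get("name"))  # lovelace
```

From here, the natural extensions are exactly the topics above: compaction to reclaim dead records, and swapping the hash index for a B-tree to support range scans.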

How to Know You Are Ready

You understand database internals when you can:

  • Explain how a B-tree index speeds up lookups and what happens at the physical page level when you insert a new row
  • Read a query execution plan (EXPLAIN ANALYZE) and identify the bottleneck in a slow query
  • Choose between PostgreSQL, MySQL, Cassandra, DynamoDB, and MongoDB for a given workload and explain why in terms of storage engine trade-offs
  • Explain MVCC: how concurrent readers and writers proceed without blocking each other, how versions are tracked, and what happens during garbage collection
  • Design a database schema for a complex application with appropriate indexes, considering query patterns, write volumes, and storage efficiency
  • Diagnose lock contention, connection pool exhaustion, or buffer pool pressure in a production database

Next Steps

Our Advanced System Design cohort covers this and 11 other deep-dive topics with live sessions, assignments, and expert feedback.