System Design: Distributed File System (HDFS-style)
Design a distributed file system like HDFS that stores petabytes of data across commodity hardware with fault tolerance. Covers NameNode architecture, block replication, rack awareness, and MapReduce integration.
Requirements
Functional Requirements:
- Store large files (hundreds of MB to TB) across a cluster of commodity nodes
- Read and write files in a streaming fashion (optimized for batch, not random access)
- Fault tolerance — survive node failures without data loss
- Support MapReduce-style data locality (move computation to data)
- Namespace operations: mkdir, rename, delete, list directory
- Replication factor configurable per file (default 3)
Non-Functional Requirements:
- Sustain 10 GB/s aggregate read throughput across 1,000 nodes
- Store 100 PB total
- Tolerate the simultaneous failure of any 2 nodes (or a full rack) without data loss
- Namespace operations complete in under 10ms
- 99.9% availability for reads
Scale Estimation
HDFS uses a 128 MB default block size. A 1 TB file is split into 8,192 blocks. With 3x replication, physical storage is 3 TB per 1 TB file. For 100 PB of raw data with 3x replication, total physical storage is 300 PB. Assuming 12 TB of usable disk per DataNode, storing 300 PB requires 25,000 DataNodes. The NameNode holds metadata for all blocks in memory: 100 PB / 128 MB ≈ 800 million blocks, each requiring ~150 bytes of metadata (block ID, block locations, replication state), for a total of ~120 GB of NameNode memory. A high-memory server (256 GB RAM) can handle this. HDFS Federation (multiple NameNodes, each owning a namespace subtree) scales beyond single-NameNode limits.
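The arithmetic can be sanity-checked in a few lines (a rough sketch in decimal units; the 150-bytes-per-block figure is the usual rule of thumb rather than an exact number):

```python
# Reproducing the back-of-envelope numbers above (decimal units, approximate).
BLOCK_SIZE = 128 * 10**6            # 128 MB default block size
REPLICATION = 3
RAW_DATA = 100 * 10**15             # 100 PB of logical data
DISK_PER_NODE = 12 * 10**12         # assume 12 TB of usable disk per DataNode
META_PER_BLOCK = 150                # ~150 bytes of NameNode heap per block

blocks = RAW_DATA // BLOCK_SIZE                  # ≈ 780 million blocks
physical = RAW_DATA * REPLICATION                # 300 PB after 3x replication
datanodes = physical // DISK_PER_NODE            # 25,000 DataNodes
namenode_heap = blocks * META_PER_BLOCK          # ≈ 117 GB of metadata

print(f"{blocks:,} blocks, {datanodes:,} DataNodes, "
      f"{namenode_heap / 10**9:.0f} GB NameNode heap")
```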
High-Level Architecture
HDFS follows a master-worker architecture. A single NameNode (or a Standby NameNode for HA) manages the filesystem namespace (directory tree and file-to-block mappings) entirely in memory. DataNodes store the actual block data on local disks. Clients communicate with the NameNode for metadata operations and directly with DataNodes for data transfer. This separation is key: the NameNode handles only metadata (fast, memory-resident), while DataNodes handle bulk data I/O (distributed, parallel).
For a file write, the client splits the file into 128 MB blocks. For each block, the NameNode allocates 3 DataNodes using a rack-aware placement policy: first replica on a random node in the client's rack, second replica on a node in a different rack, third replica on a different node in the same rack as the second. This ensures at least one replica survives a full rack failure. The client writes blocks in a streaming pipeline: data flows from client → DataNode 1 → DataNode 2 → DataNode 3, with each node acknowledging back up the chain. The NameNode is not in the data path — it only records block locations after all 3 replicas confirm success.
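As a toy illustration of the pipeline (the dict-based nodes and recursive forwarding are purely for exposition, not the real DataNode transfer protocol), data is pushed only to the first DataNode, and the acknowledgement for a packet reaches the client once every downstream replica has stored it:

```python
def forward(packet: bytes, pipeline: list) -> bool:
    """Simulate one packet flowing client -> DN1 -> DN2 -> DN3.

    Each DataNode buffers the packet locally, forwards it downstream, and
    only acks upstream once the rest of the chain has acked. A real client
    keeps many packets in flight instead of waiting on each one.
    """
    if not pipeline:
        return True                                  # end of chain: ack flows back
    head, *rest = pipeline
    head.setdefault("stored", []).append(packet)     # persist on this DataNode
    return forward(packet, rest)                     # downstream ack == our ack

# One pipeline chosen by the NameNode: local rack, then two nodes on a remote rack.
pipeline = [{"id": "dn-rack1-a"}, {"id": "dn-rack2-a"}, {"id": "dn-rack2-b"}]
assert forward(b"64KB packet payload", pipeline)
```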
For a file read, the client asks the NameNode for block locations. The NameNode returns the list of DataNodes holding each block, sorted by proximity to the client (same rack first, then same data center, then remote). The client reads directly from the closest DataNode per block, parallelizing across multiple blocks for throughput. If a DataNode fails during read, the client fails over to the next replica transparently.
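A sketch of the client-side read with transparent failover; fetch here is a hypothetical callable standing in for whatever actually streams a block from a DataNode:

```python
def read_block(block_id: int, replicas: list, fetch) -> bytes:
    """Try replicas in the proximity order returned by the NameNode
    (closest first); fail over to the next replica on any error."""
    last_err = None
    for datanode in replicas:
        try:
            return fetch(datanode, block_id)
        except IOError as err:        # dead or slow node: move on to the next replica
            last_err = err
    raise IOError(f"all replicas failed for block {block_id}") from last_err
```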
Core Components
NameNode (Namespace Master)
The NameNode maintains two in-memory structures: the filesystem namespace (the directory tree with file attributes and the ordered block IDs of each file) and the BlockMap (block ID → list of DataNode locations). The namespace is periodically persisted to disk as the FsImage, and an EditLog (write-ahead log) records every namespace mutation (create, delete, rename) made since the last checkpoint. On restart, the NameNode loads the latest FsImage and replays the EditLog against it. A Secondary NameNode (or the Standby NameNode in HA mode) periodically merges the EditLog into a new FsImage checkpoint, preventing the EditLog from growing without bound. In HA mode, the Active and Standby NameNodes share the EditLog via the Quorum Journal Manager (three or more JournalNodes running a Paxos-like quorum protocol).
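The journal-plus-checkpoint recovery scheme can be illustrated with a toy namespace, where a dict stands in for the inode tree and JSON stands in for the FsImage format (the real EditLog is a binary stream of FSEditLogOp records):

```python
import json

class ToyNamespace:
    """Write-ahead EditLog plus periodic FsImage checkpoints, in miniature."""

    def __init__(self):
        self.tree = {}          # path -> attributes (stands in for the inode tree)
        self.edits = []         # append-only mutation log (the "EditLog")

    def apply(self, op, path, attrs=None):
        self.edits.append((op, path, attrs))     # 1. journal the mutation first
        if op == "create":
            self.tree[path] = attrs              # 2. then update the in-memory tree
        elif op == "delete":
            self.tree.pop(path, None)

    def checkpoint(self):
        fsimage = json.dumps(self.tree)          # serialized snapshot (the "FsImage")
        self.edits = []                          # EditLog can now be truncated
        return fsimage

    @classmethod
    def restart(cls, fsimage, edits_since_checkpoint):
        ns = cls()
        ns.tree = json.loads(fsimage)            # load the last checkpoint
        for op, path, attrs in edits_since_checkpoint:
            ns.apply(op, path, attrs)            # replay mutations made since
        return ns
```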
DataNode (Block Store)
Each DataNode stores blocks as files in the local OS filesystem (ext4 or xfs). Blocks are stored in a flat directory structure (multiple configured directories, one per disk) using the block ID as filename. On startup, each DataNode sends a Block Report to the NameNode listing all blocks it holds, allowing the NameNode to reconstruct the BlockMap after restarts. DataNodes send Heartbeats every 3 seconds to the NameNode; if the NameNode doesn't receive a heartbeat for 10 minutes, it marks the node as dead and schedules re-replication of its blocks. Corrupt blocks are detected via per-block CRC checksums stored alongside each block file.
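The liveness tracking boils down to a timestamp table; a minimal sketch using the default intervals quoted above:

```python
import time

HEARTBEAT_INTERVAL_S = 3        # DataNodes heartbeat every 3 seconds
DEAD_AFTER_S = 10 * 60          # no heartbeat for ~10 minutes -> declared dead

last_seen = {}                  # datanode_id -> timestamp of last heartbeat

def on_heartbeat(datanode_id: str) -> None:
    last_seen[datanode_id] = time.time()

def dead_datanodes(now=None) -> list:
    """Nodes whose blocks the NameNode must now re-replicate elsewhere."""
    now = time.time() if now is None else now
    return [dn for dn, ts in last_seen.items() if now - ts > DEAD_AFTER_S]
```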
Rack-Aware Replication
Rack awareness is critical for balancing fault tolerance against write performance. The default (2+1) rack placement policy stores 2 replicas on nodes in one rack and 1 replica in another. This tolerates any single rack failure (at least one replica survives in the other rack) while keeping 2 of the 3 write-pipeline hops within a single rack, so only one hop crosses racks. This matters because intra-rack bandwidth (typically 10–40 Gbps) is higher than effective cross-rack bandwidth (often 1–10 Gbps once the rack uplink is oversubscribed). Rack topology is supplied by an administrator-configured topology script or a static mapping file (Hadoop's DNSToSwitchMapping interface).
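A simplified placement function under this 2+1 policy might look like the following (the rack map is a plain dict here; the real default policy also weighs node load and remaining disk space):

```python
import random

def choose_replicas(nodes_by_rack: dict, writer_rack: str) -> list:
    """Pick 3 DataNodes: one in the writer's rack, two on a single other rack."""
    first = random.choice(nodes_by_rack[writer_rack])
    remote_rack = random.choice([r for r in nodes_by_rack if r != writer_rack])
    second, third = random.sample(nodes_by_rack[remote_rack], 2)
    return [first, second, third]

cluster = {"rack1": ["dn1", "dn2", "dn3"], "rack2": ["dn4", "dn5", "dn6"]}
print(choose_replicas(cluster, "rack1"))   # e.g. ['dn2', 'dn6', 'dn4']
```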
Database Design
The NameNode stores its metadata in memory-resident data structures: a hash map of inode_id → INodeFile (block list, permissions, replication factor, timestamps) and a hash map of block_id → BlockInfo (DataNode list, replication state). Persistence is via the FsImage (a serialized protobuf snapshot) plus the EditLog (an append-only sequence of FSEditLogOp records). Both are written to multiple local disks (and optionally to NFS for disaster recovery). HDFS Encryption Zones keep their encryption keys in the Hadoop Key Management Server (KMS), a separate service backed by a secure key store (e.g., HashiCorp Vault).
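Illustratively, the two maps and the fields they carry might be sketched like this (field names are simplified, not the actual INodeFile/BlockInfo layouts):

```python
from dataclasses import dataclass, field

@dataclass
class BlockInfo:                      # entry in the BlockMap
    block_id: int
    locations: list                   # DataNode ids currently holding a replica
    target_replication: int = 3

@dataclass
class INodeFile:                      # entry in the namespace
    inode_id: int
    path: str
    permissions: str
    replication: int
    mtime: float
    block_ids: list = field(default_factory=list)   # ordered blocks of the file

namespace: dict = {}                  # inode_id -> INodeFile
block_map: dict = {}                  # block_id -> BlockInfo
```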
API Design
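At the client level the surface area is small: streaming create/open calls plus the namespace operations listed in the requirements. A sketch of such an interface, with method names that are illustrative rather than the exact Hadoop FileSystem signatures:

```python
class DfsClient:
    """Illustrative client API: streaming I/O plus namespace operations."""

    def create(self, path: str, replication: int = 3,
               block_size: int = 128 * 2**20):
        """Create a new file for writing; returns a write stream."""

    def open(self, path: str):
        """Open an existing file for sequential reading; returns a read stream."""

    def mkdir(self, path: str) -> bool: ...
    def rename(self, src: str, dst: str) -> bool: ...
    def delete(self, path: str, recursive: bool = False) -> bool: ...

    def list_dir(self, path: str) -> list:
        """Return size, replication, and mtime for each entry under `path`."""
```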
Scaling & Bottlenecks
The NameNode is the single most significant bottleneck in classic HDFS. With all filesystem metadata in memory on one server, the maximum namespace size is bounded by NameNode RAM (typically 256–512 GB, supporting ~1 billion files). HDFS Federation addresses this by partitioning the namespace into multiple independent NameNode instances, each owning a different namespace subtree. A router-based federation (RBF) provides a unified namespace view to clients while routing operations to the correct NameNode. This enables near-linear namespace scaling by adding more NameNodes.
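Conceptually, the RBF routing layer is a mount table resolved by longest path prefix; a toy version (the paths and NameNode names are invented for illustration):

```python
# Toy mount table: the longest matching path prefix decides which NameNode owns it.
MOUNT_TABLE = {
    "/user":      "namenode-1",
    "/logs":      "namenode-2",
    "/warehouse": "namenode-3",
}

def route(path: str) -> str:
    matches = [p for p in MOUNT_TABLE if path == p or path.startswith(p + "/")]
    if not matches:
        raise KeyError(f"no NameNode owns {path}")
    return MOUNT_TABLE[max(matches, key=len)]      # longest prefix wins

assert route("/logs/2024/app.log") == "namenode-2"
```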
Small-file problem: HDFS is optimized for large files. Storing millions of small files (< 1 MB each) consumes NameNode memory (one inode plus one block object per file) while holding very little data. A 1 KB file still costs roughly 300 bytes of NameNode metadata (~150 bytes each for its inode and its block entry), so 100 million such files burn ~30 GB of NameNode heap while storing only ~100 GB of data, a terrible metadata-to-data ratio. Solutions: HAR files (Hadoop Archives, which pack many small files into one large archive); SequenceFiles (a container format packing many key-value records into large blocks); HBase (built on HDFS but managing its own file layout to support random access).
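The metadata-to-data ratio that makes small files painful is easy to quantify using the same rough 150-bytes-per-object rule of thumb:

```python
META_PER_OBJECT = 150                  # ~bytes of NameNode heap per inode or block
num_files = 100_000_000                # 100 million 1 KB files
data_bytes = num_files * 1_000         # ~100 GB of actual data
heap_bytes = num_files * 2 * META_PER_OBJECT   # one inode + one block per file

print(f"data: {data_bytes / 10**9:.0f} GB, "
      f"NameNode heap: {heap_bytes / 10**9:.0f} GB")
# -> data: 100 GB, NameNode heap: 30 GB (vs. a negligible fraction for 128 MB blocks)
```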
Key Trade-offs
- Replication factor vs. storage overhead: 3x replication provides strong fault tolerance but costs 3x storage; erasure coding (HDFS-EC, e.g., RS-6-3) reduces overhead to 1.5x at the cost of reconstruction CPU and higher read latency for degraded reads (the overhead arithmetic is sketched after this list)
- Block size vs. NameNode scalability: Larger blocks (128 MB or 256 MB) mean fewer blocks per file, reducing NameNode memory usage, but fewer blocks also means fewer splits and less parallelism when processing mid-sized files
- HA NameNode complexity vs. single NameNode simplicity: HA with QJM eliminates the single point of failure but requires 3 Journal Nodes, ZKFC (ZooKeeper Failover Controller), and careful fencing configuration
- Data locality vs. load balancing: Always reading from the closest replica maximizes throughput but can overload hot DataNodes holding popular blocks; a load-aware replica selection policy balances both
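The storage-overhead comparison in the first trade-off reduces to simple arithmetic:

```python
# Physical bytes per logical byte, and how many simultaneous losses each survives.
replication_3x = 3 / 1         # 3.0x overhead, survives loss of any 2 replicas
rs_6_3 = (6 + 3) / 6           # 1.5x overhead, survives loss of any 3 of 9 stripes
print(replication_3x, rs_6_3)  # 3.0 1.5
```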