HDFS is fundamentally a distributed filesystem designed for storing massive datasets across clusters of commodity hardware, built to withstand hardware failures.

Let’s see it in action. Imagine you have a cluster with a few DataNodes. You want to store a large log file, say weblogs.txt, which is 5GB in size.

# From any machine with HDFS client access
hdfs dfs -mkdir /user/hadoop/logs
hdfs dfs -put weblogs.txt /user/hadoop/logs/

When you run hdfs dfs -put, HDFS doesn’t just copy the file. It splits weblogs.txt into fixed-size chunks called blocks, typically 128MB (the default since Hadoop 2, configurable via dfs.blocksize) or 256MB each. For our 5GB file, that works out to exactly 40 blocks (5120MB / 128MB per block).
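The block arithmetic can be sketched in a few lines of plain Python. This is illustrative only, not HDFS code; the sizes match the example above:

```python
# Sketch: how a file's size maps to HDFS blocks.
# Assumes the 128MB default block size discussed above.
import math

BLOCK_SIZE_MB = 128          # dfs.blocksize default in Hadoop 2.x+
file_size_mb = 5 * 1024      # our 5GB weblogs.txt, in MB

num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
last_block_mb = file_size_mb - (num_blocks - 1) * BLOCK_SIZE_MB

print(num_blocks)     # 40
print(last_block_mb)  # 128 -- the last block happens to be full here
```

Note that the last block of a file is usually smaller than the block size; it only occupies as much disk as it actually contains.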

The NameNode, a single master process, keeps track of where each block is located. It doesn’t store the data itself, just the metadata: the file’s name, its directory structure, permissions, and crucially, which DataNodes are storing each block of the file.
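To make that concrete, here is a toy sketch (plain Python, not actual NameNode internals) of the two mappings the NameNode keeps in memory. The block IDs and DataNode names are invented for illustration:

```python
# Toy model of NameNode metadata:
#   namespace:       file path -> ordered list of block IDs
#   block_locations: block ID  -> DataNodes holding replicas
namespace = {
    "/user/hadoop/logs/weblogs.txt": ["blk_1001", "blk_1002", "blk_1003"],
}
block_locations = {
    "blk_1001": ["datanode-1", "datanode-2", "datanode-3"],
    "blk_1002": ["datanode-2", "datanode-3", "datanode-4"],
    "blk_1003": ["datanode-1", "datanode-3", "datanode-4"],
}

def locate(path):
    """Return (block_id, replica_locations) pairs for a file."""
    return [(b, block_locations[b]) for b in namespace[path]]

for block, nodes in locate("/user/hadoop/logs/weblogs.txt"):
    print(block, nodes)
```

The key point the sketch captures: none of the actual file bytes live here, only the bookkeeping needed to find them.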

The DataNodes are the workhorses. They store the actual data blocks on their local disks. When you put a file, the client talks to the NameNode to get the list of DataNodes for the first block. Then, it streams the block data directly to one DataNode, which in turn forwards it to another, and so on, creating a replication pipeline. By default, each block is replicated 3 times across different DataNodes for fault tolerance.
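The write pipeline can be simulated in a few lines. This is a deliberately simplified sketch, assuming the chain of DataNodes has already been chosen by the NameNode; real HDFS streams the block in small packets and acknowledges back up the chain:

```python
# Simplified simulation of the replication pipeline: each DataNode in
# the chain persists the block, then forwards it to the next node.
def pipeline_write(block_data, pipeline):
    """Stream one block through a chain of DataNodes; return what each stored."""
    stored = {}
    for node in pipeline:        # first node receives from the client,
        stored[node] = block_data  # stores locally, forwards to the next
    return stored

replicas = pipeline_write(b"...block bytes...",
                          ["datanode-1", "datanode-2", "datanode-3"])
print(len(replicas))  # 3 copies, the default replication factor
```

The pipelined design matters: the client only uploads the block once, and the DataNodes shoulder the cost of making the second and third copies.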

If a DataNode fails, the NameNode detects this (through heartbeats) and notices that some blocks are now under-replicated. It then instructs other DataNodes to copy the missing blocks from their existing replicas, ensuring the replication factor of 3 is maintained. When you get a file, the client asks the NameNode for the block locations and can then download blocks in parallel from multiple DataNodes, speeding up retrieval.
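The re-replication logic boils down to a set comparison. Here is a hedged sketch of that bookkeeping (node and block names are hypothetical; real HDFS also rate-limits and prioritizes these copies):

```python
# Sketch: after a DataNode dies, find under-replicated blocks and
# schedule a new copy on a surviving node that lacks the block.
REPLICATION = 3

block_locations = {
    "blk_1001": {"datanode-1", "datanode-2", "datanode-3"},
    "blk_1002": {"datanode-2", "datanode-3", "datanode-4"},
}
live_nodes = {"datanode-1", "datanode-2", "datanode-4"}  # datanode-3 died

for block, replicas in block_locations.items():
    alive = replicas & live_nodes            # replicas that survived
    if len(alive) < REPLICATION:
        # pick a live node that doesn't already hold this block
        target = sorted(live_nodes - alive)[0]
        block_locations[block] = alive | {target}

print(sorted(block_locations["blk_1001"]))
```

Notice that the data for the new copy comes from a surviving replica, not from the client; the original writer is long gone by the time repairs happen.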

The core problem HDFS solves is scaling storage and providing high throughput for large files, while abstracting away the complexity of distributed storage and hardware failures. It’s optimized for sequential reads of large files, not for low-latency random access like a traditional database.

The NameNode’s metadata lives entirely in memory for fast access, which means its capacity is bounded by the RAM of the machine it runs on: the number of files and blocks it can manage is a direct function of available heap. Sizing that heap for the expected total of files and blocks is a common operational bottleneck.
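A back-of-the-envelope heap estimate makes the constraint tangible. A commonly cited rule of thumb is roughly 150 bytes of NameNode heap per namespace object (a file entry or a block); treat that constant as an approximation, not a specification:

```python
# Rough NameNode heap estimate, assuming ~150 bytes per namespace
# object (file or block). The constant is a rule of thumb, not a spec.
BYTES_PER_OBJECT = 150

def estimate_heap_gb(num_files, blocks_per_file):
    objects = num_files * (1 + blocks_per_file)   # file entry + its blocks
    return objects * BYTES_PER_OBJECT / 1024**3

# e.g. 100 million files averaging 2 blocks each
print(round(estimate_heap_gb(100_000_000, 2), 1))  # about 41.9 GB
```

Even at this crude level of accuracy, it's clear why a namespace of hundreds of millions of objects demands a seriously provisioned NameNode.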

The next concept you’ll likely encounter is how HDFS handles small files, which can be inefficient due to the NameNode’s overhead for each block.
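A quick illustration of why small files hurt: every file costs at least one namespace object plus one block object, regardless of its size. Compare 1GB stored as a single file versus as 1024 files of 1MB each (numbers invented for the comparison):

```python
# Namespace cost of the same 1GB of data, stored two ways.
BLOCK_SIZE_MB = 128

def namespace_objects(num_files, file_size_mb):
    blocks_per_file = max(1, -(-file_size_mb // BLOCK_SIZE_MB))  # ceiling div
    return num_files * (1 + blocks_per_file)   # 1 file entry + its blocks

print(namespace_objects(1, 1024))    # one 1GB file: 9 objects
print(namespace_objects(1024, 1))    # 1024 files of 1MB: 2048 objects
```

Same data, over 200x the NameNode overhead, which is exactly the inefficiency the next concept addresses.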
