Distributed filesystems don’t just let you store more data than fits on a single machine; they fundamentally change how you reason about data availability and performance.

Let’s look at HDFS, GlusterFS, and CephFS in action. Imagine we’re setting up a small cluster for storing large video files.

HDFS (Hadoop Distributed File System)

HDFS is built for massive, write-once, read-many workloads. Think of it as optimized for appending to huge files, not for frequent small writes or random access.

  • How it works: A NameNode manages the filesystem namespace (directories, file names) and the mapping of file blocks to DataNodes. DataNodes store the actual file blocks. When you write a file, it’s broken into blocks (default 128MB), and these blocks are replicated across multiple DataNodes for fault tolerance. Reads go through the NameNode to find block locations, then directly to DataNodes.

  • Configuration Example (Simplified hdfs-site.xml):

    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///data/hadoop/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///data/hadoop/datanode</value>
    </property>
    

    Here, dfs.replication of 3 means every block will be copied to three different DataNodes. dfs.namenode.name.dir and dfs.datanode.data.dir specify where the NameNode and DataNodes store their critical metadata and actual data, respectively.

  • Use Case: Storing massive datasets for big data analytics (like processing logs with Spark or MapReduce).
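
To see blocks and replication in practice, a quick session with the hdfs CLI might look like the following (the directory and file names are illustrative, and a running cluster is assumed):

```shell
# Create a directory and upload a large video file; HDFS splits it into
# 128 MB blocks, each replicated to three DataNodes per dfs.replication.
hdfs dfs -mkdir -p /videos
hdfs dfs -put big_video.mp4 /videos/

# Override the replication factor for this one file (wait until done).
hdfs dfs -setrep -w 2 /videos/big_video.mp4

# Inspect block locations and replication health for the file.
hdfs fsck /videos/big_video.mp4 -files -blocks -locations
```

The fsck output lists every block of the file together with the DataNodes holding each replica, which is a handy way to confirm that dfs.replication is actually being honored.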

GlusterFS

GlusterFS is a clustered filesystem that aggregates disk storage resources from multiple servers into a single global namespace. It’s known for its flexibility and ease of setup for scale-out NAS.

  • How it works: GlusterFS uses a "translator" architecture. You define "bricks" (which are just directories on individual servers) and then combine them into "volumes" using various translators. For example, a distributed volume spreads whole files across bricks, while a replicated volume keeps identical copies of each file on multiple bricks. A stripe translator breaks files into chunks and distributes them across bricks for higher throughput, though striped volumes are deprecated in recent GlusterFS releases in favor of sharding.

  • Configuration Example (Setting up a replicated volume): On server server1 and server2, you’d have directories like /gluster/brick1. Then, on any client or server, you’d create the volume:

    gluster volume create my_replicated_volume replica 2 server1:/gluster/brick1 server2:/gluster/brick1 force
    gluster volume start my_replicated_volume
    

    This creates a volume named my_replicated_volume that replicates data across server1:/gluster/brick1 and server2:/gluster/brick1; replica 2 means two copies of every file. The force flag is only needed when a brick sits on the root partition, which GlusterFS otherwise refuses for safety.

  • Use Case: Centralized storage for virtual machines, media repositories, or general-purpose file sharing where high availability and scalability are key.
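
Once the volume is started, clients access it through the native FUSE client. A minimal sketch, assuming the servers and mount point from the example above (the file name is illustrative):

```shell
# Mount the replicated volume via the GlusterFS FUSE client; any server
# in the trusted pool can be named here.
sudo mkdir -p /mnt/gluster
sudo mount -t glusterfs server1:/my_replicated_volume /mnt/gluster

# A file written here is stored on both bricks.
cp big_video.mp4 /mnt/gluster/

# Check volume details and self-heal state of the replicas.
gluster volume info my_replicated_volume
gluster volume heal my_replicated_volume info
```

Because the client mounts the volume rather than a single server, it learns the full brick layout at mount time and keeps working if one replica goes down.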

CephFS

CephFS is a POSIX-compliant distributed filesystem built on top of Ceph’s distributed object store (RADOS). It’s designed for performance, scalability, and extreme durability.

  • How it works: Ceph uses a distributed map (CRUSH algorithm) to determine where data objects reside, eliminating a single point of metadata bottleneck. It has separate components: Object Storage Daemons (OSDs) for storing data, Monitors (MONs) for cluster state, and Metadata Servers (MDSs) for managing the filesystem namespace and caching. Files are broken into objects, and these objects are distributed and replicated across OSDs.

  • Configuration Example (Mounting a CephFS filesystem): Assuming you have a running Ceph cluster with a CephFS volume already created. Note that the kernel client's secretfile option expects a file containing only the base64 key for the user (extract it with ceph auth get-key client.admin > /path/to/admin.secret), not the full ceph.client.admin.keyring:

    sudo mount -t ceph {ceph-monitor-ip-1}:6789,{ceph-monitor-ip-2}:6789:/ {mount-point} -o name=admin,secretfile=/path/to/admin.secret
    

    This command mounts the CephFS filesystem from your Ceph monitors (e.g., 192.168.1.10:6789,192.168.1.11:6789) onto a local directory ({mount-point}), authenticating as the admin user with the extracted secret key.

  • Use Case: High-performance, scalable storage for cloud platforms (like OpenStack), big data, and general-purpose file serving requiring strong consistency and POSIX compliance.
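
Before the mount above will succeed, the cluster needs an active MDS and a filesystem. On clusters running recent Ceph releases, that setup can be sketched as follows (the filesystem name is illustrative):

```shell
# Create a CephFS volume; on orchestrator-managed clusters this also
# creates the backing data/metadata pools and deploys an MDS.
ceph fs volume create myfs

# Verify that the filesystem exists and an MDS is active.
ceph fs ls
ceph mds stat

# Overall cluster health, including MON, OSD, and MDS state.
ceph -s
```

On older or manually managed clusters you would instead create the two pools and the filesystem explicitly, but the volume command is the simpler path where it is available.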

The most surprising thing about these systems is how they handle metadata. HDFS funnels every metadata operation through a centralized NameNode (a potential bottleneck and, without HA, a single point of failure). GlusterFS has no dedicated metadata server at all: file placement is computed with an elastic hashing algorithm, and metadata lives alongside the files on the bricks. CephFS uses dedicated Metadata Servers (MDS) for the filesystem namespace, which can be scaled out, while the CRUSH algorithm handles data placement across OSDs, giving it the most dynamic and performant metadata layer of the three.

When you’re working with distributed filesystems, understanding the trade-offs between consistency, availability, and performance for your specific workload is paramount. Each system offers different tuning knobs, from replication factors and block sizes in HDFS to volume types and striping in GlusterFS, and RADOS pool configurations in Ceph.
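
To make those knobs concrete, here is one example per system; the values and the cephfs_data pool name are illustrative:

```shell
# HDFS: write a file with a 256 MB block size instead of the 128 MB default.
hdfs dfs -D dfs.blocksize=268435456 -put big_video.mp4 /videos/

# GlusterFS: tune a per-volume option, e.g. the read cache size.
gluster volume set my_replicated_volume performance.cache-size 256MB

# Ceph: set the replica count on the RADOS pool backing CephFS data.
ceph osd pool set cephfs_data size 3
```

None of these require downtime, which is part of the appeal: replication and caching behavior can be adjusted per file, per volume, or per pool as the workload evolves.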

The next challenge you’ll likely encounter is managing the network traffic and latency between nodes, especially for write-heavy or latency-sensitive applications.
