Ceph storage clusters can scale to petabytes and beyond, but a client never writes an object straight onto a disk: several layers of indirection sit between your data and the devices that hold it.

Let’s see how this works with a simulated write operation. Imagine we have a client that wants to write a 4MB object, my_object.dat, to a Ceph cluster.

# On the Ceph client machine
ceph osd map my_pool my_object.dat

This command, ceph osd map, is the first step in understanding where data lands. It tells us which Placement Group (PG) the object belongs to and which OSDs (Object Storage Daemons) are responsible for that PG. Ceph doesn’t map objects directly to OSDs; it uses PGs as an intermediate layer.

Let’s say ceph osd map outputs:

osdmap e551 pool 'my_pool' (3) object 'my_object.dat' -> pg 3.123456789d68f9a7 -> up ([1,5,8], p1) acting ([1,5,8], p1)

This tells us that my_object.dat maps to PG 3.123456789d68f9a7. The acting list [1, 5, 8] indicates that OSDs 1, 5, and 8 are currently responsible for this PG. Ceph uses CRUSH (Controlled Replication Under Scalable Hashing) to deterministically map objects to PGs, and then PGs to OSDs. This ensures that even as OSDs are added or removed, data can be quickly remapped with minimal disruption.
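The object-to-PG step can be sketched in a few lines. This is an illustration only, not Ceph's actual code: real Ceph hashes the object name with the rjenkins hash and folds it with a "stable mod" rather than a plain modulo, and the function name here is hypothetical.

```python
# Simplified sketch of Ceph's object-to-PG mapping (illustration only).
import zlib

def object_to_pg(pool_id: int, object_name: str, pg_num: int) -> str:
    """Hash the object name and fold it into one of pg_num placement groups."""
    h = zlib.crc32(object_name.encode())  # stand-in for Ceph's rjenkins hash
    return f"{pool_id}.{h % pg_num:x}"

# The mapping is deterministic: the same object name always lands in the same PG,
# so any client can compute placement without asking the cluster.
assert object_to_pg(3, "my_object.dat", 128) == object_to_pg(3, "my_object.dat", 128)
print(object_to_pg(3, "my_object.dat", 128))
```

Determinism is the point: because every client computes the same answer from the same inputs, there is no central lookup table to query or keep in sync.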

The client’s next step is to determine which OSD is the "primary" for this PG. It doesn’t ask the cluster per-object; it fetches the current OSD map from the cluster’s monitors (MONs) and runs the CRUSH calculation itself. The MONs maintain the authoritative state of the cluster, including the OSD map, PG map, and CRUSH map. You can inspect a PG’s mapping directly:

# On the Ceph client machine, after knowing the PG ID (e.g., 3.123456789d68f9a7)
ceph pg map 3.123456789d68f9a7

The output lists the up and acting sets for the PG; the first OSD in the acting set is the primary. In our example that is OSD 1, so the client sends the write request directly to OSD 1.
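If you are scripting against this, extracting the acting set and primary is a small parsing exercise. The helper below is hypothetical (not part of any Ceph tooling) and simply pulls the acting set out of a `ceph osd map`-style line; for a replicated pool, the primary is the first OSD listed.

```python
# Hypothetical helper: extract the acting set from `ceph osd map` output.
import re

def parse_acting(osd_map_line: str) -> list[int]:
    """Return the acting set, e.g. [1, 5, 8], from a `ceph osd map` line."""
    match = re.search(r"acting\s*\(?\[([\d,\s]+)\]", osd_map_line)
    if match is None:
        raise ValueError("no acting set found in output")
    return [int(n) for n in match.group(1).split(",")]

line = ("osdmap e551 pool 'my_pool' object 'my_object.dat' "
        "-> pg 3.123456789d68f9a7 acting [1, 5, 8]")
acting = parse_acting(line)
primary = acting[0]  # first OSD in the acting set orchestrates the write
print(acting, primary)  # → [1, 5, 8] 1
```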

OSD 1, as the primary for PG 3.123456789d68f9a7, is responsible for orchestrating the replication of this object. It doesn’t just store the object and say "done." Instead, it forwards the data to the other OSDs in the acting set – OSD 5 and OSD 8 in our example. This is part of Ceph’s primary-based replication protocol.

The data travels from OSD 1 to OSD 5, and from OSD 1 to OSD 8. Each of these OSDs writes the object data to its local disk. Once OSD 5 and OSD 8 confirm they have successfully written the data, they acknowledge this back to OSD 1.

Only after OSD 1 receives acknowledgments from every replica in the acting set does it acknowledge the write operation back to the client; for a replicated pool, Ceph does not ack a write until all copies are durable. The pool’s size parameter controls how many copies exist, and min_size controls how many OSDs in the acting set must be up for the PG to accept I/O at all.
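The write path above can be modeled as a toy simulation. This is not Ceph code, just a sketch of the ordering guarantee: the primary persists locally, forwards to each replica, and acks the client only once every member of the acting set has confirmed.

```python
# Toy model of primary-based replication (not Ceph code).
def primary_write(acting: list[int], data: bytes, stores: dict[int, dict]) -> bool:
    """Return True (ack to client) only when all acting-set OSDs hold the data."""
    primary, replicas = acting[0], acting[1:]
    stores[primary]["my_object.dat"] = data  # primary writes to its local store
    acks = 1
    for osd in replicas:                     # primary forwards to each replica
        stores[osd]["my_object.dat"] = data
        acks += 1                            # replica confirms back to primary
    return acks == len(acting)               # ack client only when all confirmed

stores = {1: {}, 5: {}, 8: {}}               # one dict per OSD's local disk
assert primary_write([1, 5, 8], b"hello", stores)
assert all("my_object.dat" in s for s in stores.values())
```

The client sees a single round trip, but behind it the primary has serialized the fan-out and the acknowledgments.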

The my_pool in our example is a Ceph pool, which is a logical collection of PGs. Pools define properties like the replication size, the number of PGs, and the type of data (e.g., replicated or erasure-coded). When you create a Ceph cluster, you typically create one or more pools to store your data.

The CRUSH map is the secret sauce that makes Ceph scalable and resilient. It’s a hierarchical representation of your cluster’s hardware, allowing you to define rules for how data is placed. For instance, you can create rules to ensure replicas of a PG are placed on different hosts, racks, or even data centers, protecting against failures at various levels. You can inspect the CRUSH map with:

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

The crushmap.txt file shows the rules you’ve defined and the hierarchy of buckets (hosts, racks, and so on) that make up your cluster.
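The failure-domain idea behind those rules can be illustrated with a simplified placement function. The topology and OSD numbering below are made up for the example, and real CRUSH uses weighted pseudo-random selection over the bucket tree; the sketch only captures the constraint that each replica lands on a distinct host.

```python
# Simplified illustration of a CRUSH-style rule: one replica per host,
# so a single host failure cannot take out all copies. Topology is made up.
topology = {                 # host -> OSDs on that host
    "host-a": [0, 1, 2],
    "host-b": [3, 4, 5],
    "host-c": [6, 7, 8],
}

def place_replicas(pg_seed: int, replicas: int) -> list[int]:
    """Deterministically pick `replicas` OSDs, each on a distinct host."""
    chosen = []
    for i, (host, osds) in enumerate(sorted(topology.items())):
        if i >= replicas:
            break
        chosen.append(osds[pg_seed % len(osds)])  # pseudo-random but repeatable
    return chosen

placement = place_replicas(pg_seed=7, replicas=3)
hosts = {host for host, osds in topology.items()
         for osd in placement if osd in osds}
assert len(hosts) == 3       # every replica lives on a different host
print(placement)  # → [1, 4, 7]
```

Swap "host" for "rack" or "datacenter" in a real CRUSH rule and the same principle protects against correspondingly larger failure domains.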

The core idea here is that Ceph decouples data placement from specific storage devices. The Monitors provide the cluster state, CRUSH defines the placement logic, PGs act as logical shards of data, and OSDs are the actual storage daemons. This distributed, layered approach is what allows Ceph to scale so effectively.

What most people don’t realize is that the OSDs don’t just store data chunks; they actively coordinate replication, peering, and recovery among themselves (consensus proper, via Paxos, is the monitors’ job). The client writes to a primary OSD, but that primary doesn’t confirm the write until it has coordinated with the other OSDs in the PG’s acting set, ensuring replication happens reliably.

The next thing you’ll likely encounter is how Ceph handles object deletion, which involves a similar but reversed process of PGs tracking deletion requests.

Want structured learning?

Take the full Storage Systems course →