The most surprising thing about column-family stores is that they aren’t really "columnar" in the way you might expect from a relational database, and their "families" are more about grouping than strict relationships.

Let’s see this in action. Imagine we have a social media feed. A user posts an update, and we want to store that. In Cassandra, we might have a table like posts:

CREATE TABLE posts (
    user_id UUID,
    post_id TIMEUUID,
    content TEXT,
    timestamp TIMESTAMP,
    likes INT,
    PRIMARY KEY (user_id, post_id)
) WITH CLUSTERING ORDER BY (post_id DESC);

Here, user_id is our partition key, and post_id is a clustering key. When we insert a post:

INSERT INTO posts (user_id, post_id, content, timestamp, likes)
VALUES (uuid(), now(), 'My first post!', toTimestamp(now()), 0);

Conceptually, this isn’t stored as an independent table row. All rows that share a user_id form one partition, which lives together on the same replica nodes, and within that partition rows are kept sorted by post_id. Because post_id is a TIMEUUID clustered in descending order, a user’s most recent posts come first. If we query for a specific user’s posts:

SELECT * FROM posts WHERE user_id = <some_user_id>;

Cassandra will efficiently retrieve all posts for that user_id, ordered by post_id (most recent first).
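
To make the partition and clustering idea concrete, here is a minimal Java sketch. It is a toy in-memory model, not how Cassandra is implemented: the class name PostsPartitionSketch, the String user IDs, and the long post times are all assumptions made for illustration. It only shows the shape "partition key maps to rows sorted by clustering key, newest first".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model (hypothetical): partition key -> rows sorted by clustering key.
class PostsPartitionSketch {
    // user_id -> (post time, descending) -> content
    private final Map<String, TreeMap<Long, String>> partitions = new HashMap<>();

    void insert(String userId, long postTime, String content) {
        partitions
            .computeIfAbsent(userId, k -> new TreeMap<Long, String>(Comparator.reverseOrder()))
            .put(postTime, content);
    }

    // Similar in spirit to: SELECT * FROM posts WHERE user_id = ?
    // One partition lookup, then rows come back already sorted.
    List<String> postsFor(String userId) {
        TreeMap<Long, String> rows = partitions.get(userId);
        return rows == null ? new ArrayList<>() : new ArrayList<>(rows.values());
    }

    public static void main(String[] args) {
        PostsPartitionSketch store = new PostsPartitionSketch();
        store.insert("alice", 1L, "My first post!");
        store.insert("alice", 3L, "Third post");
        store.insert("alice", 2L, "Second post");
        store.insert("bob", 5L, "Hello from Bob");
        // prints [Third post, Second post, My first post!]
        System.out.println(store.postsFor("alice"));
    }
}
```

The sort order is fixed when the data is written, which is why the query needs no sorting step at read time.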

Now, let’s consider HBase. HBase is built on top of Hadoop’s HDFS and uses a similar concept. A table is a sparse, distributed, multi-dimensional map.

# HBase shell: create the table 'users' with two column families
create 'users', 'profile', 'activity'

Here, users is the table. profile and activity are column families. When we put data for a user (row key user123):

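// Assumes `table` is an open org.apache.hadoop.hbase.client.Table for "users".
Put put = new Put(Bytes.toBytes("user123")); // row key
// addColumn(family, qualifier, value): all three are byte arrays
put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("email"), Bytes.toBytes("alice@example.com"));
put.addColumn(Bytes.toBytes("activity"), Bytes.toBytes("last_login"), Bytes.toBytes(System.currentTimeMillis()));
table.put(put); // sends all three cells for this row in one call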

Every cell for user123 is addressed by row key, column family, and column qualifier. Within a row, data is organized by column family, and each family is stored in its own set of files. So name and email (in the profile family) are co-located, while last_login (in the activity family) is stored separately. This differs from a typical row-oriented relational database, where all columns of a row are stored together.
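
The "sparse, multi-dimensional map" can be pictured with nested sorted maps. The sketch below is a toy model only: SparseTableSketch and its methods are hypothetical names, values are plain strings rather than byte arrays, and the timestamp (version) dimension that real HBase cells carry is left out.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model (hypothetical): row key -> column family -> qualifier -> value.
class SparseTableSketch {
    private final TreeMap<String, TreeMap<String, TreeMap<String, String>>> rows = new TreeMap<>();

    void put(String rowKey, String family, String qualifier, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(family, k -> new TreeMap<>())
            .put(qualifier, value);
    }

    // Returns the cells of one family for a row; other families are not touched.
    Map<String, String> getFamily(String rowKey, String family) {
        TreeMap<String, TreeMap<String, String>> families = rows.get(rowKey);
        if (families == null || !families.containsKey(family)) {
            return new TreeMap<>();
        }
        return families.get(family);
    }

    public static void main(String[] args) {
        SparseTableSketch t = new SparseTableSketch();
        t.put("user123", "profile", "name", "Alice");
        t.put("user123", "profile", "email", "alice@example.com");
        t.put("user123", "activity", "last_login", "1700000000000");
        // user456 has no email and no activity: nothing is stored for missing cells.
        t.put("user456", "profile", "name", "Bob");
        System.out.println(t.getFamily("user123", "profile"));  // prints {email=alice@example.com, name=Alice}
        System.out.println(t.getFamily("user456", "activity")); // prints {}
    }
}
```

Because absent cells take no space, two rows in the same table can carry entirely different sets of qualifiers.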

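The core problem these systems solve is scaling to massive datasets and high throughput where traditional relational databases buckle. They achieve this through horizontal scaling and a flexible schema. In HBase, you can add new columns (qualifiers within a column family) on the fly without altering the table schema; only the column families themselves are declared up front. In Cassandra, adding a column to a CQL table takes an ALTER TABLE statement, but it is a metadata change that does not rewrite existing data. Either way, this is a huge advantage for evolving applications.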

Internally, Cassandra uses a Log-Structured Merge-Tree (LSM-tree) structure. Writes are appended to a commit log for durability and applied to an in-memory memtable. Periodically, the memtable is flushed to disk as an immutable SSTable. Reads consult the memtable and possibly multiple SSTables, merging results on the fly; the commit log is not read on this path and is only replayed after a crash to rebuild memtables. Because writes are sequential appends rather than in-place updates, inserts are very fast.
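
The write and read path can be sketched in a few lines of Java. This is a toy in-memory model under stated assumptions, not Cassandra’s implementation: LsmSketch, memtableLimit, and the use of plain lists and maps in place of on-disk files are all hypothetical, and deletes (tombstones) and compaction are omitted.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy LSM-tree model (hypothetical): commit log + memtable + immutable "SSTables".
class LsmSketch {
    private final int memtableLimit;
    private final List<String> commitLog = new ArrayList<>();              // append-only, used only for recovery
    private TreeMap<String, String> memtable = new TreeMap<>();            // sorted in-memory buffer
    private final List<Map<String, String>> sstables = new ArrayList<>();  // oldest first, never modified

    LsmSketch(int memtableLimit) {
        this.memtableLimit = memtableLimit;
    }

    void write(String key, String value) {
        commitLog.add(key + "=" + value); // 1. append for durability
        memtable.put(key, value);         // 2. apply in memory
        if (memtable.size() >= memtableLimit) {
            flush();
        }
    }

    private void flush() {
        sstables.add(Collections.unmodifiableMap(memtable)); // becomes an immutable sorted "file"
        memtable = new TreeMap<>();
        commitLog.clear(); // flushed data no longer needs the log
    }

    // Newest data wins: check the memtable first, then SSTables from newest to oldest.
    String read(String key) {
        if (memtable.containsKey(key)) {
            return memtable.get(key);
        }
        for (int i = sstables.size() - 1; i >= 0; i--) {
            String value = sstables.get(i).get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    int sstableCount() {
        return sstables.size();
    }

    public static void main(String[] args) {
        LsmSketch db = new LsmSketch(2);
        db.write("a", "1");
        db.write("b", "2"); // memtable reaches the limit and is flushed
        db.write("a", "3"); // newer value for "a" lives in the new memtable
        System.out.println(db.read("a"));        // prints 3
        System.out.println(db.read("b"));        // prints 2
        System.out.println(db.sstableCount());   // prints 1
    }
}
```

Note that an overwrite never touches the older SSTable; the stale value simply stays shadowed until files are merged.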

HBase follows the same LSM pattern, built on HDFS. Each region (a contiguous range of the table’s rows) holds one store per column family, and each store has a MemStore (in-memory writes) and HFiles (immutable on-disk data files). Writes go to the WAL (Write-Ahead Log) and the MemStore. When a MemStore fills up, it’s flushed to HDFS as a new HFile. Compactions merge these HFiles to reclaim space and keep reads fast.
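
The effect of a compaction can be illustrated with a small Java sketch. It is deliberately simplified and hypothetical: CompactionSketch and compact are made-up names, each "file" is just an in-memory map, and real compactions stream a sorted merge over files on disk and also handle deletes and cell versions, which are ignored here.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy compaction (hypothetical): merge several immutable files into one sorted file.
class CompactionSketch {
    // Files are passed oldest first, so values from later files override earlier ones.
    static TreeMap<String, String> compact(List<Map<String, String>> filesOldestFirst) {
        TreeMap<String, String> merged = new TreeMap<>();
        for (Map<String, String> file : filesOldestFirst) {
            merged.putAll(file);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> older = Map.of("a", "1", "b", "2");
        Map<String, String> newer = Map.of("a", "3", "c", "4");
        // prints {a=3, b=2, c=4}
        System.out.println(compact(List.of(older, newer)));
    }
}
```

After the merge, a read needs to check one file instead of two, and the stale value of "a" no longer takes up space.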

In HBase, the column family is a crucial organizational unit. It’s not just a logical grouping; it’s a physical one: each family’s data is stored in its own HFiles. (In Cassandra, "column family" is the older name for what CQL calls a table.) All columns within a single column family for a given row are stored together, so reads that touch columns from one family are faster. Conversely, if a family holds many columns, some of which are rarely accessed, fetching other columns from that family can incur read overhead for the unused ones. This is why careful design of column families is critical for performance. Group columns that are read together into the same family and move rarely read columns into a separate one, but keep the total small: the HBase documentation recommends no more than a few column families per table, because flushes and compactions are carried out per region and every extra family adds to that work.

The next concept you’ll grapple with is how these systems handle data consistency and availability in a distributed environment.
