Distributed Consistency: Strong, Eventual, Causal Explained

The CAP theorem is often misunderstood as a simple "pick two out of three" choice. In practice, network partitions cannot be ruled out in a distributed system, so the real choice is narrower: when a partition occurs, the system must give up either consistency or availability.

Let’s see what this looks like. Imagine we have two database nodes, Node A and Node B, acting as replicas for a piece of data.

{
  "id": "user:123",
  "name": "Alice",
  "email": "alice@example.com"
}

We want to update Alice’s email.

Scenario 1: Strong Consistency (CP)

A client sends an update request to Node A: UPDATE user:123 SET email='alice.new@example.com'.

Node A receives the request.
Node A writes the new email to its local storage.
Node A then sends the update to Node B.
Node A waits for Node B to acknowledge that it has also successfully written the new email.
Only after Node B acknowledges does Node A respond to the client with success.

Now, if a client immediately queries for user:123 from Node B, they will get the new email. If a network partition occurs between Node A and Node B before Node B acknowledges, Node A cannot confirm that both replicas hold the write, so it returns an error to the client (and must not serve its own unconfirmed local write as committed), signaling that consistency could not be guaranteed across all replicas. The system chose Consistency and Partition Tolerance (CP) over Availability during the partition.
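The synchronous flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Replica`, `write_sync`, and the `reachable` flag are hypothetical names, the "network" is a method call, and rolling back Node A's local write on failure is one possible way to keep the replicas from diverging, not something the scenario prescribes.

```python
class Replica:
    """Hypothetical in-memory replica; `reachable` simulates a partition."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.reachable = True

    def apply(self, key, value):
        # Stand-in for a network call that fails during a partition.
        if not self.reachable:
            raise ConnectionError(f"{self.name} is unreachable")
        self.data[key] = value


def write_sync(primary, follower, key, value):
    """Report success only after both replicas hold the new value."""
    had_key = key in primary.data
    previous = primary.data.get(key)
    primary.data[key] = value
    try:
        follower.apply(key, value)  # block until the follower acknowledges
    except ConnectionError:
        # Assumption: undo the local write so the replicas stay identical.
        if had_key:
            primary.data[key] = previous
        else:
            del primary.data[key]
        return False
    return True


node_a, node_b = Replica("A"), Replica("B")
ok = write_sync(node_a, node_b, "user:123", "alice.new@example.com")

node_b.reachable = False  # simulate a partition between A and B
rejected = write_sync(node_a, node_b, "user:123", "alice.newer@example.com")
```

The first write succeeds on both nodes. The second is refused during the simulated partition, and both replicas still agree on the earlier value: consistency is kept at the price of availability.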

Scenario 2: Eventual Consistency (AP)

A client sends the same update request to Node A.

Node A receives the request.
Node A writes the new email to its local storage and immediately responds to the client with success.
Asynchronously, Node A sends the update to Node B.

If a network partition happens after Node A responds to the client but before the update reaches Node B, Node B still has the old email. If a client queries Node B during this partition, they will get the old email, while a client querying Node A would get the new one. This is a temporary inconsistency, but the system remained available. Once the partition heals, Node B eventually receives the update from Node A and the replicas converge. The system chose Availability and Partition Tolerance (AP) over immediate Consistency during the partition.
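A minimal sketch of the asynchronous variant, again with hypothetical names (`Node`, `outbox`, `replicate_to`) and with the partition modeled as a boolean argument rather than a real network failure:

```python
from collections import deque


class Node:
    """Hypothetical replica that acknowledges writes before replicating."""

    def __init__(self):
        self.data = {}
        self.outbox = deque()  # updates not yet delivered to the peer

    def write(self, key, value):
        # Apply locally, queue for replication, acknowledge immediately.
        self.data[key] = value
        self.outbox.append((key, value))
        return True

    def replicate_to(self, peer, partitioned):
        # During a partition nothing is delivered; updates stay queued.
        if partitioned:
            return 0
        delivered = 0
        while self.outbox:
            key, value = self.outbox.popleft()
            peer.data[key] = value
            delivered += 1
        return delivered


node_a, node_b = Node(), Node()
node_a.data["user:123"] = "alice@example.com"
node_b.data["user:123"] = "alice@example.com"

acked = node_a.write("user:123", "alice.new@example.com")
node_a.replicate_to(node_b, partitioned=True)
stale = node_b.data["user:123"]  # Node B still serves the old email

node_a.replicate_to(node_b, partitioned=False)  # partition heals
converged = node_b.data["user:123"]
```

The client gets its acknowledgment right away, Node B serves stale data while the partition lasts, and the two replicas agree again once the queued update is delivered.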

The CAP theorem states that in the face of a network partition (P), a distributed system must choose between Consistency (C) and Availability (A). It’s not a static choice; it’s a decision made at the time of a partition or a failure. Many modern distributed systems, especially those prioritizing high availability and low latency for many users, lean towards AP.

Here’s how this looks in Apache Cassandra, a popular AP-leaning database. The consistency level is not a cassandra.yaml setting; the client chooses it per request, through the driver or with the CONSISTENCY command in a cqlsh session:

-- cqlsh session
CONSISTENCY ONE

In Cassandra, ONE means a read or write request only needs to be acknowledged by a single replica. This ensures that even if other replicas are down or unreachable, the request can succeed, thus prioritizing availability. If you were to use CONSISTENCY QUORUM instead, you would be demanding acknowledgments from a majority of replicas. Used for both reads and writes, that leans more towards CP, but at the cost of availability if a quorum can’t be met.
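The arithmetic behind this trade-off can be made concrete. The sketch below uses the common overlap rule: with N replicas, a read that contacts R of them is guaranteed to include at least one replica from the latest write's W acknowledgments only when R + W > N. The function names are hypothetical, and the model deliberately ignores repair mechanisms and timestamp resolution in a real cluster.

```python
def quorum(replication_factor):
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def read_overlaps_write(n, w, r):
    """True if every read set must intersect the latest write set."""
    return r + w > n


def tolerated_failures(n, required_acks):
    """Replicas that can be unreachable while requests still succeed."""
    return n - required_acks


n = 3
q = quorum(n)

one_one = read_overlaps_write(n, w=1, r=1)        # write ONE, read ONE
quorum_quorum = read_overlaps_write(n, w=q, r=q)  # write QUORUM, read QUORUM

down_ok_at_one = tolerated_failures(n, 1)
down_ok_at_quorum = tolerated_failures(n, q)
```

With three replicas, ONE keeps accepting requests with two replicas unreachable but gives no overlap guarantee, while QUORUM on both reads and writes guarantees overlap and tolerates only one unreachable replica.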

The real mental model shift is understanding what "Consistency" means in CAP. It is a strong guarantee: once a write completes, every subsequent read, from any node, returns that value or a newer one, as if there were only a single copy of the data. An AP system relaxes this guarantee, so replicas may disagree for a while before converging. The system doesn’t crash; it just allows for temporary divergence. The trade-off is that your application logic must be able to handle reading stale data or resolving conflicts that arise from concurrent writes on different sides of a partition.

What most people miss is that the "P" (Partition Tolerance) is non-negotiable in any distributed system that spans multiple network segments. Since you must tolerate partitions, the choice is always between C and A when a partition occurs. Many systems default to AP because the business requirement for being "up" is usually more critical than guaranteeing that every single read reflects the absolute latest write, especially when that guarantee might cause the service to be unavailable. The sophistication lies in how the system manages the reconciliation process when partitions heal and how applications are designed to cope with eventual consistency.

The next problem you’ll grapple with is how to handle data conflicts when they inevitably arise in an AP system, often leading to the exploration of conflict resolution strategies like Last-Write-Wins or CRDTs.
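As a taste of what that looks like, here is a minimal Last-Write-Wins sketch. The `Versioned` type and `lww_merge` function are hypothetical, timestamps are assumed to come from roughly synchronized clocks, and the node ID is used only to break ties deterministically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    """A value tagged with the time and node of the write (hypothetical)."""
    value: str
    timestamp: float
    node_id: str


def lww_merge(a, b):
    """Keep the write with the higher timestamp; break ties by node ID."""
    return max(a, b, key=lambda v: (v.timestamp, v.node_id))


# Two clients updated the same record on different sides of a partition.
on_a = Versioned("alice.new@example.com", timestamp=100.0, node_id="A")
on_b = Versioned("alice.work@example.com", timestamp=105.0, node_id="B")

# The merge gives the same result whichever replica runs it.
winner = lww_merge(on_a, on_b)
same_winner = lww_merge(on_b, on_a)
```

Both replicas settle on the write made on Node B. Note the cost: the losing write is silently discarded, which is one reason to consider CRDTs when concurrent updates must all be preserved.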
