Two-phase commit (2PC) is a distributed algorithm that ensures transactional integrity across multiple, independent systems, but its strict atomicity often becomes a bottleneck, leading developers to seek alternatives like the Saga pattern.

Let’s see 2PC in action with a simplified, conceptual example. Imagine we have two services: OrderService and PaymentService. When a customer places an order, we need to ensure both the order is created and the payment is processed, or neither happens.

Here’s a high-level look at the 2PC flow:

  1. Client Request: A client initiates an order.
  2. Coordinator (e.g., OrderService): The OrderService acts as the coordinator.
    • It sends a "prepare" request to PaymentService. This asks PaymentService if it can complete its part of the transaction.
    • PaymentService attempts to process the payment. If successful, it locks the necessary resources and responds "yes" (prepared). If it fails, it responds "no" (aborted).
  3. Decision:
    • If PaymentService responded "yes", OrderService creates the order and then sends a "commit" request to PaymentService. PaymentService then finalizes the payment and releases its locks.
    • If PaymentService responded "no", OrderService sends an "abort" request to PaymentService. PaymentService then releases its locks and undoes any partial work.

The key here is that no system commits until all systems have agreed to commit.

This strict atomicity is what 2PC guarantees. It’s like a bank transfer: the money leaves one account and arrives in another, and the system ensures that either both steps happen, or neither does. No money is lost or duplicated.

However, the "prepare" phase involves locking resources. In a distributed system, these locks can be held for a significant amount of time, blocking other operations. If the coordinator fails after some participants have prepared but before the commit/abort decision is sent, those resources remain locked indefinitely until the coordinator recovers or a manual intervention occurs. This is the Achilles’ heel of 2PC – it’s too rigid and prone to blocking.

This is where the Saga pattern emerges. Instead of a single, atomic transaction, a saga breaks a large transaction into a sequence of smaller, local transactions. Each local transaction updates its own database and then triggers the next local transaction in the sequence.

Consider the same order processing scenario with Sagas.

  1. Order Created: OrderService creates the order and publishes an OrderCreated event.
  2. Payment Processed: PaymentService listens for OrderCreated. It processes the payment and publishes a PaymentProcessed event.
  3. Inventory Reserved: InventoryService listens for PaymentProcessed. It reserves the inventory and publishes an InventoryReserved event.

Each of these is a local transaction within its respective service. If any step fails, the saga executes compensating transactions to undo the work already done.

For example, if InventoryService fails to reserve inventory:

  1. InventoryService publishes an InventoryReservationFailed event.
  2. PaymentService listens for InventoryReservationFailed and executes its compensating transaction: a refund. It then publishes a PaymentRefunded event.
  3. OrderService listens for PaymentRefunded and executes its compensating transaction: cancelling the order.

The Saga pattern offers higher availability because it doesn’t involve long-held locks across services. Each service commits its local transaction independently. The trade-off is that it’s not truly atomic; you have a period where the overall business operation is in an inconsistent state (e.g., order created, payment processed, but inventory not yet reserved). You also need to design and implement these compensating actions, which can be complex.

A common Saga implementation is the "Choreography" approach, where services communicate via events. Each service is responsible for listening to events and deciding what to do next, including triggering compensating actions. This is what we illustrated above. The alternative is "Orchestration," where a dedicated orchestrator service manages the sequence of local transactions and compensating actions.

The most surprising thing about Sagas is that they don’t guarantee atomicity in the traditional sense; instead, they guarantee eventual consistency through a series of idempotent local transactions and their compensating counterparts. This means that even if messages are delivered multiple times, the system will reach the same final state.

Let’s look at a simplified event handler for the PaymentService in a choreography-based saga:

// PaymentService - OrderCreated Event Handler
{
  "eventType": "OrderCreated",
  "payload": {
    "orderId": "ORD-12345",
    "customerId": "CUST-67890",
    "amount": 100.00
  },
  "sagaId": "SAGA-ABCDEF"
}

// --- PaymentService Logic ---
// 1. Check if this sagaId has already been processed to ensure idempotency.
// 2. Attempt to charge customer CUST-67890 for 100.00.
// 3. If successful:
//    Publish { "eventType": "PaymentProcessed", "payload": { "orderId": "ORD-12345", "paymentId": "PAY-54321" }, "sagaId": "SAGA-ABCDEF" }
// 4. If failed (e.g., insufficient funds):
//    Publish { "eventType": "PaymentFailed", "payload": { "orderId": "ORD-12345", "reason": "Insufficient Funds" }, "sagaId": "SAGA-ABCDEF" }

The sagaId is crucial for managing the overall business transaction and ensuring that each step is executed only once, even if events are replayed.

The real complexity often lies in designing the compensating transactions. A compensating transaction must be the exact inverse of the original transaction. For instance, if the original transaction was "add item to cart," the compensating transaction is "remove item from cart." If the original was "charge credit card," the compensation is "issue refund." Sometimes, a direct inverse isn’t possible, and a different action must be taken to achieve the desired end state.

The next challenge you’ll face is managing the lifecycle of sagas, especially when dealing with long-running processes and ensuring that compensating actions themselves don’t fail.

Want structured learning?

Take the full System Design course →