Two-phase commit (2PC) is a distributed algorithm that ensures transactional integrity across multiple, independent systems, but its strict atomicity often becomes a bottleneck, leading developers to seek alternatives like the Saga pattern.
Let’s see 2PC in action with a simplified, conceptual example. Imagine we have two services: OrderService and PaymentService. When a customer places an order, we need to ensure both the order is created and the payment is processed, or neither happens.
Here’s a high-level look at the 2PC flow:
- Client Request: A client initiates an order.
- Coordinator (e.g., OrderService): The
OrderServiceacts as the coordinator.- It sends a "prepare" request to
PaymentService. This asksPaymentServiceif it can complete its part of the transaction. PaymentServiceattempts to process the payment. If successful, it locks the necessary resources and responds "yes" (prepared). If it fails, it responds "no" (aborted).
- It sends a "prepare" request to
- Decision:
- If
PaymentServiceresponded "yes",OrderServicecreates the order and then sends a "commit" request toPaymentService.PaymentServicethen finalizes the payment and releases its locks. - If
PaymentServiceresponded "no",OrderServicesends an "abort" request toPaymentService.PaymentServicethen releases its locks and undoes any partial work.
- If
The key here is that no system commits until all systems have agreed to commit.
This strict atomicity is what 2PC guarantees. It’s like a bank transfer: the money leaves one account and arrives in another, and the system ensures that either both steps happen, or neither does. No money is lost or duplicated.
However, the "prepare" phase involves locking resources. In a distributed system, these locks can be held for a significant amount of time, blocking other operations. If the coordinator fails after some participants have prepared but before the commit/abort decision is sent, those resources remain locked indefinitely until the coordinator recovers or a manual intervention occurs. This is the Achilles’ heel of 2PC – it’s too rigid and prone to blocking.
This is where the Saga pattern emerges. Instead of a single, atomic transaction, a saga breaks a large transaction into a sequence of smaller, local transactions. Each local transaction updates its own database and then triggers the next local transaction in the sequence.
Consider the same order processing scenario with Sagas.
- Order Created:
OrderServicecreates the order and publishes anOrderCreatedevent. - Payment Processed:
PaymentServicelistens forOrderCreated. It processes the payment and publishes aPaymentProcessedevent. - Inventory Reserved:
InventoryServicelistens forPaymentProcessed. It reserves the inventory and publishes anInventoryReservedevent.
Each of these is a local transaction within its respective service. If any step fails, the saga executes compensating transactions to undo the work already done.
For example, if InventoryService fails to reserve inventory:
InventoryServicepublishes anInventoryReservationFailedevent.PaymentServicelistens forInventoryReservationFailedand executes its compensating transaction: a refund. It then publishes aPaymentRefundedevent.OrderServicelistens forPaymentRefundedand executes its compensating transaction: cancelling the order.
The Saga pattern offers higher availability because it doesn’t involve long-held locks across services. Each service commits its local transaction independently. The trade-off is that it’s not truly atomic; you have a period where the overall business operation is in an inconsistent state (e.g., order created, payment processed, but inventory not yet reserved). You also need to design and implement these compensating actions, which can be complex.
A common Saga implementation is the "Choreography" approach, where services communicate via events. Each service is responsible for listening to events and deciding what to do next, including triggering compensating actions. This is what we illustrated above. The alternative is "Orchestration," where a dedicated orchestrator service manages the sequence of local transactions and compensating actions.
The most surprising thing about Sagas is that they don’t guarantee atomicity in the traditional sense; instead, they guarantee eventual consistency through a series of idempotent local transactions and their compensating counterparts. This means that even if messages are delivered multiple times, the system will reach the same final state.
Let’s look at a simplified event handler for the PaymentService in a choreography-based saga:
// PaymentService - OrderCreated Event Handler
{
"eventType": "OrderCreated",
"payload": {
"orderId": "ORD-12345",
"customerId": "CUST-67890",
"amount": 100.00
},
"sagaId": "SAGA-ABCDEF"
}
// --- PaymentService Logic ---
// 1. Check if this sagaId has already been processed to ensure idempotency.
// 2. Attempt to charge customer CUST-67890 for 100.00.
// 3. If successful:
// Publish { "eventType": "PaymentProcessed", "payload": { "orderId": "ORD-12345", "paymentId": "PAY-54321" }, "sagaId": "SAGA-ABCDEF" }
// 4. If failed (e.g., insufficient funds):
// Publish { "eventType": "PaymentFailed", "payload": { "orderId": "ORD-12345", "reason": "Insufficient Funds" }, "sagaId": "SAGA-ABCDEF" }
The sagaId is crucial for managing the overall business transaction and ensuring that each step is executed only once, even if events are replayed.
The real complexity often lies in designing the compensating transactions. A compensating transaction must be the exact inverse of the original transaction. For instance, if the original transaction was "add item to cart," the compensating transaction is "remove item from cart." If the original was "charge credit card," the compensation is "issue refund." Sometimes, a direct inverse isn’t possible, and a different action must be taken to achieve the desired end state.
The next challenge you’ll face is managing the lifecycle of sagas, especially when dealing with long-running processes and ensuring that compensating actions themselves don’t fail.