Zipkin, the distributed tracing system, is fundamentally about making the invisible visible. The surprising part is how little instrumentation you actually need to gain massive insight into your system.
Let’s watch Zipkin in action. Imagine a simple e-commerce request: a user clicks "Add to Cart." This single user action triggers a cascade of microservice calls.
First, the frontend (e.g., a React app) makes an API call to the cart-service.
// Frontend component
async function addToCart(itemId, quantity) {
  const response = await fetch('/api/cart/add', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ itemId, quantity }),
  });
  const data = await response.json();
  // Update UI
  return data;
}
The cart-service then needs to check inventory and potentially update the user’s session. It might call an inventory-service and a user-session-service.
// Cart Service (Spring Boot)
@RestController
@RequestMapping("/api/cart")
public class CartController {

    private final InventoryServiceClient inventoryServiceClient;
    private final UserSessionServiceClient userSessionServiceClient;

    public CartController(InventoryServiceClient inventoryServiceClient,
                          UserSessionServiceClient userSessionServiceClient) {
        this.inventoryServiceClient = inventoryServiceClient;
        this.userSessionServiceClient = userSessionServiceClient;
    }

    @PostMapping("/add")
    public ResponseEntity<CartResponse> addItem(@RequestBody AddItemRequest request) {
        // In a real app, you'd get the user ID from the security context
        String userId = "user123";

        // Check inventory (this call will be traced)
        InventoryStatus inventory = inventoryServiceClient.checkAvailability(request.getItemId());
        if (!inventory.isAvailable()) {
            return ResponseEntity.badRequest().body(new CartResponse("Item out of stock"));
        }

        // Update session (this call will be traced)
        userSessionServiceClient.addItemToSession(userId, request.getItemId(), request.getQuantity());
        return ResponseEntity.ok(new CartResponse("Item added to cart"));
    }
}
The inventory-service might query a database.
# Inventory Service (Flask)
from flask import Flask, jsonify
import time  # Simulate work

app = Flask(__name__)

@app.route('/inventory/<item_id>')
def check_availability(item_id):
    # Simulate database lookup
    time.sleep(0.05)
    # In a real scenario, this would query a DB
    available = True  # Default to available for demo
    if item_id == "nonexistent":
        available = False
    return jsonify({"itemId": item_id, "isAvailable": available})

if __name__ == '__main__':
    app.run(port=5002)
And the user-session-service might update a cache.
// User Session Service (Gin)
package main

import (
	"net/http"
	"time"

	"github.com/gin-gonic/gin"
)

func main() {
	router := gin.Default()
	router.POST("/session/add", addItemToSession)
	router.Run(":5003")
}

func addItemToSession(c *gin.Context) {
	var request struct {
		UserID   string `json:"userId"`
		ItemID   string `json:"itemId"`
		Quantity int    `json:"quantity"`
	}
	if err := c.ShouldBindJSON(&request); err != nil {
		c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
		return
	}
	// Simulate cache update
	time.Sleep(30 * time.Millisecond)
	c.JSON(http.StatusOK, gin.H{"message": "Item added to session"})
}
Zipkin’s magic lies in its ability to connect these disparate calls into a single, coherent timeline. It achieves this by propagating a trace ID and span ID across network requests. When cart-service calls inventory-service, it includes these IDs in the HTTP headers. The inventory-service then creates its own spans, referencing the parent span ID from the incoming request.
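Instrumentation libraries such as Brave handle this propagation automatically, but the mechanics are simple enough to sketch by hand. This minimal Python sketch (the helper names are illustrative, not from any real library) shows how a child span inherits its parent's trace ID and carries the parent's span ID downstream:

```python
import secrets

def make_trace_context(parent=None):
    """Create trace context for a new span. If a parent context exists,
    reuse its trace ID; otherwise start a brand-new trace."""
    return {
        "trace_id": parent["trace_id"] if parent else secrets.token_hex(8),
        "span_id": secrets.token_hex(8),
        "parent_span_id": parent["span_id"] if parent else None,
    }

def inject_b3_headers(ctx, headers):
    """Copy trace context into outgoing HTTP headers (B3 propagation)."""
    headers["X-B3-TraceId"] = ctx["trace_id"]
    headers["X-B3-SpanId"] = ctx["span_id"]
    if ctx["parent_span_id"]:
        headers["X-B3-ParentSpanId"] = ctx["parent_span_id"]
    headers["X-B3-Sampled"] = "1"
    return headers

# cart-service receives a request (root span), creates a child span for
# the inventory-service call, and propagates the context downstream.
root = make_trace_context()
child = make_trace_context(parent=root)
outgoing = inject_b3_headers(child, {})
```

Every span in the cascade carries the same trace ID, which is what lets Zipkin stitch them back into one timeline.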
The core problem Zipkin solves is the "black box" nature of distributed systems. When a request fails or is slow, pinpointing the bottleneck across multiple services is incredibly difficult without a way to see the entire journey. Zipkin provides this visibility by allowing you to visualize the flow of requests as a tree of spans.
Here’s the mental model:
- Trace: Represents a single end-to-end request (e.g., the "Add to Cart" action). All spans within a trace share the same trace ID.
- Span: Represents a single unit of work within a trace (e.g., the cart-service processing the request, the call to inventory-service, the database query). Each span has a unique span ID and a parent span ID (unless it's the root span). Spans capture the start time, duration, service name, operation name (like "add item" or "check inventory"), and any relevant tags or logs.
- Instrumentation: This is the code that adds tracing logic to your services. It generates spans, injects trace context (IDs) into outgoing requests, and extracts trace context from incoming requests. Libraries like Brave (Java), OpenTelemetry (multi-language), and instrumented HTTP clients/servers handle this.
- Reporter: Spans are sent by services to a Zipkin collector, usually via HTTP or Kafka.
- Collector: Receives spans, validates them, and stores them.
- Storage: Zipkin typically uses Cassandra or Elasticsearch for storing trace data.
- UI: A web interface to query and visualize traces.
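Concretely, a reporter ships finished spans to the collector as JSON. Here is a sketch of a single span in Zipkin's v2 format (the IDs, timestamps, and tag values are made up for illustration):

```python
import json

# A single span in Zipkin's v2 JSON format, as a reporter would POST it
# to the collector at /api/v2/spans. Timestamps are epoch microseconds.
span = {
    "traceId": "a1b2c3d4e5f67890",       # shared by every span in the trace
    "id": "1234567890abcdef",            # unique to this span
    "parentId": "fedcba0987654321",      # omitted on the root span
    "name": "check inventory",           # the operation
    "kind": "CLIENT",                    # CLIENT, SERVER, PRODUCER, or CONSUMER
    "timestamp": 1700000000000000,
    "duration": 52000,                   # 52 ms, in microseconds
    "localEndpoint": {"serviceName": "cart-service"},
    "tags": {"http.path": "/inventory/item42"},
}
payload = json.dumps([span])  # the collector accepts a JSON array of spans
```

The UI's dependency diagram and timeline views are rendered entirely from records like this one.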
The key to making this work is context propagation. When cart-service makes an HTTP request to inventory-service, it must add specific headers:
# Example headers added by the tracing instrumentation
X-B3-TraceId: a1b2c3d4e5f67890
X-B3-SpanId: 1234567890abcdef
X-B3-Sampled: 1
The receiving service (inventory-service) reads these headers to establish the parent-child relationship for its own spans. This is why you often see libraries like Brave or the OpenTelemetry SDKs configured to inject and extract these B3 headers.
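On the receiving side, extraction is the mirror image of injection. This sketch (the function name is illustrative) captures what a server-side tracing filter does with those headers:

```python
def extract_b3_context(headers):
    """Read B3 headers from an incoming request into a trace context.
    Header lookup is case-insensitive, as HTTP header names are."""
    h = {k.lower(): v for k, v in headers.items()}
    trace_id = h.get("x-b3-traceid")
    if trace_id is None:
        return None  # no trace context: this service starts a new trace
    return {
        "trace_id": trace_id,
        # the caller's span ID becomes the parent of our server span
        "parent_span_id": h.get("x-b3-spanid"),
        "sampled": h.get("x-b3-sampled") == "1",
    }

# Headers as sent by cart-service's instrumentation
incoming = {
    "X-B3-TraceId": "a1b2c3d4e5f67890",
    "X-B3-SpanId": "1234567890abcdef",
    "X-B3-Sampled": "1",
}
ctx = extract_b3_context(incoming)
```

Note how the caller's span ID becomes the parent span ID on this side, which is precisely what builds the tree structure in the UI.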
What most people don’t realize is that you don’t need to instrument every single line of code. For many services, instrumenting the incoming request handler and outgoing HTTP client calls is sufficient. The tracing libraries automatically create spans for the request duration and for each downstream HTTP call. You then add custom spans or tags only for critical business logic within a service that you want to isolate.
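A hand-rolled custom span is little more than a timer plus metadata. This sketch (not a real tracing API, just an illustration of the shape that Brave and OpenTelemetry wrap in their span builders) shows how you might isolate a piece of business logic:

```python
import time
from contextlib import contextmanager

@contextmanager
def custom_span(name, spans, tags=None):
    """Record a hand-rolled span around a block of business logic:
    start a timer, run the work, finish with a duration and tags."""
    start = time.time()
    try:
        yield
    finally:
        spans.append({
            "name": name,
            "timestamp": int(start * 1_000_000),  # epoch microseconds
            "duration": int((time.time() - start) * 1_000_000),
            "tags": tags or {},
        })

spans = []  # a real reporter would ship these to the collector
with custom_span("price calculation", spans, tags={"cart.items": "3"}):
    time.sleep(0.01)  # stand-in for the business logic being isolated
```

Everything else in the trace comes for free from the request-handler and HTTP-client instrumentation.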
Once set up, you can query Zipkin’s UI (typically running on port 9411) for traces involving your services. You’ll see a timeline showing the duration of each service call and the dependencies between them. This allows you to quickly identify which service is contributing the most latency or is failing.
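The UI is backed by a REST API that you can query directly, which is handy for scripting latency checks. A sketch of building such a query, assuming Zipkin's default port and its standard query parameters (serviceName, minDuration in microseconds, limit):

```python
from urllib.parse import urlencode

def build_trace_query(base="http://localhost:9411", service=None,
                      min_duration_us=None, limit=10):
    """Build a query URL against Zipkin's trace-search API (the same
    API the UI uses). minDuration is in microseconds, so 500000
    finds calls slower than 500 ms."""
    params = {"limit": limit}
    if service:
        params["serviceName"] = service
    if min_duration_us:
        params["minDuration"] = min_duration_us
    return f"{base}/api/v2/traces?{urlencode(params)}"

url = build_trace_query(service="cart-service", min_duration_us=500_000)
# fetch with urllib.request.urlopen(url) against a running Zipkin instance
```

The response is a JSON array of traces, each an array of spans in the v2 format shown earlier.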
After you’ve successfully set up Zipkin and are seeing traces, the next challenge is understanding how to effectively sample your traces to manage the volume of data without losing critical information.
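As a preview of that challenge: the most common strategy is head-based probabilistic sampling, where the decision is made once at the trace root and propagated downstream in X-B3-Sampled so every service agrees. A quick sketch:

```python
import random

def head_sample(rate):
    """Head-based sampling: decide once, at the trace root, whether this
    trace is recorded. The decision rides along in X-B3-Sampled so the
    whole trace is either kept end to end or dropped end to end."""
    return random.random() < rate

# At a 10% sampling rate, roughly 1 in 10 traces is recorded.
decisions = [head_sample(0.1) for _ in range(10_000)]
sampled_fraction = sum(decisions) / len(decisions)
```

Partial traces are worse than no traces, which is why the sampled flag propagates rather than each service deciding independently.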