Metrics and traces are often talked about together, but they answer fundamentally different questions about your system's health.

Let’s look at metrics first. Imagine you’re running a busy coffee shop. Metrics are like the quick glance at your cash register totals at the end of the day, or the number of customers served in the last hour. They give you a summary of what’s been happening over a period of time.

Here’s a simple Go program that exposes some basic HTTP metrics:

package main

import (
	"math/rand"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	httpRequests = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "myapp_http_requests_total",
		Help: "Total number of HTTP requests received.",
	}, []string{"method", "path"})

	httpRequestDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name: "myapp_http_request_duration_seconds",
		Help: "Histogram of latencies for HTTP requests.",
		Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
	}, []string{"method", "path"})
)

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		// Simulate some work: sleep for a random duration up to one second.
		time.Sleep(time.Duration(rand.Float64() * float64(time.Second)))

		httpRequests.WithLabelValues(r.Method, r.URL.Path).Inc()
		httpRequestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(time.Since(start).Seconds())

		w.Write([]byte("Hello, world!"))
	})

	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}

When this runs, you can hit http://localhost:8080/ a few times and then fetch http://localhost:8080/metrics, either with curl or by pointing a Prometheus server at it. You'll see output like this:

# HELP myapp_http_requests_total Total number of HTTP requests received.
# TYPE myapp_http_requests_total counter
myapp_http_requests_total{method="GET",path="/"} 15
# HELP myapp_http_request_duration_seconds Histogram of latencies for HTTP requests.
# TYPE myapp_http_request_duration_seconds histogram
myapp_http_request_duration_seconds_bucket{le="0.005",method="GET",path="/"} 0
myapp_http_request_duration_seconds_bucket{le="0.01",method="GET",path="/"} 0
myapp_http_request_duration_seconds_bucket{le="0.025",method="GET",path="/"} 0
myapp_http_request_duration_seconds_bucket{le="0.05",method="GET",path="/"} 0
myapp_http_request_duration_seconds_bucket{le="0.1",method="GET",path="/"} 0
myapp_http_request_duration_seconds_bucket{le="0.25",method="GET",path="/"} 0
myapp_http_request_duration_seconds_bucket{le="0.5",method="GET",path="/"} 10
myapp_http_request_duration_seconds_bucket{le="1",method="GET",path="/"} 15
myapp_http_request_duration_seconds_bucket{le="2.5",method="GET",path="/"} 15
myapp_http_request_duration_seconds_bucket{le="5",method="GET",path="/"} 15
myapp_http_request_duration_seconds_bucket{le="10",method="GET",path="/"} 15
myapp_http_request_duration_seconds_bucket{le="+Inf",method="GET",path="/"} 15
myapp_http_request_duration_seconds_count{method="GET",path="/"} 15
myapp_http_request_duration_seconds_sum{method="GET",path="/"} 7.5

This tells you that 15 GET requests came in for the root path, and how their durations are distributed across the buckets. From these numbers you can derive the request rate, the average duration (sum divided by count: 7.5s / 15 = 0.5s), or an estimate of the 95th percentile. Metrics are great for understanding overall system health, spotting trends, and alerting on deviations from the norm. They are aggregated data: cheap to store and query, but with no per-request detail.
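Prometheus doesn't store the 95th percentile directly; PromQL's histogram_quantile() estimates it from the cumulative bucket counts by linear interpolation within the bucket that contains the target rank. Here's a rough Go sketch of that estimation, using the bucket counts from the output above (a simplified illustration, not Prometheus's actual implementation):

```go
package main

import "fmt"

// bucket holds one histogram bucket: the upper bound (le) and the
// cumulative count of observations at or below that bound.
type bucket struct {
	le    float64
	count float64
}

// quantile estimates the q-th quantile (0 < q < 1) from cumulative
// histogram buckets, interpolating linearly inside the bucket that
// contains the target rank, in the spirit of histogram_quantile().
func quantile(q float64, buckets []bucket) float64 {
	total := buckets[len(buckets)-1].count
	rank := q * total
	prevBound, prevCount := 0.0, 0.0
	for _, b := range buckets {
		if b.count >= rank {
			if b.count == prevCount {
				return b.le
			}
			// Assume observations are spread evenly within the bucket.
			return prevBound + (b.le-prevBound)*(rank-prevCount)/(b.count-prevCount)
		}
		prevBound, prevCount = b.le, b.count
	}
	return buckets[len(buckets)-1].le
}

func main() {
	// Cumulative counts taken from the /metrics output above.
	buckets := []bucket{
		{0.25, 0}, {0.5, 10}, {1, 15}, {2.5, 15},
	}
	fmt.Printf("p50 estimate: %.3fs\n", quantile(0.50, buckets))
	fmt.Printf("p95 estimate: %.3fs\n", quantile(0.95, buckets))
}
```

This even-spread assumption is also why Prometheus percentile estimates are only as precise as your bucket boundaries: choose buckets near the latencies you care about.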

Now, traces. Traces are like following a single customer from the moment they walk into your coffee shop, through ordering, waiting for their drink, paying, and finally leaving. They show you the journey of a single request as it travels through your distributed system.

Consider a request hitting a web server, which then calls a user service, then an order service, and finally a payment service. A trace captures the timing and flow of each of these steps for that one request.

Here’s a conceptual example of what a trace might look like (this isn’t runnable code, but illustrates the data):

{
  "traceId": "a1b2c3d4e5f67890",
  "spans": [
    {
      "spanId": "1111111111111111",
      "traceId": "a1b2c3d4e5f67890",
      "parentId": null,
      "name": "GET /api/orders",
      "startTime": "2023-10-27T10:00:00.123Z",
      "duration": "150ms",
      "tags": { "http.method": "GET", "http.url": "/api/orders" },
      "logs": [ { "timestamp": "2023-10-27T10:00:00.123Z", "event": "request received" } ]
    },
    {
      "spanId": "2222222222222222",
      "traceId": "a1b2c3d4e5f67890",
      "parentId": "1111111111111111",
      "name": "UserService.GetUser",
      "startTime": "2023-10-27T10:00:00.150Z",
      "duration": "50ms",
      "tags": { "user.id": "user123" },
      "logs": [ { "timestamp": "2023-10-27T10:00:00.150Z", "event": "fetching user" } ]
    },
    {
      "spanId": "3333333333333333",
      "traceId": "a1b2c3d4e5f67890",
      "parentId": "1111111111111111",
      "name": "OrderService.GetOrderDetails",
      "startTime": "2023-10-27T10:00:00.170Z",
      "duration": "80ms",
      "tags": { "order.id": "order456" }
    },
    {
      "spanId": "4444444444444444",
      "traceId": "a1b2c3d4e5f67890",
      "parentId": "3333333333333333",
      "name": "PaymentService.ProcessPayment",
      "startTime": "2023-10-27T10:00:00.200Z",
      "duration": "30ms",
      "tags": { "payment.status": "success" }
    }
  ]
}

This shows a request to /api/orders that took 150ms. Within that, it called UserService.GetUser (50ms) and OrderService.GetOrderDetails (80ms). The order service, in turn, called PaymentService.ProcessPayment (30ms). Traces are crucial for debugging performance bottlenecks and understanding the root cause of errors in complex, distributed systems. They provide context and causality.

What makes distributed tracing uniquely valuable is that it is the only signal that reliably attributes latency across service boundaries. Without it, you are left correlating per-service dashboards by timestamp and guessing at causality.

When you’re debugging a slow request in a microservices architecture, you can’t just look at the CPU usage of each service. You need to see where the time is being spent. A trace allows you to pinpoint which service or even which specific operation within a service is contributing most to the overall latency. You can visualize this as a waterfall chart, where each bar represents a span (an operation) and its width indicates its duration. The parent-child relationships show the call stack.

Metrics tell you that something is slow or failing; traces tell you why and where. You need both for a complete picture. Metrics provide the high-level overview, like spotting a surge in error rates. Traces let you dive deep into a specific slow or erroneous request to diagnose the exact problem.

The next concept you’ll run into is how to collect and aggregate all this data, often involving tools like Prometheus for metrics and Jaeger or Zipkin for traces, and how to correlate them.

Want structured learning?

Take the full System Design course →