Observability isn’t just about collecting data; it’s about understanding the emergent behavior of complex systems you didn’t fully design.

Let’s see what that looks like in practice. Imagine a simple web service that needs to fetch user data from a database and then an external API.

// User Service
func getUserProfile(userID string) (*UserProfile, error) {
	// 1. Fetch user data from internal DB
	user, err := db.GetUser(userID)
	if err != nil {
		// Log this error
		return nil, fmt.Errorf("db error: %w", err)
	}

	// 2. Fetch user details from external API
	details, err := externalAPI.GetDetails(user.ExternalID)
	if err != nil {
		// Log this error
		return nil, fmt.Errorf("api error: %w", err)
	}

	// Combine and return
	return &UserProfile{
		Name: user.Name,
		Email: user.Email,
		ExternalInfo: details,
	}, nil
}

Now, how do we observe this?

Metrics are the aggregations. They tell you how many or how much. For our service, we’d track:

  • Request Rate: http_requests_total{method="GET", path="/users/{id}"}. This is a counter, increasing with every request.
  • Error Rate: http_requests_total{method="GET", path="/users/{id}", status="5xx"}. The same counter, sliced by a status label so you can isolate errors.
  • Latency: http_request_duration_seconds{method="GET", path="/users/{id}"}. This is a histogram or summary, capturing how long requests take.

You’d typically see these in a dashboard like Grafana, showing trends over time. A sudden spike in the error rate or latency metric for /users/{id} would immediately tell you something is wrong with this endpoint.

Logs are the discrete events. They provide context for why something happened. When db.GetUser or externalAPI.GetDetails fails, we log it:

// Logging the DB error
log.Error().Str("userID", userID).Err(err).Msg("Failed to fetch user from database")

// Logging the API error
log.Error().Str("userID", userID).Str("externalID", user.ExternalID).Err(err).Msg("Failed to fetch details from external API")

These logs, often sent to a centralized system like Elasticsearch or Loki, would show us the specific error messages from the database driver or the HTTP client, giving us clues about the root cause. If the error rate metric spiked, we’d then filter logs for that time window and endpoint to see the specific errors.

Traces connect the dots across services. When getUserProfile calls db.GetUser and externalAPI.GetDetails, a trace links these operations together. Each step (a "span") has a start time, duration, and metadata.

A trace for a successful getUserProfile call might look like:

  • GET /users/{id} (Root Span)
    • db.GetUser (Child Span)
    • externalAPI.GetDetails (Child Span)

If externalAPI.GetDetails is slow, the trace visually shows that GET /users/{id} is taking a long time because externalAPI.GetDetails is slow. This is crucial for distributed systems where a single user-facing request might involve dozens of internal and external calls. Tools like Jaeger or Zipkin visualize these.

Profiles reveal performance bottlenecks within a service. If getUserProfile is slow, metrics tell us that it’s slow, logs might tell us why (e.g., an external service is slow), but profiling tells us where the time is spent inside our own code.

For example, a CPU profile might show that a significant amount of time is spent in a JSON unmarshalling function or a complex data transformation, even if no errors were logged. This helps optimize code that is correct but inefficient. Tools like pprof in Go or Java Flight Recorder provide these insights.

The true power emerges when you combine them. A user reports slowness. Your metrics show increased latency for /users/{id}. You look at traces for requests during that period and see that the externalAPI.GetDetails span is consistently taking longer. You then check logs for that specific trace ID and find repeated "timeout" errors when calling the external API. Finally, you might use profiling on your service during high load to see if your own code is contributing to timeouts by holding connections open too long.

Most people think of metrics, logs, and traces as distinct tools, but they are really different lenses on the same underlying system behavior. The key is to instrument your code so that these different data types are linked, typically via a common Trace ID. Without this linkage, you’re just collecting mountains of data without the ability to connect the dots.

The next step is understanding how to effectively set sampling rates for traces to balance cost and visibility.
