Structured logging isn’t just about making logs human-readable; it’s about making them machine-readable, which is crucial for analyzing events across vast, distributed systems. The most surprising truth about structured logging at scale is that not all logs are created equal, and trying to capture everything can break your system more than it helps.
Consider a high-throughput microservice handling thousands of requests per second. Each request might generate dozens of log lines, each with a unique request ID, user ID, timestamp, and various event details. If you try to ingest and store every single log line from every service, you’ll quickly drown in data. Storage costs explode, querying becomes impossibly slow, and your logging infrastructure itself becomes a bottleneck.
This is where sampling and structured logging truly shine together. Let’s imagine a simple user_login event.
Unstructured Log (Bad):
2023-10-27 10:30:01 INFO User 'alice' logged in from 192.168.1.100
Structured Log (Good):
{
"timestamp": "2023-10-27T10:30:01Z",
"level": "INFO",
"message": "User logged in",
"user_id": "alice",
"ip_address": "192.168.1.100",
"event_type": "user_login"
}
The structured log can be parsed by machines. You can easily query for all user_login events, or all events from user_id: alice. But if alice logs in a million times a day, you don’t necessarily need to see every single one if the system is healthy.
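To make the machine-readability concrete, here is a minimal sketch of querying structured logs in Python. The sample log lines and field values are taken from the example above; in practice this filtering would happen in your log backend (Elasticsearch, Loki, etc.) rather than in application code.

```python
import json

# A few structured log lines, as they might arrive from the logging backend.
raw_lines = [
    '{"timestamp": "2023-10-27T10:30:01Z", "level": "INFO", "message": "User logged in", '
    '"user_id": "alice", "ip_address": "192.168.1.100", "event_type": "user_login"}',
    '{"timestamp": "2023-10-27T10:30:05Z", "level": "INFO", "message": "User logged in", '
    '"user_id": "bob", "ip_address": "10.0.0.5", "event_type": "user_login"}',
    '{"timestamp": "2023-10-27T10:30:09Z", "level": "ERROR", "message": "Token expired", '
    '"user_id": "alice", "event_type": "auth_failure"}',
]

events = [json.loads(line) for line in raw_lines]

# Query: all user_login events for a given user -- no regex required.
alice_logins = [e for e in events
                if e["event_type"] == "user_login" and e["user_id"] == "alice"]
print(len(alice_logins))
```

The same query against the unstructured line would require a fragile regular expression that breaks the moment someone rewords the message.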
This is where sampling comes in. We can configure our logging agent or application to sample logs based on certain criteria. For instance, we might decide to log:
- 100% of ERROR and FATAL level logs.
- 10% of WARN level logs.
- 1% of INFO level logs.
- 0% of DEBUG level logs (unless debugging is explicitly enabled for a specific trace).
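The policy above can be sketched as a small rate table plus a decision function. This is a simplified illustration, not a real logging library's API; the SAMPLE_RATES dict and should_log name are hypothetical.

```python
import random

# Hypothetical per-level sampling rates mirroring the policy above.
SAMPLE_RATES = {
    "FATAL": 1.0,   # 100%
    "ERROR": 1.0,   # 100%
    "WARN": 0.10,   # 10%
    "INFO": 0.01,   # 1%
    "DEBUG": 0.0,   # 0% unless debugging is explicitly enabled
}

def should_log(level: str, debug_enabled: bool = False) -> bool:
    """Decide whether to emit a log line at the given level."""
    if level == "DEBUG" and debug_enabled:
        return True
    rate = SAMPLE_RATES.get(level, 1.0)  # unknown levels: keep everything
    return random.random() < rate

# ERROR and FATAL are always kept; DEBUG is always dropped by default.
print(should_log("ERROR"))  # True
print(should_log("DEBUG"))  # False
```

Treating unknown levels as "keep everything" is a deliberately conservative default: a misconfigured level name should never silently drop logs.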
Let’s say we have a service, auth-service, and we want to log user logins. In our application code, we might use a library that supports structured logging and conditional sampling.
import datetime
import json
import random

# Assume a structured logging formatter is configured elsewhere;
# for demonstration, we'll just print JSON strings.

def log_user_login(user_id: str, ip_address: str):
    log_data = {
        "timestamp": datetime.datetime.utcnow().isoformat() + "Z",
        "level": "INFO",
        "message": "User logged in",
        "user_id": user_id,
        "ip_address": ip_address,
        "event_type": "user_login",
    }
    # Sampling logic: log 1% of INFO-level user_login events
    if log_data["level"] == "INFO" and log_data["event_type"] == "user_login":
        if random.random() < 0.01:  # 1% chance
            print(json.dumps(log_data))
    else:
        # Log all other levels or event types without sampling
        print(json.dumps(log_data))

# Example usage:
# Imagine this is called millions of times a day
# log_user_login("alice", "192.168.1.100")
# log_user_login("bob", "10.0.0.5")
The random.random() < 0.01 check is the core of the sampling: on average, only 1 out of every 100 INFO-level user_login events is actually sent to the logging backend.
When an error does occur, we need to be able to trace it. This is where a unique trace_id or request_id becomes paramount. Every log line associated with a single request or operation should carry the same trace_id. If an error happens, we can then filter our sampled logs for that specific trace_id and get a complete picture of what happened during that single, problematic transaction.
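One way to guarantee that every log line in a request carries the same trace_id is to bind the ID to the execution context at the edge and read it inside the logging helper. The sketch below uses Python's standard contextvars module; the start_request and log function names are hypothetical, not part of any particular framework.

```python
import contextvars
import datetime
import json
import uuid

# One context variable holds the trace_id for the current request.
trace_id_var = contextvars.ContextVar("trace_id", default=None)

def start_request() -> str:
    """Generate a trace_id at the edge and bind it to the current context."""
    trace_id = uuid.uuid4().hex
    trace_id_var.set(trace_id)
    return trace_id

def log(level: str, message: str, **fields) -> str:
    """Emit a structured log line that always carries the current trace_id."""
    record = {
        "timestamp": datetime.datetime.utcnow().isoformat() + "Z",
        "level": level,
        "message": message,
        "trace_id": trace_id_var.get(),
        **fields,
    }
    line = json.dumps(record)
    print(line)
    return line

# Every log call inside this request automatically shares one trace_id.
tid = start_request()
line = log("INFO", "User logged in", user_id="alice")
```

Because the trace_id is injected by the helper rather than passed by hand, no call site can forget it, which is exactly the property reconstruction depends on.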
This strategy allows us to keep the volume of logs manageable while still providing the ability to perform deep dives when necessary. For critical events like errors, we’d configure our system to never sample them. A common pattern is to have separate configurations for different log levels or event types.
Consider a logging agent like Fluentd or Filebeat. You’d configure input sources, filters, and outputs. A filter might look like this (simplified Fluentd configuration):
<filter **>
  @type record_transformer
  enable_ruby true
  <record>
    trace_id ${record['trace_id']} # Pass through existing trace_id
    user_id ${record['user_id']}
    # ... other fields
  </record>
</filter>

<filter auth_service.log>
  @type record_transformer
  enable_ruby true
  <record>
    # Sample INFO-level user_login events:
    # if event_type is user_login and level is INFO, keep only if random < 0.01;
    # otherwise, keep the log.
    # This is a conceptual representation; the actual implementation varies by plugin.
    # A more robust approach is often to use separate parsers/filters for different events.
  </record>
</filter>

<match **>
  @type stdout # or your backend, such as Elasticsearch
</match>
A more practical sampling approach often involves probabilities applied at the source, or using a dedicated sampling processor. For example, in a system using Kafka, you might have a Kafka Streams application that consumes logs, applies sampling logic, and then produces the sampled logs to a new topic.
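The heart of such a stream processor is a per-record keep-or-drop decision. The sketch below isolates just that decision as a pure function; in a real pipeline it would sit inside a Kafka consumer/producer loop or a Kafka Streams processor, and the sample_stream name and 1% default rate are assumptions for illustration.

```python
import json
import random
from typing import Optional

def sample_stream(record_json: str, info_rate: float = 0.01) -> Optional[str]:
    """Decide, per record, whether to forward it to the sampled topic.

    Returns the record to produce, or None to drop it. Error-level
    records always pass through; everything else is kept with
    probability info_rate.
    """
    record = json.loads(record_json)
    level = record.get("level", "INFO")
    if level in ("ERROR", "FATAL"):
        return record_json          # never sample errors
    if random.random() < info_rate:
        return record_json          # keep a small fraction of the rest
    return None                     # drop

# Errors always pass through, regardless of the configured rate.
err = '{"level": "ERROR", "message": "boom", "trace_id": "abc"}'
print(sample_stream(err) is not None)  # True
```

Keeping the decision in one pure function also makes it trivial to unit-test the sampling policy separately from the messaging plumbing.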
The key is that the trace_id must be universally present and consistently generated. If you have a distributed trace ID, you can reconstruct the path of a request through multiple services. When an error occurs, you search for logs containing that trace_id and the ERROR level. Because errors are not sampled, you’ll get all the logs for that specific error trace.
What most people don’t realize is that the distribution of sampled logs can be as important as the logs themselves. If your sampling mechanism is biased (e.g., it samples more from one server than another due to network latency or load), you might miss critical events occurring disproportionately on that server. Ensuring your sampling is truly random across all instances and events is crucial for its effectiveness.
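One common way to get an unbiased, instance-independent decision is to hash the trace_id instead of calling a local random generator: every service instance then computes the same keep/drop answer for a given trace, so sampled traces are kept everywhere rather than surviving as fragments. The sketch below is one possible implementation of this hash-based (trace-consistent) technique, not code from any particular library.

```python
import hashlib

def keep_trace(trace_id: str, rate: float = 0.01) -> bool:
    """Deterministic, trace-consistent sampling decision.

    SHA-256 spreads trace IDs uniformly over the output space, so the
    keep rate is unbiased regardless of which server handled the
    request, and the decision is reproducible from the trace_id alone.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    # Map the first 8 bytes of the hash to a bucket in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate

# The same trace_id always gets the same decision, on any instance.
print(keep_trace("trace-123") == keep_trace("trace-123"))  # True
```

The trade-off versus random.random() is that the decision is now a property of the trace rather than of the log line, which is usually what you want for reconstruction but means per-line rates no longer apply within a kept trace.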
The next hurdle you’ll encounter is managing log retention policies for different log levels and event types, ensuring you keep critical errors indefinitely while purging less important information.