Monitoring isn’t just about knowing whether your service is up; it’s about understanding how it’s performing, from every angle, and why.

Let’s look at a typical web service. Imagine a user request coming in.

# User Request Flow
1. User's browser -> DNS lookup
2. DNS server -> IP address
3. Browser -> HTTP request to IP address
4. Load Balancer -> Forwards to Web Server
5. Web Server -> Processes request, queries Database
6. Database -> Returns data
7. Web Server -> Generates response
8. Load Balancer -> Returns response to User
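The latency a user experiences is roughly the sum of those hops, which is why the total alone tells you so little. A toy breakdown (the per-hop timings below are made-up illustrative numbers, not measurements):

```python
# Hypothetical per-hop timings (seconds) for the request flow above
stages = {
    "dns_lookup": 0.020,
    "tcp_tls_handshake": 0.050,
    "load_balancer": 0.002,
    "web_server_processing": 0.080,
    "database_query": 0.120,
    "response_transfer": 0.030,
}

total = sum(stages.values())
slowest = max(stages, key=stages.get)
print(f"total={total:.3f}s, slowest hop={slowest}")
```

A user only ever sees `total`; finding `slowest` is what the rest of this section is about.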

We can monitor this from the outside, pretending we’re the user. This is blackbox monitoring. We’re hitting the service’s public interface and measuring response times, success rates, and availability. Think of it like calling your bank from your cell phone and timing how long it takes to get through to a representative.

Here’s a simple blackbox check using curl from an external machine:

curl -o /dev/null -s -w "HTTP_CODE:%{http_code}  TTFB:%{time_starttransfer}\n" https://your-service.example.com/health

This command checks the /health endpoint of your-service.example.com.

  • -o /dev/null: Discards the response body; we only care about the metadata.
  • -s: Silent mode; suppresses the progress meter and error messages (add -S to keep errors visible).
  • -w "HTTP_CODE:%{http_code}  TTFB:%{time_starttransfer}\n": Writes out the HTTP status code and the Time To First Byte (TTFB).

A typical monitoring setup would run this every minute from multiple geographic locations and alert if the HTTP code isn’t 200 or if TTFB exceeds, say, 500ms. This tells you if the service is globally reachable and responsive to external users.
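The probe-plus-alert-rule setup described above can be sketched in a few lines of standard-library Python (the URL and thresholds are placeholders, not a real monitoring product's API):

```python
import time
import urllib.request

ALERT_TTFB_SECONDS = 0.5  # threshold from the text: alert if TTFB > 500ms

def probe(url, timeout=5):
    """One blackbox check: return (http_code, ttfb_seconds), or (None, None) on failure."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read(1)  # wait for the first byte of the body
            return resp.status, time.monotonic() - start
    except Exception:
        return None, None

def should_alert(http_code, ttfb):
    """Alert if the check failed outright, returned non-200, or was too slow."""
    return http_code != 200 or ttfb is None or ttfb > ALERT_TTFB_SECONDS
```

A real setup would run `probe()` on a schedule from several regions and feed `should_alert()` into your paging system, rather than alerting on a single failed probe.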

But what happens if that check fails? Blackbox monitoring tells you that something is wrong, but not where. Is the load balancer misconfigured? Is the web server overloaded? Is the database slow? We need to look inside the system. This is whitebox monitoring.

Whitebox monitoring involves instrumenting the application and its underlying infrastructure to expose internal metrics. For our web service, this means collecting data from:

  • Web Servers: Request rates, error rates (4xx, 5xx), latency per endpoint, CPU/memory usage, network I/O.
  • Databases: Query latency, connection counts, disk I/O, replication lag, CPU/memory usage.
  • Load Balancers: Backend health, active connections, request distribution.
  • Application Code: Specific business logic metrics, e.g., "number of orders processed per second," "user login failures."

Let’s say our web server is Nginx. We can enable the stub_status module to get basic metrics:

# In your nginx.conf or a site-specific config
server {
    listen 80;
    server_name your-service.example.com;

    location /nginx_status {
        stub_status on;
        access_log off;
        allow 127.0.0.1; # Allow access from localhost for metrics scraping
        deny all;
    }

    # ... other configurations
}

A Prometheus exporter (such as the official nginx-prometheus-exporter) can then scrape http://localhost/nginx_status and expose metrics like nginx_http_requests_total and nginx_connections_active to a central Prometheus server.
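The stub_status page is plain text, and parsing it yourself is a good way to see exactly what the exporter reads. A minimal parser for the standard stub_status output format:

```python
def parse_stub_status(text):
    """Parse Nginx stub_status output into a dict of integer metrics."""
    lines = text.strip().splitlines()
    metrics = {}
    # Line 1: "Active connections: N"
    metrics["active_connections"] = int(lines[0].split(":")[1])
    # Line 3: three counters under the "server accepts handled requests" header
    accepts, handled, requests = (int(x) for x in lines[2].split())
    metrics.update(accepts=accepts, handled=handled, requests=requests)
    # Line 4: "Reading: R Writing: W Waiting: Q"
    parts = lines[3].replace(":", "").split()
    metrics.update(reading=int(parts[1]), writing=int(parts[3]), waiting=int(parts[5]))
    return metrics

# Example output as documented for the stub_status module
sample = """Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106
"""
print(parse_stub_status(sample))
```

The exporter does essentially this on every scrape, then re-emits the values in the Prometheus text format.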

From the application itself, we can use libraries like Prometheus client libraries (available for Go, Python, Java, etc.) to expose custom metrics. For example, in Python:

from prometheus_client import start_http_server, Counter, Summary
import random
import time

# Define metrics
REQUEST_COUNT = Counter('my_app_requests_total', 'Total number of requests received.')
REQUEST_LATENCY = Summary('my_app_request_latency_seconds', 'Time taken to process a request.')

@REQUEST_LATENCY.time()  # records each call's duration in the summary
def process_request():
    REQUEST_COUNT.inc()
    time.sleep(random.uniform(0.1, 0.5))  # simulate work

if __name__ == '__main__':
    # Expose metrics on http://localhost:8000/metrics
    start_http_server(8000)
    while True:
        process_request()

This Python application, running on the same server as the web service, exposes its own metrics at http://localhost:8000/metrics. Prometheus can scrape these, giving us visibility into the application’s internal performance.
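On the Prometheus side, telling the server to scrape that endpoint is a small config fragment (the job name and interval here are arbitrary choices, not requirements):

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: 'my_app'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:8000']
```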

The real power comes from combining these. If blackbox monitoring shows an increase in user-facing latency, whitebox metrics can pinpoint the bottleneck. Are Nginx request rates spiking? Is the my_app_request_latency_seconds summary showing high percentiles? Is pg_stat_statements on the database surfacing a particularly slow query? This allows for targeted debugging and optimization.
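Since it's the tail, not the average, that those high percentiles expose, it helps to be concrete about what a p95 actually is. A small sketch using only the standard library:

```python
import statistics

def p95(latency_samples):
    """95th percentile of raw latency samples (seconds)."""
    # quantiles(n=100) returns the 99 cut points; index 94 is the 95th percentile
    return statistics.quantiles(latency_samples, n=100)[94]
```

In practice Prometheus computes percentiles for you (from a Summary's quantiles or a Histogram), but the definition is the same: 95% of observed requests completed at or below this latency.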

A common misconception is that whitebox monitoring means instrumenting every single line of code. This is rarely necessary or practical. Instead, focus on key performance indicators (KPIs) and critical paths within your application and infrastructure. Think about the metrics that directly correlate with user experience and system health.

When you’re diving into whitebox metrics, look for trends and correlations. If the total request count from Nginx is flat, but the application’s REQUEST_LATENCY is climbing, the problem is likely within the application code or its dependencies, not Nginx itself. If database connections are maxed out, you need to investigate query efficiency or connection pooling.
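That correlation reasoning is mechanical enough to sketch in code. A toy triage function (the signal names and rules are illustrative, mirroring the logic above, not a real alerting API):

```python
def triage(blackbox_ok, nginx_requests_flat, app_latency_climbing, db_connections_maxed):
    """Map blackbox + whitebox signals to a likely place to start debugging."""
    if blackbox_ok:
        return "no user-facing issue detected"
    if db_connections_maxed:
        return "database: investigate query efficiency or connection pooling"
    if nginx_requests_flat and app_latency_climbing:
        return "application code or its dependencies (not Nginx, not traffic)"
    if not nginx_requests_flat:
        return "traffic change: check capacity and load balancing"
    return "unclear: widen the investigation"
```

Real incident response is rarely this clean, but encoding your first-pass hypotheses like this is a useful exercise before writing alert rules.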

When you’ve successfully implemented both blackbox and whitebox monitoring, the next challenge is often setting up effective alerting that leverages both types of data to avoid alert fatigue and ensure timely incident response.

Want structured learning?

Take the full SRE course →