The most surprising thing about the SRE tools ecosystem is how much of it you can build yourself, and how often the "best" tool is the one you’ve already got but are using slightly differently.

Let’s look at a typical SRE workflow: incident detection, alert triage, diagnosis, remediation, and post-mortem analysis. Each of these stages has a rich set of tools, both open-source and commercial.

Incident Detection & Alerting

Imagine a service experiencing high latency. You need to know before users start complaining.

  • Open Source: Prometheus is the de facto standard for metrics collection and alerting. You’d configure exporters (like node_exporter for host metrics, kube-state-metrics for Kubernetes object states, or custom application exporters) to expose metrics. Prometheus then scrapes these targets. Alerting rules are defined in YAML, like this:

    groups:
    - name: HostAlerts
      rules:
      - alert: HighCpuLoad
        expr: node_load1 > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High load on {{ $labels.instance }}"
          description: "Load average (1m) is {{ $value }} on {{ $labels.instance }}, sustained for 5 minutes."

    Alertmanager then deduplicates, groups, and routes these alerts to your chosen notification channels (Slack, PagerDuty, email).

  • Commercial: Datadog, New Relic, and Dynatrace offer integrated platforms. You install an agent, and it automatically discovers services, collects metrics, traces, and logs, and provides sophisticated alerting with anomaly detection. For example, in Datadog, a monitor query might look like: avg(last_5m):avg:system.cpu.user{env:prod} by {host} > 80.
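    To make the Alertmanager step concrete, here is a minimal routing configuration sketch. The receiver names, channel, and webhook URL are placeholders, and exact field names can vary between Alertmanager versions:

    ```yaml
    # Sketch of an Alertmanager config: group related alerts, route by severity.
    route:
      receiver: default-slack            # fallback receiver for everything
      group_by: ['alertname', 'instance']
      group_wait: 30s                    # wait before sending the first notification for a group
      group_interval: 5m
      repeat_interval: 4h
      routes:
      - match:
          severity: critical
        receiver: pagerduty-oncall       # page a human only for critical alerts

    receivers:
    - name: default-slack
      slack_configs:
      - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ   # placeholder webhook
        channel: '#alerts'
    - name: pagerduty-oncall
      pagerduty_configs:
      - service_key: REPLACE_WITH_INTEGRATION_KEY
    ```

    The grouping settings are what keep a noisy incident from generating hundreds of pages: alerts sharing the same alertname and instance are batched into a single notification.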

Triage & Diagnosis

An alert fires. Now what? You need context.

  • Open Source:

    • Logs: Elasticsearch, Logstash, and Kibana (ELK stack) or its newer iteration, the Elastic Stack, are common. Fluentd or Vector are often used as log shippers. You’d query Kibana for logs related to the affected service and time window: service:my-app AND level:ERROR AND @timestamp:[now-15m TO now].
    • Tracing: Jaeger or Zipkin allow you to visualize the request flow through your distributed system. If a request to service-A is slow, tracing shows if it’s waiting on service-B or service-C.
    • Metrics (again): Grafana is often used with Prometheus (or InfluxDB, etc.) to create rich dashboards that correlate metrics like CPU, memory, network I/O, and application-specific metrics. You’d look for spikes or drops in these dashboards around the time of the alert.
  • Commercial: These platforms unify logs, metrics, and traces. Datadog’s APM (Application Performance Monitoring) can automatically link an alert to a slow transaction, show its trace, and then let you drill down into the logs for that specific trace. This significantly reduces MTTR (Mean Time To Resolution).
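As a hedged example of the log-shipping step above, a minimal Vector pipeline forwarding application logs to Elasticsearch might look roughly like this. The file paths, index pattern, and endpoint are placeholders, and option names differ somewhat between Vector versions:

```yaml
# Sketch of a Vector config (vector.yaml): tail app logs, ship to Elasticsearch.
sources:
  app_logs:
    type: file
    include:
      - /var/log/my-app/*.log          # placeholder path

sinks:
  elasticsearch_out:
    type: elasticsearch
    inputs: [app_logs]
    endpoints: ["http://elasticsearch:9200"]   # placeholder endpoint
    bulk:
      index: "my-app-%Y.%m.%d"         # daily indices make retention policies simple
```

Daily indices (the %Y.%m.%d suffix) are a common convention because they let you expire old logs by deleting whole indices rather than individual documents.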

Remediation

The problem is identified. Time to fix it.

  • Open Source:

    • Configuration Management: Ansible, Chef, Puppet are used to automate infrastructure changes. If a service needs more memory, an Ansible playbook might update the Kubernetes deployment or cloud provider configuration.
    • Orchestration: Kubernetes itself is a massive remediation tool. kubectl rollout restart deployment/my-app or kubectl scale deployment/my-app --replicas=5 are common commands.
    • Runbooks: Often documented in a wiki or Git repository, these are step-by-step guides for common issues. They might involve running specific commands, checking configuration files, or restarting services.
  • Commercial: Some platforms offer "auto-remediation" where they can trigger automated actions based on alerts. PagerDuty, for example, can integrate with Ansible or custom scripts to run remediation actions when an alert is triggered.
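A remediation playbook for the "needs more replicas" case above might be sketched like this with Ansible. This assumes the kubernetes.core collection is installed and a kubeconfig is available; the deployment name and namespace are illustrative:

```yaml
# Sketch of an Ansible playbook: scale a Kubernetes deployment as a remediation step.
- name: Scale my-app to absorb load
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Set replica count on the deployment
      kubernetes.core.k8s:
        state: present
        definition:
          apiVersion: apps/v1
          kind: Deployment
          metadata:
            name: my-app          # illustrative name
            namespace: prod       # illustrative namespace
          spec:
            replicas: 5
```

Because the k8s module applies a declarative definition, rerunning the playbook is idempotent, which matters when an auto-remediation hook might fire more than once for the same alert.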

Post-Mortem Analysis

What happened? Why? How do we prevent it?

  • Open Source: A combination of tools. You’d pull data from Prometheus, Grafana, ELK, Jaeger, and your Git history. Tools like git log and kubectl rollout history can show what changes were deployed. The post-mortem itself is often a document (e.g., in Confluence, Google Docs, or a Markdown file in Git).
  • Commercial: Platforms like PagerDuty or incident management tools often have built-in post-mortem templates and collaboration features. They can also link directly to the relevant metrics, logs, and traces captured during the incident, making analysis much faster.

The SRE tools ecosystem is a patchwork. Many organizations start with Prometheus/Grafana/ELK and then layer commercial solutions on top for specific needs (e.g., advanced APM, unified observability, or better incident management). The key is understanding the workflow and choosing tools that integrate well, whether they’re open-source, commercial, or even custom-built scripts.

What many people don’t realize is the power of combining kubectl get events with your metrics and logs. When an alert fires on a Kubernetes cluster, the first place to look isn’t always application logs. kubectl get events --field-selector involvedObject.name=my-app-pod-xyz --sort-by='.lastTimestamp' can reveal crucial system-level issues, such as OOMKilled containers, scheduling failures, or network policy rejections, that directly impact your application’s health and are often the root cause of the alert.
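You can also surface those same system-level failures as alerts rather than waiting to discover them during triage. As a hedged sketch, a Prometheus rule built on the kube-state-metrics series kube_pod_container_status_last_terminated_reason (exact metric names vary by kube-state-metrics version) might look like:

```yaml
# Sketch of a Prometheus rule: alert when a container's last termination was an OOM kill.
groups:
- name: KubernetesContainerAlerts
  rules:
  - alert: ContainerOOMKilled
    expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
    labels:
      severity: warning
    annotations:
      summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} was OOMKilled"
      description: "Check memory limits and recent usage before simply restarting the pod."
```

This turns a fact you’d otherwise dig out of kubectl events into a first-class signal in the same alerting pipeline as everything else.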

The next logical step in this journey is often exploring chaos engineering, where you proactively inject failures to test your system’s resilience and the effectiveness of your SRE tooling.

Want structured learning?

Take the full Reliability Engineering (SRE) course →