OpenTelemetry in Splunk Kubernetes Monitoring isn’t just about sending metrics; it’s about tracing requests as they hop between services in your cluster, giving you a granular, end-to-end view of application behavior that traditional monitoring misses.
Let’s watch it in action. Imagine a user request hitting your Kubernetes cluster. It lands on an ingress controller, which routes it to a frontend service. That frontend service then calls a backend API, which might itself call a database or another microservice. Without distributed tracing, if that request slows down or fails, you’re left guessing which of those hops is the culprit.
Here’s what that looks like in Splunk Observability Cloud. We’ve configured our Kubernetes cluster to export traces using OpenTelemetry.
# Example deployment for an OpenTelemetry Collector
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  namespace: observability
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: otel-collector
          # The contrib image ships the /otelcol-contrib binary used below
          image: otel/opentelemetry-collector-contrib:latest
          command:
            - "/otelcol-contrib"
            - "--config=/etc/otel-collector-config/collector-config.yaml"
          volumeMounts:
            - name: otel-collector-config-vol
              mountPath: /etc/otel-collector-config
      volumes:
        - name: otel-collector-config-vol
          configMap:
            name: otel-collector-config
And the collector-config.yaml might look like this:
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:
    send_batch_size: 1000
  memory_limiter:
    check_interval: 1s
    limit_percentage: 20
    spike_limit_percentage: 10

exporters:
  splunk_hec:
    token: "YOUR_SPLUNK_HEC_TOKEN"
    endpoint: "https://your-splunk-instance.splunkcloud.com:8088/services/collector"
    tls:
      insecure_skip_verify: true # Set to false in production with proper certs

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [splunk_hec]
When a request flows through your instrumented applications (using OpenTelemetry SDKs in your code), it generates spans. Each span represents an operation (like an HTTP request, a database query, or a function call) and contains metadata like its start time, duration, name, and any associated attributes (e.g., HTTP method, URL, user ID). These spans are linked together by trace IDs and parent-child relationships, forming a complete trace that visualizes the entire journey of a request.
In Splunk, you’ll see these traces displayed as waterfalls. Each colored bar is a span, showing its duration and where it occurred in the request flow. You can click on any span to see its detailed attributes, logs, and metrics associated with that specific operation. This allows you to pinpoint latency issues – is the frontend slow to respond, or is it waiting on a slow backend? Is the database query taking too long?
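The "which hop is the culprit" question can be answered mechanically from raw span data: subtract each span's children's time from its own duration to get its exclusive (self) time, and the span with the largest self time is where the request actually waited. A sketch over simplified span records with made-up timings:

```python
# Simplified spans: (id, parent, duration in ms). Timings are invented.
spans = [
    {"id": "A", "parent": None, "name": "frontend GET /checkout", "duration_ms": 400},
    {"id": "B", "parent": "A", "name": "backend POST /orders", "duration_ms": 350},
    {"id": "C", "parent": "B", "name": "db SELECT orders", "duration_ms": 300},
]

# Sum each parent's child time, then subtract it from the parent's duration.
child_time = {}
for s in spans:
    if s["parent"] is not None:
        child_time[s["parent"]] = child_time.get(s["parent"], 0) + s["duration_ms"]

self_time = {s["name"]: s["duration_ms"] - child_time.get(s["id"], 0) for s in spans}
bottleneck = max(self_time, key=self_time.get)
# The frontend and backend each spent only 50 ms of their own time;
# the database query accounts for 300 ms, so it is the bottleneck.
```

This is essentially the computation behind the waterfall view: wide bars with little overlap from their children are where to look first.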
The core problem this solves is the "distributed monolith" scenario. As you break down monolithic applications into microservices, you gain flexibility but lose the inherent visibility of a single process. Distributed tracing bridges this gap by stitching together the activity across multiple services, making complex, distributed systems observable.
The system in action: You deploy your microservices, each instrumented with an OpenTelemetry SDK. You deploy an OpenTelemetry Collector (as shown above) to receive, process, and export the telemetry data. The collector is configured to send traces to your Splunk Observability Cloud instance. Splunk then indexes and visualizes this data, allowing you to query and analyze traces based on service name, operation name, attributes, duration, and more.
The exact levers you control are primarily in your application instrumentation and the OpenTelemetry Collector configuration. For application instrumentation, you choose which libraries to use (e.g., opentelemetry-java-instrumentation, opentelemetry-python) and configure them to auto-instrument common frameworks or manually add spans for critical code paths. For the collector, you tune receivers (e.g., OTLP, Jaeger), processors (e.g., batch, memory_limiter, attributes), and exporters (e.g., splunk_hec, prometheus).
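As an illustration of those collector-side levers, a hypothetical processors section that tunes batching and strips a sensitive attribute. The specific values and the attribute key are assumptions for the example, not recommendations:

```yaml
processors:
  batch:
    send_batch_size: 2048   # larger batches mean fewer, bigger exports
    timeout: 5s             # flush at least every 5 seconds regardless of size
  attributes:
    actions:
      - key: http.request.header.cookie  # example: drop a sensitive attribute
        action: delete
```

Any processor added here must also be listed, in order, in the pipeline's `processors` array under `service.pipelines` for it to take effect.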
One aspect that often surprises people is how much context you can enrich a trace with. Beyond just the service and operation, you can add arbitrary attributes to spans: Kubernetes pod names, deployment versions, user IDs, tenant IDs, feature flags, error codes. This rich attribute set is crucial for effective filtering and debugging. For example, you might filter traces to only show requests from a specific user, hitting a particular version of a service, or experiencing a certain error type. This isn’t just about seeing the flow; it’s about seeing the flow under specific conditions that matter to your business.
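The attribute-based filtering described above reduces to a simple predicate over span attributes. A sketch in plain Python over simplified span records, with illustrative field names and values:

```python
# Simplified span records with enriched attributes (values are made up).
spans = [
    {"name": "GET /cart", "attributes": {"user.id": "u-42", "service.version": "1.9.0", "error": False}},
    {"name": "GET /cart", "attributes": {"user.id": "u-7",  "service.version": "2.0.0", "error": True}},
    {"name": "POST /pay", "attributes": {"user.id": "u-42", "service.version": "2.0.0", "error": True}},
]

def filter_spans(spans, criteria):
    """Keep spans whose attributes match every key=value criterion."""
    return [s for s in spans
            if all(s["attributes"].get(k) == v for k, v in criteria.items())]

# "Show me failing requests on version 2.0.0" as an attribute filter:
failing_on_v2 = filter_spans(spans, {"service.version": "2.0.0", "error": True})
```

In practice Splunk runs this kind of query over indexed trace data for you; the point is that every attribute you attach at instrumentation time becomes a dimension you can filter and group by later.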
Once you’ve mastered tracing, the next logical step is correlating that trace data with metrics and logs from the same services and requests.