The most surprising thing about Splunk Observability APM’s profiling and tracing setup is how little you actually need to do to get meaningful data.
Let’s see it in action. Imagine you have a simple Python Flask app.
```python
from flask import Flask
import time
import random

app = Flask(__name__)

def slow_operation():
    time.sleep(random.uniform(0.1, 0.5))

def another_operation():
    time.sleep(random.uniform(0.05, 0.2))

@app.route('/')
def hello_world():
    slow_operation()
    another_operation()
    return 'Hello, World!'

if __name__ == '__main__':
    app.run(debug=True, port=5000)
```
To get APM data, you’d typically install the Splunk OpenTelemetry Collector and configure it to receive data from your application. Your application itself needs to be instrumented. For Python, this often involves the splunk-otel-python library.
First, install the necessary packages. The Splunk distribution for Python is published on PyPI as splunk-opentelemetry (the GitHub project is splunk-otel-python):

```shell
pip install flask "splunk-opentelemetry[all]"
splunk-py-trace-bootstrap   # installs instrumentation for the libraries it detects
```
Then configure the instrumentation. This is usually done by setting environment variables before your application starts. Note that the realm and access token are separate settings (not a single combined value), and that tracing is enabled by default:

```shell
export SPLUNK_ACCESS_TOKEN="YOUR_ACCESS_TOKEN"
export SPLUNK_REALM="YOUR_REALM"          # e.g. us0, us1, eu0
export SPLUNK_PROFILER_ENABLED="true"     # AlwaysOn Profiling; off by default
export OTEL_SERVICE_NAME="my-flask-app"   # crucial for identifying your service

# The splunk-py-trace runner injects the instrumentation automatically.
# If you can't use the runner, you'd initialize the tracer manually instead.
splunk-py-trace python app.py
```
When you run your application with these environment variables set, the splunk-otel Python agent automatically hooks into your Flask application. It intercepts incoming requests, outgoing HTTP calls, database queries (if configured), and function calls. It then generates traces and profiles, sending them to the Splunk Observability Cloud via the OpenTelemetry Collector.
Here’s how the mental model breaks down:

- Instrumentation: This is the core. Libraries like splunk-otel-python (or similar for Java, Node.js, Go, etc.) act as agents. They attach to your running application and "listen" for specific events: function calls, network I/O, database interactions. For tracing, they create spans representing these operations and link them together to form a trace. For profiling, they periodically capture the call stack of your running threads to show where CPU time is being spent.
- Data Collection (Collector): The instrumented application doesn’t send data directly to Splunk Observability Cloud. Instead, it sends it to a local or remote Splunk OpenTelemetry Collector. This collector acts as a gateway. It can receive data in various formats (like OTLP), process it (filter, batch, add metadata), and then export it to its final destination.
- Exporting to Splunk: The Collector is configured to send the processed traces, profiles, and metrics to your Splunk Observability Cloud realm. This is where your access token comes in.
- Visualization & Analysis: In Splunk Observability Cloud, you can then explore these traces. You’ll see a timeline of requests, with each segment representing a span. You can drill down into spans to see their duration, any associated logs, and importantly, for profiled services, you can view the flame graph that shows exactly which functions consumed CPU time during that request.
The real power comes when you start looking at distributed traces. If your Flask app calls another microservice, and that microservice calls a database, the trace will span all these hops. You can instantly see which service is the bottleneck. The profiling data then tells you why that service is slow – is it an inefficient algorithm, excessive I/O, or something else entirely?
A common point of confusion is understanding the difference between tracing and profiling. Tracing shows you the path of a request and the duration of each step. Profiling shows you what your code is doing during a specific period, especially regarding CPU usage. You can have a very fast trace with a high CPU profile if the code executing within a span is inefficient. Conversely, a slow trace might have a relatively flat profile if the time is spent waiting for external resources (like network I/O or database queries) rather than actively computing.
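A quick way to feel that difference: both of the functions below can take similar wall-clock time inside a span, but only one would light up a CPU flame graph (a standard-library-only sketch; the function names and loop sizes are arbitrary):

```python
import time

def cpu_bound(n=2_000_000):
    # Busy the CPU: these frames show up hot in a CPU profile.
    total = 0
    for i in range(n):
        total += i * i
    return total

def io_bound(seconds=0.2):
    # Wait on "the network": wall-clock time passes, but the profiler
    # sees a sleeping thread, so the CPU profile stays nearly flat.
    time.sleep(seconds)

start = time.perf_counter()
cpu_bound()
print(f"cpu_bound: {time.perf_counter() - start:.3f}s of real CPU work")

start = time.perf_counter()
io_bound()
print(f"io_bound:  {time.perf_counter() - start:.3f}s of mostly waiting")
```

In a trace, both would appear as spans of comparable duration; only the profile tells you which one is worth optimizing with better code and which one needs a faster dependency.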
The Splunk OpenTelemetry Collector is incredibly flexible. You can run it as a standalone binary, as a Docker container, or even as a Kubernetes DaemonSet or Deployment. Its configuration (otel-collector-config.yaml) defines how it receives, processes, and exports telemetry data. You can add processors for tail-based sampling, resource detection, and exporters for various backends, including Splunk.
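A minimal otel-collector-config.yaml pipeline might look like the sketch below. Treat the exporter name, endpoint, and header as placeholders — the exact exporters and ingest URLs depend on your Collector distribution and realm:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:  # batches telemetry before export to reduce network overhead

exporters:
  otlphttp:
    endpoint: "https://ingest.YOUR_REALM.signalfx.com"  # placeholder realm
    headers:
      X-SF-Token: "YOUR_ACCESS_TOKEN"                   # placeholder token

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```

The three top-level sections mirror the gateway model described earlier: receivers take data in, processors transform it, and exporters send it on, with `service.pipelines` wiring them together.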
When you’re setting up profiling, ensure your application’s runtime environment supports it. For Java, this might involve JVM arguments. For Python, the splunk-otel-python agent handles much of this, but it’s important to note that profiling can add overhead, so it’s typically enabled selectively or during specific testing phases.
The next logical step after getting basic tracing and profiling set up is to configure advanced sampling strategies and integrate with your logging infrastructure.