Splunk ITSI’s KPI setup isn’t just about defining metrics; it’s about encoding business impact into your monitoring.
Let’s watch a service health KPI come alive. Imagine we’re monitoring the "E-commerce Checkout Service." A critical KPI for this would be "Checkout Completion Rate."
Here’s what that might look like in Splunk ITSI:
- Data Source: `index=weblogs sourcetype=apache_access` (or whatever your web server logs are)
- Base Search: `index=weblogs sourcetype=apache_access status=200 | stats count as total_checkouts`
  This counts every successful (HTTP 200) request in those logs. In practice you would also restrict the search to the checkout endpoint.
- KPI Thresholds:
  - Critical: < 95%
  - Warning: 95% - 98%
  - Info: > 98%
- Calculation:
  - We need to compare `total_checkouts` to the total number of checkout attempts. This implies we need another event type or status code to represent a failed checkout. Let’s assume a `status=500` or `status=503` represents a failed checkout attempt.

  The full KPI search would look something like:

      | tstats count from datamodel=weblogs.apache_access where apache_access.status IN (200, 500, 503) by apache_access.status
      | rename apache_access.status as status
      | stats sum(count) as total_attempts, sum(eval(if(status==200, count, 0))) as success_attempts
      | eval completion_rate = round((success_attempts / total_attempts) * 100, 2)
      | fields completion_rate

  (Note: This assumes an accelerated data model named `weblogs` with an `apache_access` dataset that exposes a `status` field. For actual ITSI KPI setup, you’d typically use the ITSI UI to define the data sources, fields, and aggregation logic, which then generates a search like this. The `tstats` command is often used for performance in ITSI.)
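To make the arithmetic behind the KPI concrete, here is a minimal Python sketch of the same calculation outside Splunk. It assumes, as above, that status 200 means a completed checkout and 500 or 503 mean a failed attempt; the function names and the sample data are hypothetical, and treating exactly 98% as Warning is one reading of the threshold bands.

```python
from collections import Counter

# Status codes treated as checkout outcomes (the article's assumption).
SUCCESS_STATUS = 200
FAILED_STATUSES = {500, 503}


def completion_rate(status_codes):
    """Return the completion rate as a percentage, or None if there were no attempts."""
    counts = Counter(status_codes)
    successes = counts[SUCCESS_STATUS]
    attempts = successes + sum(counts[s] for s in FAILED_STATUSES)
    if attempts == 0:
        return None  # no data, so avoid dividing by zero
    return round(successes / attempts * 100, 2)


def severity(rate):
    """Map a completion rate to the threshold bands listed above."""
    if rate < 95:
        return "Critical"
    if rate <= 98:
        return "Warning"
    return "Info"


# Hypothetical sample: 960 successes, 30 HTTP 500s, 10 HTTP 503s.
sample = [200] * 960 + [500] * 30 + [503] * 10
rate = completion_rate(sample)
print(rate, severity(rate))  # prints: 96.0 Warning
```

Status codes outside the three listed are ignored here, which mirrors the `IN (200, 500, 503)` filter in the search.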
This KPI tells us not just whether the checkout endpoint is responding, but how often users actually complete the checkout flow. A low completion rate can mean the service is technically up but functionally broken for customers.
The mental model ITSI builds around KPIs is hierarchical. A KPI like "Checkout Completion Rate" belongs to the "E-commerce Checkout Service." That service, in turn, might be a dependency of a higher-level "E-commerce Platform" service. This lets you see a single, high-level health status (e.g., "E-commerce Platform is Yellow") and then drill down through the contributing services and KPIs to pinpoint the root cause.
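The drill-down idea can be sketched as a small tree walk in Python. This is only an illustration of the hierarchy, not how ITSI stores or traverses services; the `Service` class, the `worst_path` helper, and the severity ranking are all hypothetical.

```python
from dataclasses import dataclass, field

# Ordering used to compare severities (an assumption for this sketch).
SEVERITY_RANK = {"Info": 0, "Warning": 1, "Critical": 2}


@dataclass
class Service:
    name: str
    kpis: dict = field(default_factory=dict)      # KPI name -> severity
    children: list = field(default_factory=list)  # services this one depends on


def worst_path(service):
    """Return (severity, path) for the worst KPI at or below this service."""
    worst = ("Info", [service.name])
    for kpi, sev in service.kpis.items():
        if SEVERITY_RANK[sev] > SEVERITY_RANK[worst[0]]:
            worst = (sev, [service.name, kpi])
    for child in service.children:
        sev, path = worst_path(child)
        if SEVERITY_RANK[sev] > SEVERITY_RANK[worst[0]]:
            worst = (sev, [service.name] + path)
    return worst


checkout = Service(
    "E-commerce Checkout Service",
    kpis={"Checkout Completion Rate": "Warning", "Average Response Time": "Info"},
)
platform = Service("E-commerce Platform", children=[checkout])

status, path = worst_path(platform)
print(status, " > ".join(path))
```

Starting from the top-level service, the walk returns the chain of names leading to the KPI in the worst state, which is the path an operator would follow when drilling down.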
The "Service Health Score" for a service is typically an aggregation of its contributing KPIs. ITSI computes it as a weighted average, using the importance value you assign to each KPI as the weight. A "Checkout Completion Rate" might be weighted higher than "Average Response Time" because a failed checkout has a more direct business impact.
What most people don’t realize is that the "aggregation" of KPIs into a service score isn’t just a simple average. ITSI allows for complex aggregation rules. You can define that if any single critical KPI drops below its threshold, the entire service health score immediately becomes critical, regardless of other KPIs. This mimics real-world business impact where a single failure point can render a whole service unusable.
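A rough Python sketch of this aggregation, a weighted average plus a "one critical KPI overrides everything" rule, might look like the following. It is not ITSI's actual scoring algorithm: the severity-to-score mapping, the weights, and the `must_pass` parameter are assumptions chosen for illustration.

```python
# Hypothetical mapping from KPI severity to a 0-100 score.
SEVERITY_SCORE = {"Info": 100, "Warning": 50, "Critical": 0}


def health_score(kpis, must_pass=()):
    """Compute a service score from KPIs.

    kpis: KPI name -> (severity, weight)
    must_pass: names of KPIs whose Critical state forces the score to 0
    """
    # Override rule: a single critical must-pass KPI makes the service critical.
    for name in must_pass:
        if kpis[name][0] == "Critical":
            return 0.0
    total_weight = sum(weight for _, weight in kpis.values())
    if total_weight == 0:
        raise ValueError("at least one KPI needs a non-zero weight")
    weighted = sum(SEVERITY_SCORE[sev] * weight for sev, weight in kpis.values())
    return round(weighted / total_weight, 1)


kpis = {
    "Checkout Completion Rate": ("Critical", 8),
    "Average Response Time": ("Info", 4),
}
print(health_score(kpis))                                          # prints: 33.3
print(health_score(kpis, must_pass=["Checkout Completion Rate"]))  # prints: 0.0
```

Without the override, a healthy response time partly masks the broken checkout; with it, the service score reflects that customers cannot complete purchases.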
The next concept you’ll likely encounter is correlating these service health KPIs with actual business metrics, like revenue or order volume, to prove the direct financial impact of IT performance.