Prometheus data is surprisingly bad at telling you when something started failing, even though it excels at telling you that it failed.

Let’s watch this in action. Imagine you’re tracking request latency for a service. You’ve got Prometheus scraping your application, and you’re visualizing it in Grafana.

http_requests_total{method="POST", handler="/api/v1/users", status="200"}

This is a counter. It goes up. When you look at rate(http_requests_total[5m]), you’re seeing the average rate of requests over the last five minutes. If a blip of errors happened, say, three minutes ago, and the rate has since recovered, your rate() graph might look perfectly fine. You’ve lost the signal in the averaging.

rate(http_requests_total{method="POST", handler="/api/v1/users", status="500"}[5m]) > 0

This query fires if any 500 errors were recorded in the last five minutes, but because rate() averages over the window, it tells you neither how many there were nor when they started.

The problem is that Prometheus counter types are designed for accumulation, not for capturing discrete events or their precise start times. When you aggregate them with functions like rate() or increase(), you’re smoothing out the very information you often need for root-cause analysis: the exact moment a problem began.

To truly understand the onset of an issue, you need a data model that preserves individual data points and their timestamps with high fidelity, and that lets you query for specific event occurrences rather than aggregated rates. This is where a time-series database like TimescaleDB shines: it lets you store Prometheus metrics (or metrics like them) in a way that doesn't lose this temporal granularity.

Here’s how you’d model this in TimescaleDB. You’d create a standard PostgreSQL table, but then transform it into a TimescaleDB hypertable.

CREATE TABLE http_requests (
    time TIMESTAMPTZ NOT NULL,
    method TEXT,
    handler TEXT,
    status TEXT,
    count BIGINT
);

SELECT create_hypertable('http_requests', 'time');

Now, instead of just storing the scraped counter values, you can ingest data in a way that represents changes or individual observations. For instance, you might store deltas, or even individual request outcomes. If you were to ingest data that represents each request:

INSERT INTO http_requests (time, method, handler, status)
VALUES (NOW(), 'POST', '/api/v1/users', '200');

Or, if you’re ingesting aggregated Prometheus data, you’d store the scrape timestamp and the delta observed since the last scrape. The key is that your time column is a first-class citizen and the table is partitioned by time.
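One way to turn raw scraped counter values into per-scrape deltas is a window query with lag(). The raw_scrapes staging table and its value column here are hypothetical, a sketch of the transformation rather than a complete ingestion pipeline:

INSERT INTO http_requests (time, method, handler, status, count)
SELECT time, method, handler, status,
       -- Delta since the previous scrape of the same series.
       -- GREATEST(..., 0) clamps negative deltas, absorbing counter
       -- resets; lag() is NULL on the first sample, and GREATEST
       -- ignores NULLs, so the first row gets a count of 0.
       GREATEST(value - lag(value) OVER w, 0) AS count
FROM raw_scrapes
WINDOW w AS (PARTITION BY method, handler, status ORDER BY time);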

To find when 500 errors started occurring, you wouldn’t use rate(). You’d query for the first occurrence of a 500 status code within a specific time window.

SELECT time
FROM http_requests
WHERE status = '500'
  AND time > NOW() - INTERVAL '1 hour'  -- the "specific time window"
ORDER BY time ASC
LIMIT 1;

This query directly asks: "When was the earliest time we saw a 500 status code?" That is a fundamental distinction from Prometheus's rate(), which only tells you the average rate over a period.
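To keep that first-occurrence lookup fast as the hypertable grows, you can add a composite index on (status, time) so PostgreSQL can find the earliest matching row with an index scan instead of reading every chunk. This is an optional tuning step, not something the schema above requires:

-- TimescaleDB already indexes the time column by default;
-- this adds the status dimension for status-filtered,
-- time-ordered queries.
CREATE INDEX idx_http_requests_status_time
    ON http_requests (status, time);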

You can also do much more complex temporal analysis. For example, to find the exact time a surge in 500 errors began, you might look for a point where the number of 500s within a small window exceeds a threshold, and then find the first instance of that condition.

WITH error_counts AS (
    SELECT
        time,
        COUNT(*) OVER (ORDER BY time RANGE BETWEEN INTERVAL '1 minute' PRECEDING AND CURRENT ROW) as count_in_window
    FROM http_requests
    WHERE status = '500'
)
SELECT time
FROM error_counts
WHERE count_in_window > 10 -- Example threshold: more than 10 errors in a minute
ORDER BY time ASC
LIMIT 1;

This query identifies the start of a distinct burst of errors, not just a general increase in error rate. The RANGE BETWEEN INTERVAL '1 minute' PRECEDING AND CURRENT ROW clause is a window frame specification: for each row, it counts the errors within the rolling one-minute window ending at that row, letting you pinpoint the moment a problematic trend began.
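An alternative to the sliding window is TimescaleDB's time_bucket(), which groups rows into fixed one-minute buckets. It is less precise about the exact onset, since a burst can straddle a bucket boundary, but it is typically cheaper to compute. A sketch using the same hypothetical threshold of ten errors per minute:

-- First fixed one-minute bucket containing more than 10 errors.
SELECT time_bucket('1 minute', time) AS minute,
       COUNT(*) AS errors
FROM http_requests
WHERE status = '500'
GROUP BY minute
HAVING COUNT(*) > 10
ORDER BY minute ASC
LIMIT 1;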

Most people struggle with Prometheus when they need to answer "When did this start?" rather than "What’s the current rate?" The fundamental difference lies in how rate() aggregates data versus how a relational query on a time-series database can pinpoint exact event timestamps.

The next challenge you’ll face is effectively ingesting and transforming your Prometheus metrics into this more granular, event-oriented model within TimescaleDB.

Want structured learning?

Take the full TimescaleDB course →