Vector’s internal metrics are the system’s way of telling you what’s happening inside itself, rather than just what data it’s processing.
Let’s see it in action. Imagine you’re running Vector, and you want to see how many events it’s received and how many it’s successfully sent. You’d typically expose these metrics via an HTTP endpoint that Prometheus can scrape.
Here’s a common way to configure this: Vector exposes its own telemetry through the built-in internal_metrics source, which you feed into a prometheus_exporter sink:
[sources.internal_metrics]
type = "internal_metrics"

[sinks.prometheus]
type = "prometheus_exporter"
inputs = ["internal_metrics"]
address = "127.0.0.1:9598"
With this in place, if you curl http://127.0.0.1:9598/metrics, you’ll see output like this:
# HELP vector_agent_heartbeat_seconds_since_last_check Agent heartbeat in seconds since last check.
# TYPE vector_agent_heartbeat_seconds_since_last_check gauge
vector_agent_heartbeat_seconds_since_last_check 0.001
# HELP vector_agent_running_seconds_total Total seconds the agent has been running.
# TYPE vector_agent_running_seconds_total counter
vector_agent_running_seconds_total 86400.5
# HELP vector_byte_counter_bytes_total Total bytes processed by Vector.
# TYPE vector_byte_counter_bytes_total counter
vector_byte_counter_bytes_total{direction="in",name="my_source"} 10000000
vector_byte_counter_bytes_total{direction="out",name="my_transform"} 9950000
vector_byte_counter_bytes_total{direction="out",name="my_sink"} 9900000
# HELP vector_event_counter_events_total Total events processed by Vector.
# TYPE vector_event_counter_events_total counter
vector_event_counter_events_total{direction="in",name="my_source"} 500000
vector_event_counter_events_total{direction="out",name="my_transform"} 498000
vector_event_counter_events_total{direction="out",name="my_sink"} 495000
# HELP vector_flush_durations_seconds Flush durations in seconds.
# TYPE vector_flush_durations_seconds summary
vector_flush_durations_seconds_sum{name="my_sink"} 123.45
vector_flush_durations_seconds_count{name="my_sink"} 1000
These metrics help you understand the throughput, latency, and overall health of your Vector pipeline. The vector_event_counter_events_total and vector_byte_counter_bytes_total counters are your primary indicators of data flow. The direction="in" and direction="out" labels tell you whether the count is for data entering a component or leaving it. The name label corresponds to the name you give your sources, transforms, and sinks in your Vector configuration.
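This label structure is easy to work with programmatically. As a rough sketch (the metric and label names simply mirror the example output above), a few lines of Python can split a Prometheus text-format sample into its name, labels, and value:

```python
import re

# Matches one Prometheus text-format sample: name{labels} value
SAMPLE_RE = re.compile(r'^(?P<name>[\w:]+)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

def parse_sample(line):
    """Parse a single metric line into (name, labels dict, float value)."""
    m = SAMPLE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(m.group("labels") or ""))
    return m.group("name"), labels, float(m.group("value"))

name, labels, value = parse_sample(
    'vector_event_counter_events_total{direction="in",name="my_source"} 500000'
)
# name   -> "vector_event_counter_events_total"
# labels -> {"direction": "in", "name": "my_source"}
# value  -> 500000.0
```

In practice a Prometheus client library does this for you, but the format is simple enough that a scrape-and-parse script is a handy debugging tool.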
For example, vector_event_counter_events_total{direction="in",name="my_source"} shows the total number of events that have entered the component named my_source. Similarly, vector_event_counter_events_total{direction="out",name="my_sink"} shows the total number of events that have successfully exited my_sink. The difference between these two for a given path in your pipeline can indicate dropped events or processing delays.
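Using the illustrative figures from the example output above, that comparison is just a subtraction:

```python
# Counter values taken from the example /metrics output above.
events_in_at_source = 500_000   # direction="in",  name="my_source"
events_out_at_sink = 495_000    # direction="out", name="my_sink"

# Events that entered the pipeline but have not (yet) left the sink.
# A small, stable gap is usually just in-flight data; a gap that keeps
# growing over successive scrapes points to drops or a backlog.
gap = events_in_at_source - events_out_at_sink
print(gap)  # 5000
```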
The vector_flush_durations_seconds_sum and vector_flush_durations_seconds_count metrics are crucial for understanding latency. By dividing the sum by the count, you get the average duration of a flush operation for a specific component, like my_sink, over its entire lifetime; to see recent latency instead, divide the increase in the sum by the increase in the count over a time window (in Prometheus, the rate() of one over the rate() of the other). A rising average flush duration means that component is taking longer and longer to process and send its data.
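With the sample values shown above for my_sink, the arithmetic looks like this:

```python
# Sample values for my_sink from the /metrics output above.
flush_durations_sum = 123.45   # vector_flush_durations_seconds_sum
flush_count = 1000             # vector_flush_durations_seconds_count

# Lifetime average flush duration, in seconds.
avg_flush_seconds = flush_durations_sum / flush_count
print(avg_flush_seconds)  # 0.12345, i.e. about 123 ms per flush
```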
The vector_agent_heartbeat_seconds_since_last_check metric is a simple gauge that indicates how recently Vector checked its own internal state. If this value starts to increase significantly, it might suggest that Vector is struggling to perform even basic internal tasks, which is a sign of deeper issues. vector_agent_running_seconds_total just tells you how long the agent has been up.
A key insight is that Vector’s internal metrics are keyed by component name: the name label corresponds to the component ID in your configuration, so every distinctly named source, transform, and sink gets its own time series. The flip side is that anything funneled through one shared component is aggregated under that single name. To isolate and monitor a specific data flow, give each leg of the pipeline its own uniquely named components rather than routing everything through a common one.
Many users overlook the vector_healthcheck_failure_total metric. This counter increments whenever a health check performed by Vector on one of its internal components fails. While this might seem like a low-level detail, a significant increase here can be an early warning sign that a particular component is becoming unstable or is unable to perform its essential operations, potentially leading to data loss or pipeline stagnation before more obvious symptoms appear.
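Because it is a cumulative counter, its absolute value matters less than whether it is still climbing. A minimal sketch of the delta check an alert rule would encode (the sample values are invented for illustration, and the reset handling assumes counters restart at zero):

```python
def failures_increasing(previous, current):
    """True if a cumulative failure counter grew between two scrapes.

    Counters reset to zero when the process restarts, so a current value
    below the previous one is treated as a reset, not a decrease.
    """
    if current < previous:      # counter reset (e.g. Vector restarted)
        return current > 0      # any post-restart failures still count
    return current > previous

print(failures_increasing(10, 10))  # False: stable, no new failures
print(failures_increasing(10, 13))  # True: new failures since last scrape
```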
When you start monitoring these metrics, you’ll begin to see patterns related to backpressure. If a sink is slow to acknowledge received data, the out counters for upstream components will lag behind their in counters, and flush durations will increase. This is Vector’s way of signaling that data is backing up.
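That pattern can be detected mechanically: scrape the in and out counters twice and compare how fast each grew. A minimal sketch, assuming two scrapes of one pipeline path (the numbers are invented):

```python
# (events_in, events_out) for one pipeline path at two scrape times.
scrape_t0 = (500_000, 495_000)
scrape_t1 = (520_000, 505_000)

in_rate = scrape_t1[0] - scrape_t0[0]    # events entering per interval
out_rate = scrape_t1[1] - scrape_t0[1]   # events leaving per interval

# If input consistently outpaces output, the backlog is growing:
# classic backpressure.
backlog_growth = in_rate - out_rate
print(backlog_growth > 0)  # True: data is backing up
```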
The next step in understanding Vector’s performance is to dive into the specifics of how it handles backpressure and how you can tune it.