The fundamental trick of Vector's sample transform is that it throws away most of your logs, but makes it look like it didn’t.
Here’s how it actually works:
Let’s say you have a pipeline processing logs. You’ve got a vector.toml config like this:
[sources.my_source]
type = "file"
include = ["/var/log/app.log"]
[transforms.sample]
type = "sample"
inputs = ["my_source"]
# Keep 1 out of every 1000 logs
rate = 1000
[sinks.my_sink]
type = "blackhole"
inputs = ["sample"]
When Vector runs, it reads from /var/log/app.log. The sample transform's rate is expressed as 1/N: with rate = 1000, it forwards 1 out of every 1000 events and discards the rest, so roughly 0.1% of the stream survives. The blackhole sink then receives this thinned stream.
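The mechanic is easy to model outside Vector. Here is a minimal Python sketch of 1-in-N count-based sampling (the `sample_stream` helper is hypothetical, an illustration of the rate = N behavior, not Vector's actual implementation):

```python
def sample_stream(events, rate):
    """Forward 1 out of every `rate` events, count-based."""
    kept = []
    for i, event in enumerate(events):
        if i % rate == 0:  # keep the 1st, (rate+1)th, ... event
            kept.append(event)
    return kept

logs = [f"log line {i}" for i in range(10_000)]
sampled = sample_stream(logs, rate=1000)
print(len(sampled))  # → 10 events survive out of 10,000
```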
This is incredibly useful for reducing the sheer volume of logs sent to storage or analysis systems. Instead of paying to store and process 100% of your logs, you might only be storing 0.1%. This can drastically cut costs and reduce noise in your monitoring dashboards, allowing you to focus on the truly exceptional events.
The core problem it solves is the cost and complexity of handling massive log volumes. Imagine an application generating millions of log lines per minute. Storing all of that, indexing it, and querying it can become prohibitively expensive and slow. Sampling allows you to maintain a representative subset for debugging and anomaly detection without the overhead of the full dataset.
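The cost arithmetic is worth making concrete. A back-of-the-envelope sketch, with hypothetical numbers for the line rate and average line size:

```python
lines_per_minute = 2_000_000   # hypothetical app volume
bytes_per_line = 200           # hypothetical average line size

# Daily volume of the full stream, in GB
gb_per_day = lines_per_minute * 60 * 24 * bytes_per_line / 1e9
print(round(gb_per_day))              # → 576 GB/day unsampled

# After 1-in-1000 sampling (rate = 1000)
print(round(gb_per_day * 0.001, 2))   # → 0.58 GB/day
```

At typical indexed-storage pricing, that difference is the gap between a line item and a rounding error.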
Here’s a slightly more complex scenario. You want to sample logs before enriching them, but only send the sampled, enriched logs to your destination.
[sources.my_source]
type = "file"
include = ["/var/log/app.log"]
[transforms.sample]
type = "sample"
inputs = ["my_source"]
rate = 100 # Keep 1 in 100 (1%) for now
[transforms.enrich]
type = "remap"
inputs = ["sample"]
source = '''
# get_hostname is fallible in VRL, so abort on error with `!`
.hostname = get_hostname!()
.timestamp = now()
'''
[sinks.my_sink]
type = "blackhole"
inputs = ["enrich"]
In this setup, Vector first reads the logs from my_source. The sample transform then applies the 1-in-100 sampling. Only that 1% of logs proceeds to the enrich transform, where a hostname and a timestamp are added. Finally, this reduced, enriched stream goes to my_sink. The ordering matters: enrichment is only computed for the subset of logs you intend to keep, saving CPU cycles.
The key lever you control is the rate parameter, expressed as 1/N. A rate of 1 means no sampling (keep everything); rate = 10 keeps 10%; rate = 1000 keeps 0.1%. Larger values keep proportionally less, and there is no setting that discards everything, so if you want to drop a class of logs entirely, reach for a filter transform instead. You can also set key_field, which bases the sampling decision on a hash of a named field, so that related events (for example, all lines sharing a request ID) stay together in or out of the sample.
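A sketch of the key_field variant, assuming your logs carry a request_id field (the field name here is an assumption about your schema):

```toml
[transforms.sample]
type = "sample"
inputs = ["my_source"]
rate = 1000
# Hash request_id so every event for a sampled request is kept together,
# giving you complete request traces rather than isolated lines.
key_field = "request_id"
```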
The most surprising thing is that even with a very aggressive rate, like rate = 100000 (1 in 100,000), the sampled logs often provide enough context for debugging. Because the default sampling is content-agnostic, you’re not systematically excluding any particular type of log message, as long as your log volume is high enough to yield a statistically meaningful sample. This means you can often find the needle in the haystack without looking at the entire haystack.
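That statistical claim can be checked with a toy stream: sample 1 in 1,000 from a stream where roughly 5% of lines are errors, and the error share in the sample stays close to 5%. (The random error injection below is just to build a test stream, seeded for reproducibility; the sampling itself is count-based, like rate = 1000.)

```python
import random

random.seed(42)
# 1,000,000 log lines, ~5% of which are errors.
stream = ["ERROR" if random.random() < 0.05 else "INFO"
          for _ in range(1_000_000)]

# Keep 1 in 1,000 (count-based, like rate = 1000).
sample = stream[::1000]

error_share = sample.count("ERROR") / len(sample)
print(len(sample), round(error_share, 3))  # 1,000 lines, error share near 0.05
```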
If you’re sampling lightly, say rate = 2 (keep half), and you observe that you’re still getting too many logs, raise the rate, or add a transform that filters on specific log content before the sample transform sees anything.
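A sketch of that filter-then-sample arrangement, assuming your logs carry a level field (the field name and the drop_debug transform name are assumptions, not from the config above):

```toml
[transforms.drop_debug]
type = "filter"
inputs = ["my_source"]
# Drop debug lines outright; only the rest are worth sampling.
condition = '.level != "debug"'

[transforms.sample]
type = "sample"
inputs = ["drop_debug"]
rate = 2   # keep half of the remaining logs
```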