The most surprising thing about Vector's deduplication is that it doesn’t actually remove events from your logs; it simply stops forwarding duplicates downstream.
Let’s see this in action. Imagine you have a simple Vector pipeline that tails a file and prints to stdout.
[sources.my_source]
type = "file"
include = ["/tmp/input.log"]
[transforms.dedupe]
type = "dedupe"
inputs = ["my_source"]
# We'll deduplicate based on the entire event for now
# field = ["message"] # or restrict deduplication to one or more fields
[sinks.my_sink]
type = "stdout"
inputs = ["dedupe"]
Now, let’s create a file with some duplicate lines:
echo "event 1" > /tmp/input.log
echo "event 2" >> /tmp/input.log
echo "event 1" >> /tmp/input.log # Duplicate
echo "event 3" >> /tmp/input.log
echo "event 2" >> /tmp/input.log # Duplicate
When Vector runs with the configuration above, you’ll see:
{timestamp: ..., message: "event 1"}
{timestamp: ..., message: "event 2"}
{timestamp: ..., message: "event 3"}
Notice that "event 1" and "event 2" appear only once, even though they were in the input file twice. The dedupe transform is holding onto the first occurrence and discarding subsequent identical events.
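The pass/drop behavior above can be reproduced in a few lines of Python — an illustrative model of whole-event deduplication, not Vector's actual implementation:

```python
# The five lines written to /tmp/input.log, duplicates included
lines = ["event 1", "event 2", "event 1", "event 3", "event 2"]

seen = set()
passed = []
for line in lines:
    if line not in seen:       # first occurrence only
        seen.add(line)
        passed.append(line)

print(passed)                  # ['event 1', 'event 2', 'event 3']
```

Each line is checked against everything seen so far; only first occurrences survive, which is exactly the output Vector printed above.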
The core problem Vector’s dedupe transform solves is the overwhelming noise and cost associated with processing and storing redundant log data. Many systems generate identical log messages repeatedly under normal operation, especially under load or during retries. Sending and storing these duplicates inflates metrics, makes searching harder, and can incur unnecessary costs in downstream systems like S3 or Elasticsearch.
Internally, the dedupe transform maintains a set of unique event identifiers within a configurable time window. When an event arrives, its identifier is generated based on the fields you specify. If this identifier has been seen within the window, the event is dropped. If it’s new, or if the previous occurrence fell outside the window, the event is passed downstream, and its identifier is added to the set.
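The mechanism just described — an identifier set with a time window — can be sketched in Python. This is a simplified model of the behavior, not Vector's Rust internals; the class and its hashing scheme are assumptions for illustration:

```python
import hashlib
import json
import time

class WindowedDedupe:
    """Illustrative model: an identifier stays 'seen' until
    `window_secs` seconds have elapsed since it last passed through."""

    def __init__(self, window_secs, fields=None):
        self.window = window_secs
        self.fields = fields       # None means hash the whole event
        self.seen = {}             # identifier -> timestamp of last pass-through

    def _identifier(self, event):
        if self.fields is None:
            payload = json.dumps(event, sort_keys=True)
        else:
            payload = "\x00".join(str(event.get(f)) for f in self.fields)
        return hashlib.sha256(payload.encode()).hexdigest()

    def process(self, event, now=None):
        """Return the event if it is new (or its window expired), else None."""
        now = time.monotonic() if now is None else now
        ident = self._identifier(event)
        last = self.seen.get(ident)
        if last is not None and now - last < self.window:
            return None            # duplicate within the window: drop
        self.seen[ident] = now     # new or expired: pass downstream
        return event
```

Note that dropped duplicates do not refresh the timestamp here; the window is measured from the last event that actually passed through.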
The primary levers you control are the mode and window settings.
mode: This determines what constitutes a duplicate.
all: The default. The entire event is hashed to create the identifier. If two events are byte-for-byte identical (after any internal normalization), they are considered duplicates.
fields: You specify a list of fields. The values of those fields are concatenated and hashed to create the identifier. This is common when deduplicating on a specific message or transaction ID.
window: This is a duration (e.g., "5m", "1h") that defines how long an event’s identifier remains "seen." If an identical event arrives after this window has passed since the first occurrence, it will be treated as a new, unique event. This prevents infinite deduplication and allows for eventual reprocessing of events if needed.
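The difference between the two modes comes down to what feeds the hash. Here is a small illustration — the hashing details (SHA-256 over a normalized JSON dump, or over concatenated field values) are assumptions for demonstration, not Vector's exact scheme:

```python
import hashlib
import json

def event_id_all(event):
    # mode = "all": hash the entire (normalized) event
    return hashlib.sha256(json.dumps(event, sort_keys=True).encode()).hexdigest()

def event_id_fields(event, fields):
    # mode = "fields": hash only the values of the listed fields
    joined = "\x00".join(str(event.get(f)) for f in fields)
    return hashlib.sha256(joined.encode()).hexdigest()

a = {"message": "request 123 started", "user": "alice"}
b = {"message": "request 123 started", "user": "bob"}

print(event_id_all(a) == event_id_all(b))                            # False: users differ
print(event_id_fields(a, ["message"]) == event_id_fields(b, ["message"]))  # True
```

Under mode = "all" the differing user field makes the events distinct; under mode = "fields" with ["message"], they collide and the second is dropped.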
Here’s an example using mode = "fields":
[sources.my_source]
type = "file"
include = ["/tmp/input.log"]
[transforms.dedupe]
type = "dedupe"
inputs = ["my_source"]
mode = "fields"
field = ["message"] # Deduplicate only based on the 'message' field
window = "1m" # Keep identifiers for 1 minute
[sinks.my_sink]
type = "stdout"
inputs = ["dedupe"]
If /tmp/input.log contained:
{"message": "request 123 started", "user": "alice"}
{"message": "request 123 processed", "user": "alice"}
{"message": "request 123 started", "user": "bob"} # Different user, but same message
With the configuration above, the first and second lines both pass through because their message fields differ. The third line, however, is dropped: its message matches the first line's, and the differing user field is ignored. The same applies if you had another line:
{"message": "request 123 started", "user": "charlie"} # Same message again
This fourth line would be deduplicated with the first line if it arrived within the 1m window, because the message field ("request 123 started") is identical.
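To make the pass/drop decisions concrete, here is a minimal Python replay of the four events, assuming deduplication keyed on the message field alone and that all four arrive within the 1m window:

```python
# The four example events in arrival order
events = [
    {"message": "request 123 started",   "user": "alice"},
    {"message": "request 123 processed", "user": "alice"},
    {"message": "request 123 started",   "user": "bob"},
    {"message": "request 123 started",   "user": "charlie"},
]

seen = set()
passed = []
for event in events:
    key = event["message"]        # dedupe keyed on the 'message' field only
    if key not in seen:
        seen.add(key)
        passed.append(event)

# Only alice's two events survive; bob's and charlie's lines are dropped
# because their message matches alice's first event.
print([e["message"] for e in passed])
```

Only the first occurrence of each distinct message reaches the sink, regardless of which user produced it.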
The dedupe transform is stateful: it needs to remember what it has seen. By default this state lives in memory, so if Vector restarts, it is lost and deduplication effectively resets. For long-running, critical deduplication that must survive restarts, configure the transform’s state storage: the memory option keeps state in RAM, while the disk option persists it to disk, allowing state recovery after a restart.
[transforms.dedupe]
type = "dedupe"
inputs = ["my_source"]
mode = "fields"
field = ["message"]
window = "5m"
# Configure state storage for resilience
state = { type = "disk", path = "/var/lib/vector/dedupe_state" }
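The effect of persisting state can be modeled in a few lines of Python. This is a sketch only — the on-disk format is internal to Vector, and the path and save/load functions here are hypothetical:

```python
import json
import os

# Hypothetical path standing in for the configured state directory
STATE_PATH = "/tmp/dedupe_state.json"

def load_state(path=STATE_PATH):
    """Restore the identifier -> last-seen map after a restart."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)

def save_state(seen, path=STATE_PATH):
    """Persist the map so deduplication can survive a restart."""
    with open(path, "w") as f:
        json.dump(seen, f)

# Without persistence, a restart empties `seen` and previously seen events
# pass through again; with it, dedupe picks up where it left off.
seen = load_state()
seen["some-event-identifier"] = 1700000000.0
save_state(seen)
```

The point is the round trip: identifiers written before a restart are loaded back afterwards, so duplicates that straddle the restart are still caught.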
The next challenge you’ll likely encounter is handling events that are almost duplicates but differ by a timestamp or a generated ID that you want to ignore for deduplication purposes.