Vector’s configuration is all about defining how data flows through it, from where it starts to where it ends up, and what happens in between.

Let’s see Vector in action. Imagine we have a simple setup where we’re collecting logs from a file, transforming them to add a timestamp, and then sending them to stdout so we can see them.

# vector.toml
[sources.my_logs]
type = "file"
include = ["/tmp/my_app.log"]

[transforms.add_timestamp]
type = "remap"
inputs = ["my_logs"]
source = '''
# The file source reads each line as a raw string into .message,
# so parse it as JSON first, then add the timestamp.
. = parse_json!(string!(.message))
.timestamp = now()
'''

[sinks.my_stdout]
type = "console"
inputs = ["add_timestamp"]
encoding.codec = "json"

When we run vector --config vector.toml and then write some JSON lines to /tmp/my_app.log:

{"message": "User logged in", "user_id": "abc"}
{"message": "Item added to cart", "item_id": "xyz"}

Vector will print output like this to our terminal (the file source also attaches metadata fields such as file, host, and source_type, omitted here for clarity):

{"timestamp": "2023-10-27T10:30:00.123Z", "message": "User logged in", "user_id": "abc"}
{"timestamp": "2023-10-27T10:30:01.456Z", "message": "Item added to cart", "item_id": "xyz"}

Notice how each log line now has a timestamp field added by the remap transform. This is the core idea: sources ingest data, transforms process it, and sinks export it.

The real power comes from chaining these components together. A source can feed multiple transforms, and a transform can receive input from multiple sources. Likewise, a sink can consume from multiple transforms. This creates a directed acyclic graph (DAG) of data processing.
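As a sketch, a small fan-out/fan-in topology might look like this (component names and paths here are hypothetical):

```toml
# One source feeding two transforms, and one sink consuming both.

[sources.app_logs]
type = "file"
include = ["/var/log/app/*.log"]

[transforms.errors_only]
type = "filter"
inputs = ["app_logs"]
condition = '.level == "error"'   # VRL condition; non-matching events are dropped

[transforms.stamped]
type = "remap"
inputs = ["app_logs"]
source = '''
.timestamp = now()
'''

[sinks.out]
type = "console"
inputs = ["errors_only", "stamped"]   # fan-in: this sink consumes both branches
encoding.codec = "json"
```

Because both transforms read from app_logs and the sink reads from both transforms, each event can take two paths through the graph, yet no cycle is possible.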

Here’s how the components fit together:

  • Sources: These are the entry points for your data. Vector ships a rich set of sources: reading files (the file source), listening on network ports (socket, http_server), consuming from message queues (kafka, aws_sqs), and receiving metrics (prometheus_remote_write, prometheus_scrape). Each source has specific configuration options controlling how it collects data. For the file source, the include and exclude glob patterns define which files to watch; for network sources, the address to bind is key.

  • Transforms: These are the workhorses that modify, enrich, filter, or aggregate your data as it flows through Vector. Vector offers a variety of transform types:

    • remap: For custom logic using Vector Remap Language (VRL). This is incredibly flexible for adding fields, renaming, dropping, or conditionally altering data.
    • filter: To drop events that don’t meet certain criteria.
    • route: To direct events to different sinks or transforms based on their content.
    • aggregate: To combine metric events over a time window.
    • Enrichment tables: configured under [enrichment_tables] rather than as a transform type, and queried from remap with VRL functions such as get_enrichment_table_record() to add data from GeoIP databases or static files.
    • And others, such as sample, dedupe, and throttle.

    The inputs field in a transform is vital; it specifies which source(s) or preceding transform(s) it should receive data from.
  • Sinks: These are the destinations for your processed data. Vector can send data to a vast array of services: cloud object storage (aws_s3, gcp_cloud_storage), databases (elasticsearch, clickhouse), logging platforms (datadog_logs, splunk_hec), message queues (kafka, redis), and of course the console (console). Sink configuration depends heavily on the destination, often requiring connection details, authentication credentials, and formatting options. The encoding block is particularly important for sinks, letting you specify how your data should be serialized (e.g., json, text, logfmt).
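For instance, a file sink pairs destination-specific options with an encoding block (a sketch; the path is hypothetical):

```toml
[sinks.archive]
type = "file"
inputs = ["add_timestamp"]
path = "/tmp/archive-%Y-%m-%d.log"   # strftime specifiers are expanded per event
encoding.codec = "json"
```

Swapping the destination means swapping the type and its connection options; the inputs wiring and encoding block follow the same pattern everywhere.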

The inputs directive is where the magic of connecting these components happens. A component will only process data that is explicitly sent to it. If a transform is named enrich_with_geoip and it has inputs = ["my_logs"], it means it will receive all data that flows out of the my_logs source. If you want it to also receive data from another transform, say parse_user_agent, you would change it to inputs = ["my_logs", "parse_user_agent"].
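Expressed in TOML, that wiring looks like the following (transform bodies are elided; the names come from the example above):

```toml
[transforms.parse_user_agent]
type = "remap"
inputs = ["my_logs"]
source = '''
# parsing logic here
'''

[transforms.enrich_with_geoip]
type = "remap"
inputs = ["my_logs", "parse_user_agent"]   # receives events from both upstream components
source = '''
# enrichment logic here
'''
```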

The Vector Remap Language (VRL) used in the remap transform is powerful. You can access event fields using dot notation (e.g., .message, .user_id), call built-in functions like now(), parse_json(), and upcase(), and use conditional logic (if/else). This allows for complex data manipulation without needing to write external scripts or services.
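A remap transform exercising these features might look like this (a sketch; the field names and branching logic are hypothetical):

```toml
[transforms.shape_event]
type = "remap"
inputs = ["my_logs"]
source = '''
# Parse the raw line, normalize a field, and branch on its content.
. = parse_json!(string!(.message))
.user_id = upcase(string!(.user_id))
if .user_id == "ADMIN" {
  .role = "administrator"
} else {
  .role = "user"
}
.timestamp = now()
'''
```

The ! suffix on parse_json! and string! tells VRL to abort processing the event if the operation fails, which keeps the rest of the script free of error handling.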

When you define a component, you give it a name (e.g., my_logs, add_timestamp, my_stdout). These names are arbitrary but must be unique within the configuration file. The inputs field then uses these names to build the data flow graph. Sources never take inputs; they are the roots of the graph with no upstream dependencies. For transforms and sinks, inputs is mandatory.

A subtle but crucial aspect of remap transforms is how they handle fields. When you assign a new field, like .timestamp = now(), you’re adding it to the event. If you reassign an existing field, like .message = "New message", you’re overwriting it. If you want to conditionally modify a field, you’d use if statements: if .user_id == "admin" { .role = "administrator" } else { .role = "user" }. This allows for fine-grained control over data transformation.

The next step after mastering basic configuration is understanding how to manage multiple inputs and outputs, and how Vector handles backpressure when downstream components can’t keep up.
