Parsing arbitrary text logs into structured data is a fundamental problem in observability, and Vector’s transform system offers a flexible way to solve it.

Let’s see Vector’s parsing transforms in action with a common scenario: ingesting unstructured Syslog messages and extracting key fields.

Imagine you have a stream of Syslog messages like this:

Oct 26 10:00:00 myhost sudo[12345]: user alice accepted keyboard-interactive/pam for alice from 192.168.1.100 port 54321 ssh

This is hard to query directly. We want to break it down into fields like hostname, program, pid, user, and source_ip.

Here’s how you’d configure Vector to do that. In current Vector, parsing is done in a remap transform written in VRL (Vector Remap Language); earlier releases shipped dedicated grok_parser and json_parser transforms, but remap has superseded them. For well-formed Syslog you can simply call parse_syslog!, which understands the standard header structure (Oct 26 10:00:00 myhost) and automatically populates fields like timestamp, hostname, appname, and procid. For lines that only loosely follow the format, parse_grok! gives you full control:

[transforms.grok_parser]
type = "remap"
inputs = ["host_logs"] # Assuming 'host_logs' is your source
source = '''
. = parse_grok!(string!(.message),
  s'%{SYSLOGTIMESTAMP:timestamp} %{SYSLOGHOST:hostname} %{PROG:program}(?:\[%{NUMBER:pid}\])?: %{GREEDYDATA:message}')
'''

The grok pattern is where the magic happens. Grok uses a set of predefined patterns (like SYSLOGTIMESTAMP, SYSLOGHOST, PROG, NUMBER) and allows you to combine them to define your own.

  • %{SYSLOGTIMESTAMP:timestamp}: Matches the timestamp format and names the extracted field timestamp.
  • %{SYSLOGHOST:hostname}: Matches the hostname and names it hostname.
  • %{PROG:program}: Matches the program name (e.g., sudo) and names it program.
  • (?:\[%{NUMBER:pid}\])?: A non-capturing group (?:...) that optionally matches a bracketed %{NUMBER} (the PID) and names it pid; the trailing ? makes the whole PID part optional.
  • %{GREEDYDATA:message}: Matches everything remaining to the end of the line and names it message.
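Under the hood, grok patterns expand to regular expressions. As an illustration only (not Vector code, and with simplified pattern bodies — the real grok definitions are more thorough), here is roughly how the combined pattern behaves, sketched in Python:

```python
import re

# Simplified regex equivalent of the SYSLOGTIMESTAMP grok pattern
SYSLOGTIMESTAMP = r"[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}"

pattern = re.compile(
    rf"(?P<timestamp>{SYSLOGTIMESTAMP}) "
    r"(?P<hostname>\S+) "            # rough SYSLOGHOST
    r"(?P<program>[\w./-]+)"         # rough PROG
    r"(?:\[(?P<pid>\d+)\])?: "       # non-capturing group makes the PID optional
    r"(?P<message>.*)"               # GREEDYDATA
)

line = ("Oct 26 10:00:00 myhost sudo[12345]: user alice accepted "
        "keyboard-interactive/pam for alice from 192.168.1.100 port 54321 ssh")
fields = pattern.match(line).groupdict()
print(fields["program"], fields["pid"])   # sudo 12345

# Because of the optional group, a line without a PID still parses:
no_pid = pattern.match("Oct 26 10:00:01 myhost cron: job started")
print(no_pid.groupdict()["pid"])          # None
```

The optional non-capturing group is what keeps the pattern robust: programs that log without a PID still match, and the pid field simply comes back empty.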

After parsing, your event might look something like this internally:

{
  "timestamp": "Oct 26 10:00:00",
  "hostname": "myhost",
  "program": "sudo",
  "pid": "12345",
  "message": "user alice accepted keyboard-interactive/pam for alice from 192.168.1.100 port 54321 ssh"
}

This is much more usable! You can now filter or aggregate based on program, hostname, or even the message content if you add more grok patterns.
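For instance, downstream of the parser you might keep only sudo events. A minimal sketch using Vector’s filter transform with a VRL condition (the only_sudo name is an assumption; grok_parser is the transform defined above):

```toml
# Keep only events whose parsed program field is "sudo"
[transforms.only_sudo]
type = "filter"
inputs = ["grok_parser"]
condition = '.program == "sudo"'
```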

For JSON parsing, it’s even simpler. If your source already emits JSON, you may only need to enable JSON decoding on the source (for example, decoding.codec = "json"). If instead a string field contains embedded JSON, a remap transform with VRL’s parse_json function unpacks it:

[transforms.json_unpacker]
type = "remap"
inputs = ["raw_json_field_source"]
source = '''
# raw_log_data is the field containing the JSON string
.raw_log_data = parse_json!(string!(.raw_log_data))
'''

This takes the string in the raw_log_data field and parses it into a nested object structure.
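Conceptually, this is the same unpacking you would do by hand. Here it is illustrated in Python (not Vector code; the event and field values are made-up):

```python
import json

# An event whose 'raw_log_data' field holds a JSON string (hypothetical data)
event = {
    "timestamp": "2023-10-26T10:00:00Z",
    "raw_log_data": '{"user": "alice", "action": "login", "success": true}',
}

# Parse the string field in place, as the transform above would
event["raw_log_data"] = json.loads(event["raw_log_data"])

print(event["raw_log_data"]["user"])     # alice
print(event["raw_log_data"]["success"])  # True
```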

The most surprising aspect of grok is its extensibility. Beyond the many built-in patterns, you can define custom pattern aliases and reference them just like the built-ins, allowing you to parse virtually any log format. In VRL this is done with parse_groks, which accepts a list of patterns plus an aliases map:

[transforms.custom_grok]
type = "remap"
inputs = ["some_source"]
source = '''
. = parse_groks!(string!(.message),
  patterns: [s'%{MY_CUSTOM_PATTERN} %{GREEDYDATA:rest_of_line}'],
  aliases: {"MY_CUSTOM_PATTERN": s'%{WORD:word} %{NUMBER:num}'})
'''
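Conceptually, a custom pattern is just substituted into the match before the regex is compiled. Here is that expansion step sketched in Python with simplified pattern bodies (illustrative only — not Vector code, and the real grok definitions are more thorough):

```python
import re

# Simplified regex bodies for a few built-in grok patterns
definitions = {
    "WORD": r"\w+",
    "NUMBER": r"\d+(?:\.\d+)?",
    "GREEDYDATA": r".*",
    # Custom alias built from other patterns, as in the transform above
    "MY_CUSTOM_PATTERN": "%{WORD:word} %{NUMBER:num}",
}

def expand(pattern: str) -> str:
    """Replace %{NAME} / %{NAME:field} references with their definitions."""
    def sub(m: re.Match) -> str:
        name, field = m.group(1), m.group(2)
        body = expand(definitions[name])  # aliases may reference other patterns
        return rf"(?P<{field}>{body})" if field else body
    return re.sub(r"%\{(\w+)(?::(\w+))?\}", sub, pattern)

regex = expand("%{MY_CUSTOM_PATTERN} %{GREEDYDATA:rest_of_line}")
m = re.match(regex, "disk 42 remaining on /dev/sda1")
print(m.groupdict())
# {'word': 'disk', 'num': '42', 'rest_of_line': 'remaining on /dev/sda1'}
```

Because expansion is recursive, an alias can itself be built from other aliases, which is what makes piece-by-piece pattern composition possible.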

This allows you to build complex parsing logic piece by piece, making Vector incredibly adaptable to diverse log sources without needing external tools for simple parsing tasks.

The next step after parsing is often enriching your data with contextual information, perhaps by looking up IP addresses in a GeoIP database.

Want structured learning?

Take the full Vector course →