Splunk’s ability to parse proprietary log formats with custom sourcetypes is surprisingly flexible, but the real magic happens when you realize you’re not just telling Splunk how to read a line, but what that line means in the context of your entire data ingestion pipeline.
Let’s see this in action. Imagine you have a custom application spitting out logs like this:
[2023-10-27 10:30:05] INFO: User 'alice' logged in from 192.168.1.100. Transaction ID: tx_12345
[2023-10-27 10:30:15] WARN: Failed login attempt for user 'bob' from 10.0.0.5. Reason: Invalid password.
[2023-10-27 10:30:20] INFO: Processing request for transaction tx_12345. Status: SUCCESS.
You want Splunk to understand these as distinct events, extract fields like user, ip_address, transaction_id, and log_level, and make them searchable.
First, you need to define a new sourcetype. This is done in a Splunk app’s props.conf file. For our example, let’s call it my_proprietary_log.
# In $SPLUNK_HOME/etc/apps/<your_app_name>/local/props.conf
[my_proprietary_log]
SHOULD_LINEMERGE = false
BREAK_ONLY_BEFORE = \[.*\]\s+(INFO|WARN|ERROR|DEBUG)
TIME_PREFIX = \[
MAX_TIMESTAMP_LOOKAHEAD = 20
TIME_FORMAT = %Y-%m-%d %H:%M:%S
REPORT-extract_fields = my_proprietary_log_kv
Let's break this down:
- [my_proprietary_log]: This stanza defines our new sourcetype.
- SHOULD_LINEMERGE = false: Each physical line of this log is a distinct event, so Splunk should not try to combine multiple lines into one logical event. With line merging off, event boundaries come from LINE_BREAKER, whose default (break on newlines) is exactly what we want here.
- BREAK_ONLY_BEFORE = \[.*\]\s+(INFO|WARN|ERROR|DEBUG): A safety net for event demarcation. The regex looks for a timestamp in square brackets, followed by whitespace, then a common log level, and tells Splunk a new event may only start before a matching line. Note that this setting is only consulted when SHOULD_LINEMERGE = true; it is kept here so event breaking stays correct if merging is ever re-enabled for this sourcetype.
- TIME_PREFIX = \[: The timestamp for the event begins immediately after an opening square bracket.
- MAX_TIMESTAMP_LOOKAHEAD = 20: Look at most 20 characters past TIME_PREFIX for the timestamp, enough to cover 2023-10-27 10:30:05 (19 characters).
- TIME_FORMAT = %Y-%m-%d %H:%M:%S: An explicit strptime-style format, so Splunk never has to guess how to interpret the timestamp.
- REPORT-extract_fields = my_proprietary_log_kv: This wires up search-time field extraction, pointing to a stanza in transforms.conf that defines how to pull out the fields. (A bare TRANSFORMS key is not valid syntax; the TRANSFORMS-<class> form exists, but it drives index-time extraction, which is rarely what you want for ordinary fields.)
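These event-breaking and timestamp settings are easy to sanity-check outside Splunk. The sketch below is not Splunk's implementation, just the same logic expressed in Python: the BREAK_ONLY_BEFORE pattern copied from above, plus a timestamp parse in the style of TIME_PREFIX and MAX_TIMESTAMP_LOOKAHEAD.

```python
import re
from datetime import datetime

# Same pattern as BREAK_ONLY_BEFORE: a new event starts at a bracketed
# timestamp followed by a log level.
BREAK_BEFORE = re.compile(r"\[.*\]\s+(INFO|WARN|ERROR|DEBUG)")

def parse_event_time(line: str) -> datetime:
    """Mimic TIME_PREFIX / MAX_TIMESTAMP_LOOKAHEAD: take up to 20 chars
    after the first '[' and parse them as a timestamp."""
    start = line.index("[") + 1          # TIME_PREFIX = \[
    candidate = line[start:start + 20]   # MAX_TIMESTAMP_LOOKAHEAD = 20
    return datetime.strptime(candidate.rstrip("]"), "%Y-%m-%d %H:%M:%S")

line = "[2023-10-27 10:30:05] INFO: User 'alice' logged in from 192.168.1.100. Transaction ID: tx_12345"
assert BREAK_BEFORE.match(line)
print(parse_event_time(line))  # 2023-10-27 10:30:05
```

If strptime raises here, MAX_TIMESTAMP_LOOKAHEAD or TIME_FORMAT would have failed in Splunk too, which is exactly the feedback you want before shipping the config.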
Now, let’s define the extraction stanza in transforms.conf:
# In $SPLUNK_HOME/etc/apps/<your_app_name>/local/transforms.conf
[my_proprietary_log_kv]
REGEX = \[[^\]]+\]\s+(?P<log_level>\w+):\s+(?=(?:.*?\b[Uu]ser\s+'(?P<user>\w+)')?)(?=(?:.*?\bfrom\s+(?P<ip_address>\d{1,3}(?:\.\d{1,3}){3}))?)(?=(?:.*?\b[Tt]ransaction(?:\s+ID:)?\s+(?P<transaction_id>\w+))?)(?P<message>.*)
This is the heart of the parsing:
- [my_proprietary_log_kv]: This stanza is referenced from props.conf.
- \[[^\]]+\]\s+: Matches past the timestamp (time extraction was already handled in props.conf).
- (?P<log_level>\w+):\s+: Captures the log level (INFO, WARN, etc.) into a field named log_level.
- The three (?=(?:...)?) constructs are optional lookaheads. Each one scans ahead for its pattern without consuming any text; if the pattern is present, the named group inside it is captured (user, ip_address, transaction_id), and if not, the lookahead matches empty and the field is simply absent. A naive chain like (?P<message>.*?)(?:User\s+'(?P<user>\w+)')?... does not work here: because every trailing group is optional, the lazy .*? matches nothing, all the optional groups are skipped, and the regex "succeeds" without extracting anything.
- [Uu]ser and [Tt]ransaction(?:\s+ID:)?: The sample logs vary their casing and phrasing ("User 'alice'" vs. "user 'bob'", "Transaction ID: tx_12345" vs. "transaction tx_12345"), so both forms are accepted.
- (?P<message>.*): Captures the entire remainder of the line into message. Since the lookaheads are zero-width, message keeps the full human-readable text.
One thing is deliberately absent: DELIMS. In transforms.conf, DELIMS (paired with FIELDS) and REGEX are alternative extraction mechanisms, so a regex-based transform should not set DELIMS at all.
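It pays to test an extraction regex against the sample lines before deploying it. Splunk itself uses PCRE, but Python's re module supports the same constructs, so a quick offline check with a lookahead-based variant of the regex (optional fields can appear in any combination) looks like this:

```python
import re

# Lookahead-based extraction: each optional field is found (or not) without
# consuming text, then 'message' captures the whole remainder of the line.
PATTERN = re.compile(
    r"\[[^\]]+\]\s+(?P<log_level>\w+):\s+"
    r"(?=(?:.*?\b[Uu]ser\s+'(?P<user>\w+)')?)"
    r"(?=(?:.*?\bfrom\s+(?P<ip_address>\d{1,3}(?:\.\d{1,3}){3}))?)"
    r"(?=(?:.*?\b[Tt]ransaction(?:\s+ID:)?\s+(?P<transaction_id>\w+))?)"
    r"(?P<message>.*)"
)

lines = [
    "[2023-10-27 10:30:05] INFO: User 'alice' logged in from 192.168.1.100. Transaction ID: tx_12345",
    "[2023-10-27 10:30:15] WARN: Failed login attempt for user 'bob' from 10.0.0.5. Reason: Invalid password.",
    "[2023-10-27 10:30:20] INFO: Processing request for transaction tx_12345. Status: SUCCESS.",
]

for line in lines:
    match = PATTERN.search(line)
    # Print only the fields that were actually present on this line.
    print({k: v for k, v in match.groupdict().items() if v is not None})
```

If the regex fails to match a line, match is None and this script crashes; in Splunk the event would just silently lack fields, so a loud failure in testing is the better outcome.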
The power here is in how Splunk divides the work. At index time, the props.conf settings (TIME_PREFIX and the line-breaking attributes) segment the incoming stream into events and extract each event's timestamp. The field extraction defined in transforms.conf is applied later, at search time, to every event of this sourcetype. This separation of concerns keeps each piece manageable, and because search-time extractions are evaluated on the fly, you can refine the regex without re-indexing any data.
After these configurations are in place and your Splunk forwarders are configured to send these logs with sourcetype=my_proprietary_log, you can search:
index=<your_index> sourcetype=my_proprietary_log user="alice"
And Splunk will return the relevant events with user, log_level, ip_address, and transaction_id as searchable fields.
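Conceptually, that search is just a filter over the extracted fields. As a toy illustration only (hypothetical events shaped like Splunk's output, not how SPL actually executes):

```python
# Hypothetical extracted events, shaped like the fields Splunk would return.
events = [
    {"log_level": "INFO", "user": "alice", "ip_address": "192.168.1.100", "transaction_id": "tx_12345"},
    {"log_level": "WARN", "user": "bob", "ip_address": "10.0.0.5", "transaction_id": None},
    {"log_level": "INFO", "user": None, "ip_address": None, "transaction_id": "tx_12345"},
]

# Rough equivalent of: sourcetype=my_proprietary_log user="alice"
alice_events = [e for e in events if e["user"] == "alice"]
print(len(alice_events))  # 1
```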
The one thing most people don’t immediately grasp is how BREAK_ONLY_BEFORE interacts with SHOULD_LINEMERGE. BREAK_ONLY_BEFORE is a line-merging attribute: it only takes effect when SHOULD_LINEMERGE = true. In that mode, Splunk first splits the incoming stream on LINE_BREAKER and then merges consecutive lines back into events, starting a new event only before a line that matches the pattern. With SHOULD_LINEMERGE = false, the merging phase is skipped entirely and BREAK_ONLY_BEFORE is ignored; event boundaries come from LINE_BREAKER alone, which by default breaks on every newline. For a strictly one-event-per-line format like ours, false is both correct and cheaper, since Splunk skips the merge logic altogether.
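The difference between the two modes is easiest to see in a toy model. This is a deliberate simplification of Splunk's event breaking, not its actual code, and the continuation line below is invented sample data:

```python
import re

BREAK_BEFORE = re.compile(r"\[.*\]\s+(INFO|WARN|ERROR|DEBUG)")

def break_events(lines, should_linemerge):
    """Toy model of Splunk event breaking.
    False: every line is its own event (LINE_BREAKER only).
    True: non-matching lines are merged into the previous event, and a
    new event starts only before a BREAK_ONLY_BEFORE hit."""
    if not should_linemerge:
        return [[line] for line in lines]
    events = []
    for line in lines:
        if BREAK_BEFORE.match(line) or not events:
            events.append([line])
        else:
            events[-1].append(line)
    return events

raw = [
    "[2023-10-27 10:30:15] WARN: Failed login attempt for user 'bob' from 10.0.0.5.",
    "    caused by: upstream timeout",  # hypothetical continuation line, no timestamp
    "[2023-10-27 10:30:20] INFO: Processing request for transaction tx_12345.",
]

print(len(break_events(raw, should_linemerge=False)))  # 3: one event per line
print(len(break_events(raw, should_linemerge=True)))   # 2: continuation merged up
```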
The next hurdle you’ll likely face is handling more complex, nested, or multi-line proprietary formats where event boundaries aren’t so clean, or when you need to normalize data across multiple, slightly different proprietary log sources.