mvexpand and the string-joining step this article calls mkstring (in stock SPL, eval’s mvjoin function or the nomv command) both deal with multivalue fields, but they work in opposite directions, which often trips people up.
Let’s see mvexpand in action. Imagine you have logs where a single event can list multiple IP addresses, like this:
2023-10-27 10:00:01,123 INFO request_id=abc123 user=alice ip_addresses="192.168.1.10,192.168.1.11,192.168.1.12"
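By default, Splunk ingests ip_addresses as a single string, and mvexpand does nothing to a field that holds only one value. Split the string into a multivalue field with makemv first, then expand it:
... | makemv delim="," ip_addresses | mvexpand ip_addresses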
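This transforms the single event into three separate results, each carrying one value of ip_addresses (the raw event text itself is unchanged; it is rewritten here only to show the field value each result holds):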
2023-10-27 10:00:01,123 INFO request_id=abc123 user=alice ip_addresses="192.168.1.10"
2023-10-27 10:00:01,123 INFO request_id=abc123 user=alice ip_addresses="192.168.1.11"
2023-10-27 10:00:01,123 INFO request_id=abc123 user=alice ip_addresses="192.168.1.12"
Notice how mvexpand duplicates the original event once per value, so every other field (request_id, user) is copied onto each new row. This is useful whenever downstream commands need exactly one row per IP, such as a join or lookup keyed on a single IP, or per-IP calculations that combine the IP with other fields.
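To try this without indexed data, the same steps can be sketched with makeresults, which generates one empty result to build on. The field values are taken from the sample event above; the sketch assumes ip_addresses arrives as a comma-separated string, which is why makemv runs before mvexpand.

```spl
| makeresults
| eval request_id="abc123", user="alice", ip_addresses="192.168.1.10,192.168.1.11,192.168.1.12"
| makemv delim="," ip_addresses
| mvexpand ip_addresses
| table request_id user ip_addresses
```

Removing the mvexpand line and rerunning should show the difference: one row with a three-value field instead of one row per IP.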
Now, consider the opposite problem: you have events where a single field might contain multiple values, but you want to aggregate them into a single string for, say, display or a specific type of search. This is where mkstring comes in.
Suppose you have search results where each row represents a different error code, but you want to see all error codes associated with a specific request on a single line:
request_id=xyz789 error_code="ERR_AUTH_FAIL"
request_id=xyz789 error_code="ERR_DB_CONN"
request_id=xyz789 error_code="ERR_TIMEOUT"
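If you ran stats values(error_code) as error_code by request_id, you’d get a single row whose error_code field holds all three values (shown comma-separated here only for compactness):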
request_id=xyz789 error_code="ERR_AUTH_FAIL,ERR_DB_CONN,ERR_TIMEOUT"
This is a multivalue field, not a string. Stock SPL has no command named mkstring; to turn the list into a single string, join the values with eval’s mvjoin function:
... | stats values(error_code) as all_errors by request_id | eval all_errors=mvjoin(all_errors, ",")
This would produce a single event for request_id=xyz789 where all_errors is a single string:
request_id=xyz789 all_errors="ERR_AUTH_FAIL,ERR_DB_CONN,ERR_TIMEOUT"
mvjoin has no default delimiter; it takes the delimiter as its second argument, so you can join on something other than a comma:
... | stats values(error_code) as all_errors by request_id | eval all_errors=mvjoin(all_errors, " | ")
This would result in:
request_id=xyz789 all_errors="ERR_AUTH_FAIL | ERR_DB_CONN | ERR_TIMEOUT"
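The aggregate-then-join pattern can be reproduced end to end with generated data. This is a sketch using eval’s mvjoin function, the stock SPL way to join a multivalue field into one string; the helper field n is introduced only to assign a different error code to each generated row.

```spl
| makeresults count=3
| streamstats count as n
| eval request_id="xyz789"
| eval error_code=case(n==1, "ERR_AUTH_FAIL", n==2, "ERR_DB_CONN", n==3, "ERR_TIMEOUT")
| stats values(error_code) as all_errors by request_id
| eval all_errors=mvjoin(all_errors, " | ")
```

Note that stats values() returns the distinct values in sorted order, so the joined string follows that order rather than the order in which the events arrived.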
The fundamental difference: mvexpand takes a single event and multiplies it, producing one copy per value of a multivalue field. Joining with mvjoin (the mkstring step here) works the other way, and only inside one event: it collapses a multivalue field into a single string value. Collapsing multiple events into one is a separate step, done beforehand by stats values().
When Splunk knows a field is multivalue (for example, from makemv, from stats values(), or from a JSON array, which field extraction surfaces under a name like field{}), it keeps the values as a list on the event until you explicitly expand or join them. mvexpand itself does not create multivalue fields; each row it emits carries a single value.
The surprise is that joining doesn’t just change how the values are displayed; it changes the field from a multivalue list to a single string. After mvjoin (or nomv), running mvexpand on that field has no effect, because there is only one value left to expand. To expand it again, split the string back into a list first, for example with makemv or eval’s split function.
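One way to see the type change is to count the values before and after joining. This is a sketch with generated data; values_before and values_after are names made up for this example, and mvcount is expected to report 3 for the list and 1 for the joined string.

```spl
| makeresults
| eval all_errors=split("ERR_AUTH_FAIL,ERR_DB_CONN,ERR_TIMEOUT", ",")
| eval values_before=mvcount(all_errors)
| eval all_errors=mvjoin(all_errors, ",")
| eval values_after=mvcount(all_errors)
| eval all_errors=split(all_errors, ",")
| mvexpand all_errors
| table all_errors values_before values_after
```

The second split call is what makes the final mvexpand produce one row per error code; without it, the joined string would pass through mvexpand as a single row.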
If you’re struggling with performance when using mvexpand, especially on large datasets, consider if you can filter your data before expanding. Expanding early and then filtering can create a massive number of events that Splunk has to process unnecessarily.
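As a rough sketch of filtering before expanding, the search below narrows the events and drops unneeded fields first, so mvexpand only copies what is still required. The index and sourcetype names (web, app_logs) are hypothetical placeholders, and the wildcard match assumes ip_addresses is extracted as a comma-separated string.

```spl
index=web sourcetype=app_logs user=alice ip_addresses="*192.168.1.11*"
| fields _time request_id user ip_addresses
| makemv delim="," ip_addresses
| mvexpand ip_addresses
| where ip_addresses="192.168.1.11"
```

The final where clause is still needed because each matching event expands into all of its IPs, not just the one that matched the base search.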
Ultimately, understanding that mvexpand is about event multiplication and mkstring is about string aggregation is the key to mastering them.
The next concept you’ll likely grapple with is how these interact with statistical commands like stats and dedup, and the performance implications of each.