Splunk forwarders don’t consider data safely delivered until they’ve received an acknowledgment from the indexer, and that mechanism is surprisingly fragile: depending on how it’s configured, a failure can mean silently lost events or silently duplicated ones.
Let’s watch an indexer receive data and then simulate a failure.
# On the indexer, watching connection metrics for incoming forwarder traffic
tail -f /opt/splunk/var/log/splunk/metrics.log | grep "tcpin_connections"
# On a forwarder, sending a small test event
/opt/splunk/bin/splunk add oneshot /tmp/test.log -index main -source test_ack -sourcetype test_ack
# (assuming /tmp/test.log contains "This is a test event for ack.")
You’ll see periodic group=tcpin_connections lines in metrics.log reporting throughput and the source host for each forwarder connection. This is the indexer’s view of the incoming stream. Now, let’s break it.
# On the forwarder, stopping the Splunk process *before* it receives the ack
sudo systemctl stop splunk
If you search for that "test event" now, what you find depends on timing. If the forwarder was stopped after transmitting the data but before receiving the acknowledgment, the event is sitting on the indexer even though the forwarder doesn’t know it. Without that ack, the forwarder assumes the data wasn’t saved and will re-send it from its queue when it restarts, indexing the event a second time.
The core problem is that Splunk’s indexer acknowledgment (useACK = true in outputs.conf; it is not enabled by default) is an application-level ack sent from the indexer back to the forwarder, separate from TCP’s own transport-layer acknowledgments. If that ack doesn’t make it back (network blip, forwarder crash, indexer crash after receiving but before acking), the forwarder thinks the data is lost. The indexer, however, did have the data and may already have written it before the ack could be sent. This is a race condition: the forwarder re-sends, and the same events get indexed twice.
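To make that concrete, here is a minimal sketch of enabling indexer acknowledgment on the forwarder; the group name and indexer address are placeholders, not values from any real deployment:

```ini
# outputs.conf on the forwarder (group name and server are examples)
[tcpout]
defaultGroup = primary_indexers

[tcpout:primary_indexers]
server = idx01.example.com:9997
# Application-level indexer acknowledgment; off by default
useACK = true
```

With useACK enabled, the forwarder holds events in a wait queue until the indexer confirms they have been written to the file system, which trades the possibility of silent loss for the possibility of duplicates.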
Here are the common failure points and how to fix them:
- Network Interruption: The most common culprit. A momentary network blip between the forwarder and indexer can prevent the acknowledgment from returning.
  - Diagnosis: Monitor network connectivity between forwarders and indexers using `ping` and `traceroute`. Check Splunk’s internal logs (`splunkd.log` on both forwarder and indexer) for connection errors.
  - Fix: Ensure robust network infrastructure. For transient issues, Splunk’s internal retry mechanisms in `outputs.conf` can help, but persistent loss requires network stability.
  - Why it works: A stable network lets the acknowledgment make it back to the forwarder, allowing it to mark the data as successfully sent and clear its queue.
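As a sketch of that diagnosis step, the following greps a splunkd.log for connection trouble from the output processor. The log excerpt is fabricated for the example, and the exact message wording is an assumption modeled on typical TcpOutputProc output:

```shell
# Fabricated splunkd.log excerpt standing in for the forwarder's real log
cat > /tmp/splunkd_sample.log <<'EOF'
01-02-2024 10:00:01.000 +0000 WARN  TcpOutputProc - Cooked connection to ip=10.0.0.5:9997 timed out
01-02-2024 10:00:31.000 +0000 INFO  TcpOutputProc - Connected to idx=10.0.0.5:9997
EOF
# Count connection errors from the output processor
grep "TcpOutputProc" /tmp/splunkd_sample.log | grep -c "timed out"
```

On a real forwarder you would point the grep at `$SPLUNK_HOME/var/log/splunk/splunkd.log` and tail it live.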
- Forwarder Crash or Restart Before Ack: As demonstrated above, if the forwarder dies or restarts before receiving the ack, it will try to re-send data it thinks is lost.
  - Diagnosis: Check the forwarder’s `splunkd.log` for crash messages or restarts. Look for entries indicating it is re-sending unacknowledged data from its output queues.
  - Fix: Ensure forwarders are stable. If restarts are unavoidable, size the output queue in `outputs.conf` generously so unacknowledged data survives in the wait queue.
  - Why it works: By ensuring the forwarder process stays alive long enough to get the ack, it knows the data is safe on the indexer; a well-sized wait queue bounds how much data a restart can cause it to re-send.
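A sketch of the queue sizing just described; the group name, address, and size are illustrative values, not recommendations for any particular workload:

```ini
# outputs.conf on the forwarder (values are illustrative)
[tcpout:primary_indexers]
server = idx01.example.com:9997
useACK = true
# In-memory output queue; with useACK enabled, the wait queue
# that holds unacknowledged data is sized relative to this value
maxQueueSize = 10MB
```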
- Indexer Crash or Restart After Receiving but Before Ack: This is the trickiest. The indexer gets the data, writes it to disk (but perhaps not durably yet), and then crashes before it can send the ack.
  - Diagnosis: Examine the indexer’s `splunkd.log` for crashes or unexpected shutdowns. Check the OS’s system logs for disk I/O errors or OOM-killer activity.
  - Fix: Enable `useACK = true` in `outputs.conf` on the forwarder, and keep the indexer’s disks healthy and unsaturated so acks go out promptly.
  - Why it works: With indexer acknowledgment, the indexer only acks data after writing it to the file system, and the forwarder keeps unacknowledged data in its wait queue. A crash inside this window means the forwarder re-sends, so the failure mode becomes duplicate events rather than silent loss.
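The failure mode here is at-least-once delivery: a crash between persistence and ack means the batch arrives twice. A small sketch with fabricated events shows how to spot the resulting duplicates:

```shell
# Fabricated events: the second id=42 line was re-sent after a lost ack
cat > /tmp/events.log <<'EOF'
2024-01-02T10:00:00 id=41 msg="checkout complete"
2024-01-02T10:00:01 id=42 msg="payment settled"
2024-01-02T10:00:01 id=42 msg="payment settled"
EOF
# Print each event line that appears more than once
sort /tmp/events.log | uniq -d
```

Inside Splunk itself, the analogous check is a search along the lines of `... | stats count by _raw | where count > 1`.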
- Throttling on the Indexer: If the indexer is overloaded (high CPU, disk I/O, or network saturation), it might be slow to process incoming data and send acks.
  - Diagnosis: Monitor indexer performance (`splunkd.log`, OS tools like `top`, `iostat`, `netstat`). Look for `blocked=true` on queue lines in `metrics.log`.
  - Fix: Scale up indexer resources (CPU, RAM, disk I/O). Optimize indexing performance by tuning `indexes.conf` (e.g., `maxHotBuckets`, `maxConcurrentOptimizes`), or distribute the load across more indexers.
  - Why it works: A healthy, responsive indexer can process incoming batches quickly and send acks back to forwarders without delay, preventing queue buildup on the forwarder and potential data loss.
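A sketch of that metrics.log check; the excerpt is fabricated, but it follows the `group=queue` line format the indexer emits, where `blocked=true` means the queue stopped accepting data:

```shell
# Fabricated metrics.log excerpt from an overloaded indexer
cat > /tmp/metrics_sample.log <<'EOF'
01-02-2024 10:00:31.000 +0000 INFO Metrics - group=queue, name=indexqueue, blocked=true, max_size_kb=500, current_size_kb=500
01-02-2024 10:01:02.000 +0000 INFO Metrics - group=queue, name=indexqueue, max_size_kb=500, current_size_kb=120
EOF
# How often was the index queue blocked in this window?
grep -c "blocked=true" /tmp/metrics_sample.log
```

A steadily climbing count here usually means the indexer, not the network, is the bottleneck.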
- `tcpout` Configuration Issues: Incorrect settings in `outputs.conf` on the forwarder can lead to problems.
  - Diagnosis: Review `outputs.conf` on the forwarder. Common issues include incorrect `server` entries, a missing `useACK = true` (acknowledgment is off by default), or `sendCookedData` set to `false` (cooked data is required for acking).
  - Fix: Ensure `outputs.conf` has a valid `[tcpout]` stanza pointing to your indexers, with `useACK = true` and `sendCookedData = true` (or omitted, as `true` is the default).
  - Why it works: Correct `outputs.conf` settings ensure the forwarder actually uses the acknowledgment mechanism and sends data to the correct destination.
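One quick sanity check is grepping the effective settings. The stanza below is a fabricated example; on a live forwarder you would inspect the merged configuration with `splunk btool outputs list --debug` instead of a flat file:

```shell
# Fabricated outputs.conf; on a real forwarder, prefer:
#   $SPLUNK_HOME/bin/splunk btool outputs list --debug
cat > /tmp/outputs.conf <<'EOF'
[tcpout]
defaultGroup = primary_indexers

[tcpout:primary_indexers]
server = idx01.example.com:9997
useACK = true
EOF
# Verify acknowledgment is actually enabled
grep -c "useACK = true" /tmp/outputs.conf
```

btool shows which file each setting ultimately comes from, which catches the classic case of an app-level config silently overriding yours.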
- Heartbeat Disabled or Too Infrequent: While not directly about acks, the forwarder’s heartbeat affects forwarder-indexer communication health.
  - Diagnosis: Check `outputs.conf` on the forwarder for an unusually large `heartbeatFrequency`, or for `sendCookedData = false` (the heartbeat only applies to cooked connections).
  - Fix: Keep the default `heartbeatFrequency = 30` (seconds) or lower.
  - Why it works: The periodic heartbeat lets the forwarder detect a dead or unresponsive connection sooner, so it can fail over to another indexer instead of pushing data at a dead endpoint.
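A sketch of the relevant settings; the group name and address are placeholders, and the values shown are the documented defaults rather than tuning advice:

```ini
# outputs.conf on the forwarder (placeholders; values are the defaults)
[tcpout:primary_indexers]
server = idx01.example.com:9997
# Heartbeats are only sent on cooked-data connections
sendCookedData = true
# Seconds between forwarder heartbeats to the indexer
heartbeatFrequency = 30
```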
If you’ve fixed all these, the next thing you’ll likely encounter is the forwarder’s output queue blocking: splunkd.log on the forwarder fills with warnings that forwarding to your output group "has been blocked" for some number of seconds while the indexer is still processing or refusing data, which points to a deeper indexing performance bottleneck.
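To watch for that condition, a grep like the following works. The log line is fabricated for the example and its exact wording is an assumption modeled on the TcpOutputProc blocked-forwarding warning:

```shell
# Fabricated splunkd.log line from a forwarder with a blocked output queue
cat > /tmp/fwd_splunkd.log <<'EOF'
01-02-2024 10:05:00.000 +0000 WARN  TcpOutputProc - Forwarding to output group primary_indexers has been blocked for 100 seconds.
EOF
grep -c "has been blocked" /tmp/fwd_splunkd.log
```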