Splunk forwarders don’t consider data safely delivered until they’ve received an acknowledgment from the indexer, a mechanism (useACK) that is more fragile than it looks and whose failure modes can lead to silent data loss.

Let’s watch an indexer receive data and then simulate a failure.

# On the indexer, watching incoming-data metrics from forwarders
tail -f /opt/splunk/var/log/splunk/metrics.log | grep "group=tcpin_connections"

# On a forwarder, sending a small test event
/opt/splunk/bin/splunk add oneshot /tmp/test.log --index=main --source=test_ack --sourcetype=test_ack
# (assuming /tmp/test.log contains "This is a test event for ack.")

You’ll see periodic Metrics - group=tcpin_connections lines reporting the forwarder’s connection and the kilobytes received on tcp:9997. This is the indexer’s own accounting of the data arriving. Now, let’s break it.

# On the forwarder, stopping the Splunk process *before* the ack comes back
sudo systemctl stop splunk

Now search for the test event in Splunk. If the indexer persisted it before the forwarder went down, the event is there, but the forwarder never received the acknowledgment saying so. Without that ack, the forwarder assumes the data wasn’t saved and will re-send it from its wait queue when it restarts, giving you a duplicate. And if the indexer hadn’t persisted it yet, that same retry is the only thing standing between you and data loss.

The core problem is that indexer acknowledgment (useACK in outputs.conf, off by default) is an application-level handshake layered on top of TCP. Without it, the forwarder trusts TCP alone: the indexer’s OS can accept bytes into its receive buffer, acknowledge them at the TCP level, and then crash before the data is ever indexed, and the forwarder has no way to know. That is silent data loss. With useACK, the indexer acknowledges data only after writing it, but if that ack doesn’t make it back (network blip, forwarder crash, indexer crash after writing but before acking), the forwarder assumes the data is lost and re-sends it from its wait queue. The indexer, however, may already have the data. This race condition trades silent loss for possible duplicates.
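Indexer acknowledgment is switched on per output group with useACK in outputs.conf on the forwarder. A minimal sketch (the hostnames and group name are placeholders, not values from this setup):

```ini
# outputs.conf on the forwarder (illustrative values)
[tcpout]
defaultGroup = primary_indexers

[tcpout:primary_indexers]
server = idx1.example.com:9997, idx2.example.com:9997
# Application-level acknowledgment: the forwarder holds data in its
# wait queue until the indexer confirms the data has been written.
useACK = true
```

With useACK enabled, the failure mode shifts from silent loss to duplicates, which you can at least detect and deduplicate at search time.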

Here are the common failure points and how to fix them:

  1. Network Interruption: The most common culprit. A momentary network blip between the forwarder and indexer can prevent the acknowledgment from returning.

    • Diagnosis: Monitor network connectivity between forwarders and indexers using ping and traceroute. Check Splunk’s internal logs (splunkd.log on both forwarder and indexer) for connection errors.
    • Fix: Ensure robust network infrastructure. For transient issues, Splunk’s internal retry mechanisms in outputs.conf can help, but persistent loss requires network stability.
    • Why it works: A stable network ensures the indexer’s acknowledgment makes it back to the forwarder, allowing it to mark the data as delivered and clear its wait queue.
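The splunkd.log side of that diagnosis can be scripted. The log lines below are hypothetical samples in the shape the forwarder’s TcpOutputProc component emits, run through the same grep you would point at the real file:

```shell
# Hypothetical splunkd.log lines standing in for the real file at
# /opt/splunk/var/log/splunk/splunkd.log on the forwarder.
lines='01-01-2024 12:00:01 WARN  TcpOutputProc - Cooked connection to ip=10.0.0.5:9997 timed out
01-01-2024 12:00:31 INFO  TcpOutputProc - Connected to idx=10.0.0.5:9997'

# Same grep you would run against the real log: count output-channel events.
matches=$(printf '%s\n' "$lines" | grep -c 'TcpOutputProc')
echo "$matches"   # prints 2
```

A spike of timeout lines followed by reconnects is the signature of a flaky link between forwarder and indexer.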
  2. Forwarder Crash or Restart Before Ack: As demonstrated above, if the forwarder dies or restarts before receiving the ack, it will try to re-send data it thinks is lost.

    • Diagnosis: Check the forwarder’s splunkd.log for crash messages or restarts. Look for entries indicating it’s re-sending data from its output and wait queues.
    • Fix: Ensure forwarders are stable. If restarts are unavoidable, rely on useACK (monitored files are re-read from the last acknowledged checkpoint after a restart), size the output queue generously with maxQueueSize in outputs.conf, and protect network inputs with a persistent queue.
    • Why it works: By ensuring the forwarder process stays alive long enough to get the ack, it knows the data is safe on the indexer.
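A sketch of that queue sizing in outputs.conf. The values are illustrative, not a recommendation; with useACK enabled, the in-memory wait queue that holds unacknowledged data is sized relative to maxQueueSize:

```ini
# outputs.conf on the forwarder (illustrative sizing)
[tcpout:primary_indexers]
server = idx1.example.com:9997
useACK = true
# In-memory output queue; with useACK, the wait queue holding
# unacknowledged data is sized as a multiple of this value.
maxQueueSize = 10MB
```

A larger queue buys the forwarder more time to ride out an indexer restart without dropping or blocking its inputs.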
  3. Indexer Crash or Restart After Receiving but Before Ack: This is the trickiest. The indexer gets the data, writes it to disk (but perhaps not durably yet), and then crashes before it can send the ACK.

    • Diagnosis: Examine the indexer’s splunkd.log for crashes or unexpected shutdowns. Check the OS’s system logs for disk I/O errors or OOM killer activity.
    • Fix: Lower heartbeatFrequency in outputs.conf on the forwarder (the default is 30 seconds) so stalled connections are noticed sooner, and make sure useACK = true so unacknowledged data stays in the forwarder’s wait queue until it can be re-sent. Also, ensure indexer disks are healthy and not saturated.
    • Why it works: The forwarder holds data until the indexer acknowledges it as written. If the indexer crashes in the window between receiving and acking, the forwarder re-sends that data once the connection comes back, so the crash costs you at worst a duplicate rather than a lost event.
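The heartbeat cadence lives in outputs.conf on the forwarder as heartbeatFrequency (30 seconds by default). An illustrative fragment:

```ini
# outputs.conf on the forwarder (illustrative)
[tcpout:primary_indexers]
server = idx1.example.com:9997
useACK = true
# Send a heartbeat every 5 seconds instead of the default 30 so a
# dead indexer connection is noticed sooner.
heartbeatFrequency = 5
```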
  4. Throttling on the Indexer: If the indexer is overloaded (high CPU, disk I/O, or network saturation), it might be slow to process incoming data and send ACKs.

    • Diagnosis: Monitor indexer performance metrics (splunkd.log for throttling messages, OS tools like top, iostat, netstat). Look for blocked=true on queue lines in metrics.log.
    • Fix: Scale up indexer resources (CPU, RAM, disk I/O). Optimize indexing performance by tuning indexes.conf (e.g., maxHotBuckets, maxConcurrentOptimizes) or distribute the load across more indexers.
    • Why it works: A healthy, responsive indexer can process incoming data batches quickly and send ACKs back to forwarders without delay, preventing queue buildup and potential data loss.
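An illustrative indexes.conf fragment along those lines. The values are placeholders to tune against your own hardware, not recommendations:

```ini
# indexes.conf on the indexer (illustrative values)
[main]
# Allow more simultaneous hot buckets so bursts across many
# sources don't force premature bucket rolls.
maxHotBuckets = 10
# Cap concurrent tsidx optimize jobs competing for disk I/O.
maxConcurrentOptimizes = 6
```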
  5. tcpout Configuration Issues: Incorrect settings in outputs.conf on the forwarder can lead to problems.

    • Diagnosis: Review outputs.conf on the forwarder. Common issues include incorrect server entries, useACK left at its default of false (in which case there is no application-level acknowledgment at all), or sendCookedData disabled.
    • Fix: Ensure outputs.conf has a valid [tcpout] stanza pointing to your indexers with useACK = true, and that sendCookedData = true is set (or omitted, as it’s the default).
    • Why it works: Correct outputs.conf settings ensure the forwarder actually uses the acknowledgment mechanism and sends data to the right destination.
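A quick scripted version of that review, using an inline fragment as a stand-in for the real file:

```shell
# Sanity-check an outputs.conf fragment: does it name a server and
# enable useACK? The fragment below is a stand-in for the real file
# at /opt/splunk/etc/system/local/outputs.conf.
conf='[tcpout:primary_indexers]
server = idx1.example.com:9997
useACK = true'

if printf '%s\n' "$conf" | grep -q '^server *=' &&
   printf '%s\n' "$conf" | grep -q '^useACK *= *true'; then
  status=ok
else
  status=missing
fi
echo "$status"   # prints ok
```

In practice, /opt/splunk/bin/splunk btool outputs list --debug shows the merged configuration across all conf layers, which is more trustworthy than grepping a single file.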
  6. auto_liveness_check Disabled: While not directly about ACKs, this impacts forwarder-indexer communication health.

    • Diagnosis: Check outputs.conf on the forwarder for auto_liveness_check = false.
    • Fix: Set auto_liveness_check = true (it’s the default).
    • Why it works: This setting causes the forwarder to periodically check if the indexer is alive and responsive, which helps it detect connectivity issues sooner and potentially avoid sending data to a dead endpoint.

If you’ve fixed all these, the next thing you’ll likely encounter is a too many HUPs error in splunkd.log on the forwarder if it’s repeatedly trying to send data that the indexer is still processing or refusing, indicating a deeper indexing performance bottleneck.

Want structured learning?

Take the full Splunk course →