A peer node in a Splunk indexer cluster failed to replicate buckets, causing data distribution to become uneven across peers.

Common Causes and Fixes

  1. Network Connectivity Issues: The most frequent culprit is intermittent or blocked network traffic between indexers, preventing replication.

    • Diagnosis: On the indexer that’s not receiving buckets, run netstat -anp | grep <peer_ip>:<replication_port> to see if connections are established (note that the replication port is configured separately on each peer; 8089 is the management port, not the replication port). On the indexer that should be sending, check sudo tcpdump -i <interface> host <peer_ip> and port <replication_port>.
    • Fix: Ensure firewalls (e.g., sudo ufw status, sudo iptables -L) on both the sending and receiving indexers allow traffic on the configured replication port between cluster peers. If using cloud security groups, verify inbound and outbound rules.
    • Why it works: Replication relies on direct TCP connections between indexers. If the network path is blocked, buckets cannot be transferred.
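The connectivity checks above can be scripted into a quick probe. This is a minimal sketch, not a definitive tool: the peer host and port below are placeholders for your own values, and it assumes bash plus the coreutils timeout command.

```shell
#!/usr/bin/env bash
# Quick TCP reachability probe for a peer's replication port.
# PEER_HOST and REPL_PORT are illustrative placeholders.
check_port() {
  # bash's /dev/tcp pseudo-device attempts a plain TCP connect,
  # which is enough to prove the network path is open
  timeout 5 bash -c "exec 3<>/dev/tcp/${1}/${2}" 2>/dev/null
}

if check_port "${PEER_HOST:-idx2.example.com}" "${REPL_PORT:-9887}"; then
  echo "replication port reachable"
else
  echo "replication port NOT reachable -- check firewall rules on both ends"
fi
```

If the probe fails while splunkd is listening on the port, the problem is almost certainly in the network path rather than in Splunk itself.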
  2. Insufficient Disk Space on Receiving Indexer: If the indexer that’s supposed to receive buckets is out of disk space, replication will fail.

    • Diagnosis: Run df -h on the receiving indexer. Look for the partition hosting $SPLUNK_HOME/var/lib/splunk at or near 100% usage.
    • Fix: Free up space by archiving or freezing old buckets, or add more storage. To relocate index data, avoid editing indexes.conf directly on the peer, since files under $SPLUNK_HOME/etc/slave-apps are overwritten on the next bundle push. Instead, change the paths in indexes.conf under the master’s master-apps/_cluster/local directory and distribute it with splunk apply cluster-bundle. Stop Splunk on the peer, move the buckets to the new partition, then restart.
    • Why it works: Splunk needs free disk space to write new bucket data during replication.
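The df check can be wrapped in a small script that flags a nearly full data partition. A sketch under stated assumptions: the SPLUNK_DATA path and the 90% threshold are illustrative, and it assumes POSIX df and awk.

```shell
#!/usr/bin/env bash
# Report how full the partition backing Splunk's data directory is.
# SPLUNK_DATA is an assumed path -- point it at $SPLUNK_HOME/var/lib/splunk.
disk_pct() {
  # -P forces POSIX single-line output so awk sees stable columns;
  # column 5 is the "Use%" figure, stripped of its % sign
  df -P "$1" | awk 'NR==2 { gsub(/%/, "", $5); print $5 }'
}

pct=$(disk_pct "${SPLUNK_DATA:-/}")
if [ "$pct" -ge 90 ]; then
  echo "WARNING: data partition at ${pct}% used -- replication may be failing"
else
  echo "data partition at ${pct}% used"
fi
```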
  3. replication_port Mismatch in server.conf: A simple typo or incorrect configuration in server.conf can prevent indexers from finding each other for replication.

    • Diagnosis: On each indexer, check $SPLUNK_HOME/etc/system/local/server.conf (or $SPLUNK_HOME/etc/apps/<app>/local/server.conf) for the [replication_port://<port>] stanza. Each peer advertises this port to the cluster master, so verify the configured port matches the port splunkd is actually listening on: sudo ss -tulnp | grep splunkd.
    • Fix: Edit server.conf on each indexer so the declared replication port is correct. The port is declared as a stanza, not a key under [general]. For example:
      [replication_port://9887]
      
      The port need not be identical on every peer, but keeping it uniform simplifies firewall rules. Restart Splunk on all affected peers.
    • Why it works: This is the port a peer listens on to receive replicated bucket data from other peers (it is separate from the 8089 management port). If the configured port doesn’t match what splunkd is listening on, or a firewall blocks it, the connection can’t be established.
  4. Clustering Configuration Errors in server.conf: Peers don’t list one another directly; each peer’s [clustering] stanza points at the cluster master (called the manager node in newer releases), which assigns replication targets. An incorrect master URI or a mismatched pass4SymmKey leaves a peer unable to join the cluster and receive buckets.

    • Diagnosis: Review $SPLUNK_HOME/etc/system/local/server.conf on each indexer. In the [clustering] stanza, confirm the mode is set to the peer role and that master_uri (manager_uri on newer releases) points at the master’s hostname and management port (8089 by default). Check splunkd.log for connection or authentication errors to the master.
    • Fix: Correct the host or port in the URI and make sure pass4SymmKey matches the value configured on the master. Restart Splunk on affected peers.
    • Why it works: The master coordinates which peers hold which bucket copies; a peer that can’t reach or authenticate to it is excluded from replication.
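For reference, a peer’s clustering-related server.conf typically looks like the sketch below. All values are illustrative assumptions for your environment; newer Splunk releases accept manager_uri with mode = peer, while older ones use master_uri with mode = slave.

```ini
# server.conf on an indexer cluster peer (illustrative values)

# Port this peer listens on for incoming bucket replication
[replication_port://9887]

[clustering]
mode = peer
manager_uri = https://cm.example.com:8089
pass4SymmKey = changeme
```

After editing, restart splunkd on the peer so the settings take effect.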
  5. Indexer Resource Saturation (CPU/Memory): An overloaded indexer may be too slow to respond to replication requests or to process incoming data, causing timeouts and missed buckets.

    • Diagnosis: Use top or htop to monitor CPU and memory usage on the indexers. Look for Splunk processes consuming consistently high resources (e.g., splunkd processes). Check Splunk’s internal logs (_internal index) for errors related to replication or high latency.
    • Fix: Optimize Splunk configurations (e.g., review the maxKBps thruput limit in limits.conf, or add parallelIngestionPipelines in server.conf if spare cores are available) or scale up the hardware (more CPU, RAM) of the affected indexers.
    • Why it works: Replication is a background process that requires sufficient system resources to complete in a timely manner. When resources are scarce, it can fall behind.
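To make saturation visible at a glance, the top/htop check above can be condensed into a one-shot snapshot of the heaviest processes, where a persistently dominant splunkd stands out. A minimal sketch assuming Linux with procps ps:

```shell
#!/usr/bin/env bash
# One-shot snapshot of the top CPU consumers (header row included),
# so a saturated splunkd process is easy to spot in cron output or a ticket.
top_consumers() {
  ps -eo pcpu,pmem,comm --sort=-pcpu | head -n "${1:-6}"
}

top_consumers 6
```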
  6. Corrupt or Stuck Buckets: A bucket with damaged index files or metadata can cause the same replication job to fail repeatedly. Note that replication state is tracked by the cluster itself, not in a dedicated replication log; the evidence lives in splunkd.log.

    • Diagnosis: Inspect $SPLUNK_HOME/var/log/splunk/splunkd.log on the indexer experiencing replication issues. Look for repeated errors that name a specific bucket ID, particularly from bucket-replication components.
    • Fix: Stop Splunk on the affected peer, back up the offending bucket directory, and repair it with the splunk fsck command (or remove the bucket if it is beyond repair). Restart Splunk.
    • Why it works: Replication copies bucket files verbatim, so a corrupt source bucket fails the transfer every time. Once the bucket is repaired or removed, the master can schedule fresh, healthy copies.
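Searching splunkd.log by hand is tedious, so a small helper can surface recent replication-related errors. The grep patterns below are heuristic assumptions, since component names vary across Splunk versions; the log path is the usual default.

```shell
#!/usr/bin/env bash
# Surface the most recent replication-related errors from a splunkd.log.
# Patterns are heuristic; tighten them once you know which component
# is logging on your version.
scan_replication_errors() {
  grep -E 'ERROR|WARN' "$1" 2>/dev/null | grep -iE 'replicat|bucket' | tail -n 20
}

scan_replication_errors "${SPLUNK_HOME:-/opt/splunk}/var/log/splunk/splunkd.log"
```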
  7. Master Node Issues: While less common for peer rebalancing, if the master node is unhealthy, it can indirectly affect replication by failing to properly coordinate cluster state or assign bucket copies.

    • Diagnosis: On the master node, run splunk show cluster-status to see peer and replication health, confirm the daemon is running with splunk status, and review the master’s own internal logs.
    • Fix: Address any issues reported by the master node. This might involve restarting the master, ensuring its network connectivity, or checking its configuration.
    • Why it works: The master node is the orchestrator. If it’s not functioning correctly, cluster-wide operations, including replication, can falter.
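Peer status can also be pulled from the master’s REST API. The sketch below only builds the endpoint URL; /services/cluster/master/peers is the long-standing endpoint name (newer releases also expose a manager-named path), and the hostname and credentials in the usage comment are placeholders.

```shell
#!/usr/bin/env bash
# Build the REST URL for the master's peer list, suitable for curl.
# $1 is the master's hostname; 8089 is the default management port.
manager_peers_url() {
  echo "https://${1}:8089/services/cluster/master/peers?output_mode=json"
}

# Usage (placeholder host and credentials; -k skips TLS verification):
#   curl -sk -u admin "$(manager_peers_url cm.example.com)"
manager_peers_url cm.example.com
```

A peer missing from this list, or stuck in a non-Up status, confirms that the master rather than the peers is the place to dig.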

Once all buckets are properly replicated, you might encounter errors related to search head cluster member synchronization if that was also impacted.

Want structured learning?

Take the full Splunk course →