A peer indexer in the Splunk indexer cluster failed to replicate buckets, causing data distribution to become uneven across the cluster.
## Common Causes and Fixes
- **Network Connectivity Issues**: The most frequent culprit is intermittent or blocked network traffic between indexers, which prevents replication.
  - Diagnosis: On the indexer that's not receiving buckets, run `netstat -anp | grep <peer_ip>:<replication_port>` to see whether connections are established. On the indexer that should be sending, capture traffic with `sudo tcpdump -i <interface> host <peer_ip> and port <replication_port>`.
  - Fix: Ensure firewalls (e.g., `sudo ufw status`, `sudo iptables -L`) on both the sending and receiving indexers allow traffic on the replication port between cluster peers. Note that the replication port is set explicitly in `server.conf` (9887 is a common choice); 8089 is the management port, not the replication port. If using cloud security groups, verify both inbound and outbound rules.
  - Why it works: Replication relies on direct TCP connections between indexers. If the network path is blocked, buckets cannot be transferred.
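The connectivity check above can be scripted so it's easy to run against every peer. A minimal sketch, assuming bash with `/dev/tcp` support and the `timeout` utility; the IP and port in the usage comment are placeholders, not values from any real cluster:

```shell
#!/bin/bash
# Probe a peer's replication port over TCP.
check_replication_port() {
    peer="$1"; port="$2"
    # bash's /dev/tcp pseudo-device attempts a TCP connect when opened.
    if timeout 3 bash -c "exec 3<>/dev/tcp/${peer}/${port}" 2>/dev/null; then
        echo "OPEN: ${peer}:${port}"
    else
        echo "BLOCKED: ${peer}:${port} (check firewalls and security groups)"
        return 1
    fi
}

# Example (placeholder values): check_replication_port 10.0.0.12 9887
```

Run it from each sender toward each receiver; a `BLOCKED` result points at the network path rather than Splunk itself.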
- **Insufficient Disk Space on the Receiving Indexer**: If the indexer that's supposed to receive buckets is out of disk space, replication will fail.
  - Diagnosis: Run `df -h` on the receiving indexer. Look for the partition holding `$SPLUNK_HOME/var/lib/splunk` being at or near 100% usage.
  - Fix: Free up space by archiving old buckets or adding more storage. If space is the issue and you need to relocate data, stop Splunk on the affected peer, move the index data to a different partition with more space, update the relevant `indexes.conf` to point to the new location, and restart Splunk. (On clustered peers, index settings are normally distributed through the master's configuration bundle, so make the change there rather than editing files under `slave-apps` directly.)
  - Why it works: Splunk needs free disk space to write new bucket data during replication.
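To monitor this continuously, the `df` check can be wrapped in a small script. A sketch; the default path and the 90% threshold are assumptions, so substitute your actual `$SPLUNK_HOME/var/lib/splunk` location and whatever headroom you want:

```shell
#!/bin/bash
# Report disk usage for the filesystem holding a given path and warn
# when it crosses a threshold.
disk_usage_pct() {
    # df -P guarantees one line per filesystem; field 5 is "Use%".
    df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

check_splunk_disk() {
    path="${1:-/opt/splunk/var/lib/splunk}"   # assumed default location
    threshold="${2:-90}"                      # assumed warning threshold
    pct="$(disk_usage_pct "$path")"
    if [ "$pct" -ge "$threshold" ]; then
        echo "WARNING: $path is ${pct}% full (>= ${threshold}%)"
        return 1
    fi
    echo "OK: $path is ${pct}% full"
}
```

A nonzero exit from `check_splunk_disk` makes it easy to wire into cron or a monitoring agent.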
- **`replication_port` Mismatch in `server.conf`**: A simple typo or incorrect configuration in `server.conf` can prevent indexers from accepting replicated data.
  - Diagnosis: On each indexer, check `$SPLUNK_HOME/etc/system/local/server.conf` (or `$SPLUNK_HOME/etc/apps/<app>/local/server.conf`) for the `[replication_port://<port>]` stanza. Ensure the port is consistent across all peers and matches the port Splunk is actually listening on, which you can verify with `sudo ss -tulnp | grep splunkd`.
  - Fix: Edit `server.conf` on all indexers to use the same, correct replication port, for example:

    ```
    [replication_port://9887]
    ```

    Restart Splunk on all affected peers.
  - Why it works: This is the port each peer listens on to receive replicated bucket data. A mismatch means peers can't establish connections.
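If you can gather each peer's `server.conf` (for example, copied over SSH), the consistency check can be automated. A sketch, assuming the port is declared in a `[replication_port://<port>]` stanza as above:

```shell
#!/bin/bash
# Extract the replication port from a server.conf-style file by matching
# the [replication_port://<port>] stanza header.
get_replication_port() {
    sed -n 's/^\[replication_port:\/\/\([0-9][0-9]*\)\]/\1/p' "$1" | head -n 1
}

# Given several server.conf files (one per peer), report whether they
# all declare the same replication port.
ports_consistent() {
    first=""
    for conf in "$@"; do
        port="$(get_replication_port "$conf")"
        [ -n "$first" ] || first="$port"
        if [ "$port" != "$first" ]; then
            echo "MISMATCH: $conf declares ${port:-none}, expected $first"
            return 1
        fi
    done
    echo "CONSISTENT: all peers use replication port $first"
}
```

Usage would be `ports_consistent peer1.conf peer2.conf peer3.conf` after collecting the files locally.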
- **Cluster Peer Configuration Errors in `server.conf`**: Incorrect clustering settings in `server.conf` can leave a peer trying to contact the wrong hosts. (Note: indexer cluster peers don't list each other directly; each peer points at the master node, which coordinates the peer list.)
  - Diagnosis: Review the `[clustering]` stanza in `$SPLUNK_HOME/etc/system/local/server.conf` on each indexer. Ensure `master_uri` names the correct master node host and management port, and that `pass4SymmKey` matches across the cluster.
  - Fix: Correct any wrong host or port in `master_uri` to match the master node's actual address, and fix any mismatched `pass4SymmKey`. Restart Splunk on affected peers.
  - Why it works: The `[clustering]` stanza defines how each peer reaches the master node, which in turn tells peers where to replicate buckets. If a peer can't reach the master, or reaches the wrong one, replication assignments never arrive.
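One related check worth scripting: a clustered peer locates the rest of the cluster through the `master_uri` it has configured, so verifying that the URI it names is actually reachable catches most addressing mistakes. A sketch, assuming bash and the usual `https://host:port` form of the URI:

```shell
#!/bin/bash
# Pull master_uri out of a [clustering] stanza in a server.conf-style file.
get_master_uri() {
    awk -F' *= *' '
        /^\[/                            { in_stanza = ($0 == "[clustering]") }
        in_stanza && $1 == "master_uri"  { print $2 }
    ' "$1"
}

# Test whether the host:port named by master_uri accepts TCP connections.
check_master_reachable() {
    uri="$(get_master_uri "$1")"
    hostport="${uri#*://}"        # strip the scheme
    host="${hostport%%:*}"
    port="${hostport##*:}"
    if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
        echo "REACHABLE: $uri"
    else
        echo "UNREACHABLE: $uri"
        return 1
    fi
}
```

An `UNREACHABLE` result means either the configured host/port is wrong or the network path to it is blocked; either way, replication coordination will fail.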
- **Indexer Resource Saturation (CPU/Memory)**: An overloaded indexer may be too slow to respond to replication requests or to process incoming data, causing timeouts and missed buckets.
  - Diagnosis: Use `top` or `htop` to monitor CPU and memory usage on the indexers; look for `splunkd` processes consistently consuming high resources. Also search Splunk's internal logs (the `_internal` index) for errors related to replication or high latency.
  - Fix: Tune the indexing pipeline (for example, `parallelIngestionPipelines` in `server.conf`) or scale up the hardware (more CPU, RAM) of the affected indexers.
  - Why it works: Replication is a background process that requires sufficient system resources to complete in a timely manner. When resources are scarce, it falls behind.
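As a quick first pass before digging into `top`, you can compare the load average against the core count. A Linux-specific sketch (it reads `/proc/loadavg`); the comparison point is an assumption, and sustained load is what matters, not a single sample:

```shell
#!/bin/bash
# Flag the host as saturated when the 1-minute load average exceeds
# the number of CPU cores.
cpu_saturation() {
    cores="$(nproc)"
    load="$(cut -d' ' -f1 /proc/loadavg)"
    # awk handles the floating-point comparison that [ ] cannot.
    if awk -v l="$load" -v c="$cores" 'BEGIN { exit !(l > c) }'; then
        echo "SATURATED: load $load exceeds $cores cores"
        return 1
    fi
    echo "OK: load $load on $cores cores"
}
```

If this reports `SATURATED` for minutes at a time on a peer, replication delays are likely a symptom of the load, not a cause.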
- **Replication Errors in the Internal Logs**: Splunk's internal logs record bucket replication activity, and repeated errors there usually pinpoint the failing peer or bucket.
  - Diagnosis: Inspect `$SPLUNK_HOME/var/log/splunk/splunkd.log` on the indexer experiencing replication issues and search it for clustering and replication messages. Look for repeated errors or unusual patterns; the same events are also searchable in the `_internal` index.
  - Fix: Address the specific error the log reports (network, disk, or configuration, per the items above). If the log files themselves have grown enormous or become corrupted, you can stop Splunk on the affected indexer, back up and rename them, and restart; Splunk recreates them. Note that rotating logs alone does not repair replication state.
  - Why it works: The internal logs record exactly which replication operations are failing and why, so they let you target the underlying cause instead of guessing.
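If you do back up and rotate internal log files while Splunk is stopped, a small helper keeps the backups timestamped so nothing is lost. A sketch; the path in the usage comment is illustrative:

```shell
#!/bin/bash
# Rename a log file to a timestamped .bak, preserving it as a backup.
# Rotate only while splunkd is stopped.
rotate_log() {
    log="$1"
    [ -f "$log" ] || { echo "SKIP: $log not found"; return 0; }
    stamp="$(date +%Y%m%d%H%M%S)"
    mv "$log" "${log}.${stamp}.bak"
    echo "ROTATED: $log -> ${log}.${stamp}.bak"
}

# Example (illustrative path): rotate_log /opt/splunk/var/log/splunk/splunkd.log
```

Keeping the `.bak` copy matters: if the problem recurs, the old log is your evidence of when it started.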
- **Master Node Issues**: While less common for peer rebalancing, an unhealthy master node can indirectly affect replication by failing to coordinate cluster state or assign bucket copies.
  - Diagnosis: Check cluster health with `splunk show cluster-status` on the master node, or log into the master and check `splunkd` status and its internal logs.
  - Fix: Address any issues the master node reports. This might involve restarting the master, ensuring its network connectivity, or checking its configuration.
  - Why it works: The master node is the orchestrator. If it's not functioning correctly, cluster-wide operations, including replication, can falter.
Once all buckets are properly replicated, you might encounter errors related to search head cluster member synchronization if that was also impacted.