The Vitess primary election failed because the existing primary MySQL instance became unresponsive, and the orchestrator couldn’t find a suitable replica to promote.

Common Causes and Fixes

  1. Network Partition/Firewall Issues:

    • Diagnosis: Check connectivity from the orchestrator host to the MySQL primary and all replicas. Use nc -zv <hostname> <port> or telnet <hostname> <port>.
    • Fix: Ensure that the orchestrator’s IP address is allowed through any firewalls or network security groups to reach the MySQL ports (typically 3306) on all relevant instances. For example, if using iptables on the orchestrator host: sudo iptables -I OUTPUT -d <mysql_primary_ip> -p tcp --dport 3306 -j ACCEPT. This command explicitly allows outgoing TCP connections to the MySQL primary’s IP on port 3306.
    • Why it works: The orchestrator needs to communicate with MySQL instances to determine their health and status. Network restrictions prevent this communication, making the orchestrator believe instances are down when they might be healthy.
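The connectivity checks above can be scripted so you test every endpoint in one pass. This is a minimal sketch: the HOSTS list, the 3-second timeout, and the RUN_PROBES guard are illustrative placeholders, not part of any Vitess or orchestrator tooling.

```shell
# Probe every MySQL endpoint the orchestrator must reach.
# HOSTS is a placeholder list -- substitute your own topology.
HOSTS="primary-db:3306 replica1-db:3306 replica2-db:3306"

# Split a "host:port" endpoint into its parts (pure helpers, no network I/O).
endpoint_host() { echo "${1%:*}"; }
endpoint_port() { echo "${1##*:}"; }

probe_all() {
  for ep in $HOSTS; do
    host=$(endpoint_host "$ep"); port=$(endpoint_port "$ep")
    if nc -z -w 3 "$host" "$port" 2>/dev/null; then
      echo "OK   $ep"
    else
      echo "FAIL $ep"
    fi
  done
}

# Only probe when explicitly asked, so sourcing this file has no side effects.
if [ "${RUN_PROBES:-0}" = "1" ]; then
  probe_all
fi
```

Any FAIL line points at the firewall or routing rule to investigate first.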
  2. Orchestrator Service Unresponsive/Crashed:

    • Diagnosis: Check the orchestrator process status: ps aux | grep orchestrator. Look for errors in orchestrator logs, typically found in /var/log/orchestrator/orchestrator.log or similar.
    • Fix: Restart the orchestrator service. On systemd systems: sudo systemctl restart orchestrator. This restarts the orchestrator daemon, allowing it to re-establish its connection to the MySQL topology and attempt elections.
    • Why it works: A crashed or hung orchestrator process cannot perform its duties of monitoring and managing the MySQL topology, including initiating primary elections.
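The status check and restart can be combined into a small watchdog sketch. The systemd unit name `orchestrator` matches the restart command above; the `needs_restart` state mapping and the RUN_WATCHDOG guard are assumptions for illustration.

```shell
# Restart orchestrator when systemd reports it unhealthy.
needs_restart() {
  # systemd states that warrant a restart attempt (pure helper)
  case "$1" in
    failed|inactive) echo yes ;;
    *) echo no ;;
  esac
}

if [ "${RUN_WATCHDOG:-0}" = "1" ]; then
  # `systemctl is-active` exits nonzero for inactive units; tolerate that.
  state=$(systemctl is-active orchestrator || true)
  if [ "$(needs_restart "$state")" = "yes" ]; then
    sudo systemctl restart orchestrator
  fi
fi
```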
  3. MySQL Primary Instance Actual Crash/Unavailability:

    • Diagnosis: Attempt to connect to the MySQL primary directly from the orchestrator host: mysql -h <mysql_primary_ip> -u <user> -p. Check MySQL error logs on the primary instance itself (/var/log/mysql/error.log or similar).
    • Fix: If the primary is truly down, you’ll need to address the underlying MySQL issue (e.g., disk full, corruption, hardware failure). If it’s a transient issue and the instance is recoverable, try restarting MySQL: sudo systemctl restart mysql. If recovery is not immediate, proceed with a forced election.
    • Why it works: The orchestrator detects the primary’s unresponsiveness and initiates a failover. If the primary is truly dead, it can’t be revived by the orchestrator, necessitating the election of a new primary.
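A quick scripted version of the direct-connection test uses `mysqladmin ping`, whose success reply contains "mysqld is alive". The variable names and RUN_PING guard below are illustrative; only the `mysqladmin ping` reply format is standard.

```shell
# Decide whether the primary is reachable from mysqladmin's reply text.
classify_ping() {
  # Pure helper: maps mysqladmin output to up/down.
  case "$1" in
    *"mysqld is alive"*) echo up ;;
    *) echo down ;;
  esac
}

if [ "${RUN_PING:-0}" = "1" ]; then
  # PRIMARY_HOST / DB_USER / DB_PASS are placeholders for your environment.
  reply=$(mysqladmin -h "$PRIMARY_HOST" -u "$DB_USER" -p"$DB_PASS" ping 2>&1 || true)
  echo "primary state: $(classify_ping "$reply")"
fi
```

If this reports `down` while the network probes from cause 1 succeed, the problem is MySQL itself, not connectivity.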
  4. Replica Lag Too High for Promotion:

    • Diagnosis: In the orchestrator UI or via its API, check the SecondsBehindMaster value for all candidate replicas. If all are significantly high (e.g., > 60 seconds), the orchestrator might refuse to promote.
    • Fix: Manually intervene on the replica you want to promote. Stop replication (STOP REPLICA;), let it finish applying any remaining relay log events, and, if you accept the risk, force a failover: orchestrator-client -c force-master-failover -alias <cluster_alias>. This tells orchestrator to fail over the cluster immediately, promoting the best available candidate even though its usual safety checks would block the election. Once the promotion succeeds, clear the stale replication configuration on the new primary with RESET REPLICA ALL;.
    • Why it works: Vitess relies on replicas being reasonably up-to-date to minimize data loss during failover. If lag is too high, the orchestrator’s default safety mechanisms prevent promotion. Forcing it bypasses this check, assuming you accept the potential data loss.
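Before forcing anything, you can gate a manual promotion on lag the same way orchestrator's safety check does. In this sketch the 60-second threshold mirrors the example above; the API path and the `.SecondsBehindMaster.Int64` jq selector assume orchestrator's standard HTTP API and its JSON shape, so verify both against your deployment.

```shell
# Gate a manual promotion on replication lag.
promotable() {  # promotable <lag_seconds> <threshold_seconds>  (pure helper)
  if [ "$1" -le "$2" ]; then echo yes; else echo no; fi
}

if [ "${RUN_LAG_CHECK:-0}" = "1" ]; then
  # Pull lag from the orchestrator API (path and field name are assumptions;
  # adjust to your deployment). Requires curl and jq.
  lag=$(curl -s "http://orchestrator:3000/api/instance/$REPLICA/3306" \
        | jq '.SecondsBehindMaster.Int64')
  echo "promotable: $(promotable "$lag" 60)"
fi
```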
  5. Orchestrator Configuration Issues:

    • Diagnosis: Review the orchestrator configuration file (e.g., /etc/orchestrator/orchestrator.conf.json). Look for incorrect MySQLTopologyUser, MySQLTopologyPassword, or incorrect cluster/datacenter definitions.
    • Fix: Correct the configuration parameters in /etc/orchestrator/orchestrator.conf.json. For example, ensure MySQLTopologyUser has sufficient privileges (REPLICATION CLIENT, REPLICATION SLAVE, SUPER, RELOAD, PROCESS) and that the password is correct. After editing, restart orchestrator: sudo systemctl restart orchestrator.
    • Why it works: Incorrect credentials or topology definitions prevent orchestrator from properly discovering and communicating with the MySQL instances, leading to failed elections.
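A lightweight preflight check can confirm the credential keys exist in the config file before you restart orchestrator. This sketch only greps for the key names discussed above; it does not validate the JSON or the credentials themselves.

```shell
# Report which required orchestrator config keys are absent from a file.
config_missing_keys() {  # config_missing_keys <file>
  for key in MySQLTopologyUser MySQLTopologyPassword; do
    grep -q "\"$key\"" "$1" || echo "$key"
  done
}

# Example usage against a throwaway config:
cfg=$(mktemp)
printf '{"MySQLTopologyUser": "orc"}\n' > "$cfg"
config_missing_keys "$cfg"   # reports MySQLTopologyPassword as missing
rm -f "$cfg"
```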
  6. Underlying Storage Issues on Replicas:

    • Diagnosis: Check disk space (df -h) and I/O performance (iostat -xz 1) on potential replica candidates. Look for MySQL errors related to writing to data files.
    • Fix: Resolve storage issues (e.g., free up disk space, address I/O bottlenecks). If a replica is promoted while its storage is faulty, it simply becomes the new problematic primary. To reclaim space taken by binary logs, purge them from within MySQL rather than deleting files by hand, e.g.: PURGE BINARY LOGS BEFORE NOW() - INTERVAL 3 DAY;. Removing binlog files with rm leaves the binary log index inconsistent and can break replication.
    • Why it works: A replica cannot be promoted if it cannot write its binary logs or data files, which is essential for it to function as a primary.
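The `df -h` check can be turned into a pass/fail gate for promotion candidates. The 90% threshold and the `/var/lib/mysql` datadir below are conventional assumptions; adjust both for your layout.

```shell
# Flag a replica whose data volume is nearly full before promoting it.
disk_ok() {  # disk_ok <used_percent> <max_percent>  (pure helper)
  if [ "$1" -lt "$2" ]; then echo yes; else echo no; fi
}

if [ "${RUN_DISK_CHECK:-0}" = "1" ]; then
  # Extract the usage percentage for the MySQL datadir's filesystem.
  used=$(df --output=pcent /var/lib/mysql | tail -1 | tr -dc '0-9')
  echo "disk ok (<90% used): $(disk_ok "$used" 90)"
fi
```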
  7. Replication Filters/GTID Issues:

    • Diagnosis: Examine SHOW REPLICA STATUS on candidate replicas. Look for Replicate_Ignore_Server_Ids or Replicate_Do_DB settings that might have prevented the replica from applying all of the primary’s events, leaving it with an incomplete dataset. Check GTID consistency (gtid_executed) across instances.
    • Fix: Temporarily remove restrictive replication filters or reconfigure GTID settings if they are causing a divergence. This often requires stopping replication, clearing relevant CHANGE REPLICATION SOURCE TO parameters, and restarting. For example, to clear filters: CHANGE REPLICATION SOURCE TO IGNORE_SERVER_IDS = ();.
    • Why it works: Incorrect replication filtering can prevent a replica from receiving or processing all necessary transactions, making it an unsuitable candidate for promotion to primary. GTID inconsistencies can also break replication chains.
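When auditing many replicas, it helps to extract individual fields from the `SHOW REPLICA STATUS\G` output programmatically. This sketch is pure string parsing over canned text; the live version would pipe in `mysql -h "$replica" -e 'SHOW REPLICA STATUS\G'`, and the sample values are invented.

```shell
# Pull a single field out of SHOW REPLICA STATUS\G output (read on stdin),
# so a script can spot restrictive filters across a fleet of replicas.
replica_field() {  # replica_field <field_name>  (status text on stdin)
  awk -F': ' -v f="$1" '$1 ~ f {print $2; exit}'
}

# Example against canned output:
sample="             Replicate_Do_DB: commerce
  Replicate_Ignore_Server_Ids: 101,102"
echo "$sample" | replica_field Replicate_Ignore_Server_Ids   # -> 101,102
```

A non-empty result for a filter field marks that replica as an unsafe promotion candidate until the filter is cleared.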

The next error you’ll likely hit after a successful emergency reparent is related to the vtgate or vtctld services not recognizing the new primary, possibly showing errors like "no primary found for keyspace X shard Y" or "tabletserver not available."
