The primary cause of errant GTIDs in Vitess is a divergence in replication state between the MySQL primary and its secondaries: a transaction is committed and assigned a Global Transaction Identifier (GTID) on one server but is never correctly replicated to, or acknowledged by, the other, so the two servers' GTID sets no longer match.
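The divergence described above can be detected mechanically: any GTID present in a replica's `gtid_executed` set but absent from the primary's is errant. A minimal Python sketch of that comparison (the helpers here are hypothetical, for illustration only; they are not part of any MySQL client library, and the interval parsing is simplified):

```python
# Compare MySQL-style GTID sets ("uuid:1-5:7,uuid2:1-3") and report
# transactions present on the replica but absent on the primary.

def parse_gtid_set(gtid_set):
    """Expand a GTID set string into a set of (uuid, txn_no) pairs."""
    txns = set()
    for uuid_part in gtid_set.replace("\n", "").split(","):
        if not uuid_part.strip():
            continue
        uuid, *intervals = uuid_part.strip().split(":")
        for interval in intervals:
            lo, _, hi = interval.partition("-")
            for n in range(int(lo), int(hi or lo) + 1):
                txns.add((uuid, n))
    return txns

def errant_gtids(replica_executed, primary_executed):
    """GTIDs executed on the replica that the primary has never seen."""
    return parse_gtid_set(replica_executed) - parse_gtid_set(primary_executed)

primary = "3e11fa47-71ca-11e1-9e33-c80aa9429562:1-5"
replica = ("3e11fa47-71ca-11e1-9e33-c80aa9429562:1-5,"
           "8a6f8f6b-0000-0000-0000-000000000001:1")
print(errant_gtids(replica, primary))
# One transaction written directly on the replica is flagged as errant.
```

MySQL can do this comparison server-side with `GTID_SUBTRACT(replica_set, primary_set)`; the sketch just makes the set arithmetic explicit.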
Common Causes and Fixes for Errant GTIDs
- Network Interruption or Slow Replication:
  - Diagnosis: Run `SHOW REPLICA STATUS` (or `SHOW SLAVE STATUS` on older MySQL versions) on the secondary. Look at `Seconds_Behind_Source` (`Seconds_Behind_Master` before MySQL 8.0.22) and at `Last_IO_Errno`/`Last_SQL_Errno` and their corresponding error messages. A consistently high lag value or specific error codes (e.g., 1062 for duplicate key, 1213 for deadlock) indicates replication lag or applier errors.
  - Fix: If the issue is temporary network congestion or a transient MySQL error, simply restarting the replica's IO and SQL threads may resolve it. If the lag persists and is due to resource constraints on the replica, scale up its resources (CPU, RAM, I/O).

    ```sql
    STOP REPLICA;
    START REPLICA;
    ```

  - Why it works: This restarts the replication threads, allowing them to re-establish the connection and resume replication from the last known consistent position. Scaling resources directly addresses performance bottlenecks.
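The diagnosis above can be reduced to a small triage rule. A hedged Python sketch (the `triage` helper and its error-code list are hypothetical assumptions, not a MySQL API; tune the thresholds and errno set for your environment):

```python
# Given fields from SHOW REPLICA STATUS, decide whether a simple
# STOP REPLICA / START REPLICA is a reasonable first step or whether
# the error needs manual investigation.

# Error codes that are usually transient: 1213 deadlock,
# 2003 can't connect, 2013 lost connection. 0 means "no error".
TRANSIENT_ERRNOS = {0, 1213, 2003, 2013}

def triage(status, lag_threshold=300):
    lag = status.get("Seconds_Behind_Source") or 0
    errno = status.get("Last_SQL_Errno") or status.get("Last_IO_Errno") or 0
    if errno and errno not in TRANSIENT_ERRNOS:
        return "investigate"      # e.g. 1062 duplicate key: likely data divergence
    if errno:
        return "restart_threads"  # transient: STOP REPLICA; START REPLICA;
    if lag > lag_threshold:
        return "check_resources"  # persistent lag: scale CPU/RAM/I/O
    return "healthy"

print(triage({"Seconds_Behind_Source": 5, "Last_SQL_Errno": 1213}))
```

The point of the split is that a duplicate-key error (1062) means the replica already diverged, so blindly restarting the threads will just hit the same row again.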
- Primary MySQL Instance Restart/Crash Without Proper Sync:
  - Diagnosis: Examine the primary MySQL error log for crash events or unexpected shutdowns. Compare the `Executed_Gtid_Set` on the primary with the `Retrieved_Gtid_Set` and `Executed_Gtid_Set` on the secondary (shown in `SHOW REPLICA STATUS`). An errant GTID typically arises when the primary commits a transaction and records its GTID, but then crashes or restarts before the secondary has fully received and applied it.
  - Fix: This is one of the trickiest cases. If the primary has restarted and its `Executed_Gtid_Set` is now ahead of what the secondary has applied, you often need to manually reset the replica. This involves:
    - Identifying the GTID set the replica should have applied up to. This may require manual inspection of the primary's binary logs, for example with `SHOW BINLOG EVENTS` or `mysqlbinlog`, to find the last safely committed transaction before the primary's issue.
    - On the replica, stopping replication, resetting its replication state, seeding the correct GTID set, and reconfiguring it to replicate from the primary.

    ```sql
    STOP REPLICA;
    RESET REPLICA ALL;  -- Use with extreme caution!
    RESET MASTER;       -- Clears the replica's own binary logs and gtid_executed
    -- Declare which GTIDs the replica is considered to have already applied:
    SET GLOBAL gtid_purged = '<correct_gtid_set_before_errant>';
    CHANGE REPLICATION SOURCE TO
      SOURCE_HOST = '<primary_host>',
      SOURCE_PORT = <primary_port>,
      SOURCE_USER = '<replication_user>',
      SOURCE_PASSWORD = '<replication_password>',
      SOURCE_AUTO_POSITION = 1;
    START REPLICA;
    ```

    - Important Note: `RESET REPLICA ALL` should be used with extreme caution, as it wipes out all replication metadata, and `RESET MASTER` discards the replica's own binary logs. Seeding `gtid_purged` before enabling `SOURCE_AUTO_POSITION` is the key to manually setting the GTID position replication resumes from; identifying the correct set often requires careful manual analysis or specialized tools.
  - Why it works: This forces the replica to re-initialize its replication state and explicitly tells it which GTIDs it already holds, so auto-positioning resumes from the correct point and bypasses the errant GTID that caused the divergence.
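Constructing the `gtid_purged` value to seed often means removing the errant transaction from an interval in the replica's executed set. A simplified Python sketch of that interval surgery (the `remove_txn` helper is hypothetical and handles only single-transaction removal; MySQL's `GTID_SUBTRACT` function does the general case server-side):

```python
# Given a GTID set string and one errant transaction, produce the set
# to seed via SET GLOBAL gtid_purged = '...'.

def remove_txn(gtid_set, uuid, txn):
    """Drop one transaction number from the named UUID's intervals."""
    parts = []
    for uuid_part in gtid_set.split(","):
        u, *intervals = uuid_part.strip().split(":")
        if u != uuid:
            parts.append(uuid_part.strip())
            continue
        kept = []
        for iv in intervals:
            lo, _, hi = iv.partition("-")
            lo, hi = int(lo), int(hi or lo)
            if not (lo <= txn <= hi):
                kept.append(iv)          # interval untouched
                continue
            if lo < txn:                 # keep the portion below txn
                kept.append(f"{lo}-{txn - 1}" if txn - 1 > lo else str(lo))
            if txn < hi:                 # keep the portion above txn
                kept.append(f"{txn + 1}-{hi}" if hi > txn + 1 else str(hi))
        if kept:
            parts.append(":".join([u] + kept))
    return ",".join(parts)

uuid = "3e11fa47-71ca-11e1-9e33-c80aa9429562"
print(remove_txn(f"{uuid}:1-10", uuid, 10))
# The errant trailing transaction is trimmed from the interval.
```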
- Manual Intervention or Misconfiguration:
  - Diagnosis: Review audit logs or operational runbooks. Was a manual `SET GLOBAL gtid_purged` (or any direct tampering with the GTID state) performed on either the primary or the secondary? Check `SHOW GLOBAL VARIABLES LIKE 'gtid_executed';` and `SHOW GLOBAL VARIABLES LIKE 'gtid_purged';` on both instances. A mismatch here, especially where one side has been manually altered, is a strong indicator. (Note that `gtid_executed` is read-only and cannot be set directly.)
  - Fix: If manual changes have corrupted the GTID set, the fix is similar to the primary-crash scenario: identify the correct GTID set, reset the replica with `RESET REPLICA ALL`, seed `gtid_purged` with the correct set, and reconfigure replication with `CHANGE REPLICATION SOURCE TO ... SOURCE_AUTO_POSITION = 1`.
  - Why it works: Seeding the correct GTID set on the replica ensures it resumes replication from a known good state, aligned with the primary's actual committed transaction history.
- Replication Filter Issues:
  - Diagnosis: If `replicate_ignore_db`, `replicate_wild_ignore_table`, or other replication filters are in place, a transaction that should have been replicated might be silently skipped by the replica due to a filter rule, leaving the replica's data diverged from the primary's. Check the filter columns in `SHOW REPLICA STATUS` (e.g., `Replicate_Ignore_DB`, `Replicate_Wild_Ignore_Table`) and look for specific tables or databases that are not being updated on the replica.
  - Fix: Review and adjust the replication filters so that all necessary transactions are applied. If a filter was applied incorrectly, remove or modify it, then restart replication. Filters set in `my.cnf`/`my.ini` require a server restart; otherwise they can be changed dynamically with `CHANGE REPLICATION FILTER`:

    ```sql
    -- Example: 'db_to_ignore' was accidentally filtered
    STOP REPLICA;
    CHANGE REPLICATION FILTER REPLICATE_IGNORE_DB = ();  -- clear (or adjust) the list
    START REPLICA;
    ```

  - Why it works: With the filters correctly configured, all intended transactions are processed by the replica, preventing the discrepancies caused by ignored writes.
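When auditing filters, it helps to check programmatically whether a given table falls under a wildcard ignore pattern. A sketch of that check (the `is_filtered` helper is hypothetical; it maps MySQL's `%`/`_` wildcards onto `fnmatch`'s `*`/`?`, which is close enough for an audit but not byte-for-byte identical to the server's matching):

```python
# Does db.table fall under a replicate_wild_ignore_table pattern?
from fnmatch import fnmatchcase

def is_filtered(db, table, wild_ignore_patterns):
    name = f"{db}.{table}"
    return any(
        # MySQL: % = any sequence, _ = any single character
        fnmatchcase(name, p.replace("%", "*").replace("_", "?"))
        for p in wild_ignore_patterns
    )

print(is_filtered("db_to_ignore", "orders", ["db_to_ignore.%"]))
```

Running such a check over the replica's information_schema table list quickly reveals which tables a misconfigured pattern is silently swallowing.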
- GTID Mode and Binary Logging Configuration:
  - Diagnosis: Ensure `gtid_mode = ON`, `enforce_gtid_consistency = ON`, and binary logging (`log_bin`) are enabled on both the primary and the replicas. (On MariaDB the comparable setting is `gtid_strict_mode`.) If `log_bin` is off on the primary, GTIDs cannot be generated or tracked; if GTID consistency enforcement is off, MySQL can accept statements that lead to GTID inconsistencies under certain failure scenarios.
  - Fix: Enable these settings in the `my.cnf` or `my.ini` configuration file for both primary and replica MySQL instances and restart them:

    ```ini
    [mysqld]
    log_bin = mysql-bin
    gtid_mode = ON
    enforce_gtid_consistency = ON
    ```

  - Why it works: These settings ensure every transaction is written to the binary log with a GTID, enforcing robust transaction tracking and preventing replication issues that arise from missing or inconsistently applied GTIDs.
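A configuration audit like the one above can be scripted. A minimal sketch (the `check_mysqld_config` parser is a hypothetical illustration; real MySQL option files also support `!include` directives, bare options without `=`, and other syntax this does not handle):

```python
# Verify that a my.cnf-style [mysqld] section enables binary logging
# and the GTID prerequisites.

REQUIRED = {"gtid_mode": "ON", "enforce_gtid_consistency": "ON"}

def check_mysqld_config(text):
    section, opts = None, {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # strip comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
        elif "=" in line and section == "mysqld":
            key, value = (s.strip() for s in line.split("=", 1))
            opts[key] = value
    problems = [k for k, v in REQUIRED.items() if opts.get(k, "").upper() != v]
    if "log_bin" not in opts and "log-bin" not in opts:
        problems.append("log_bin")
    return problems  # empty list: GTID prerequisites are present

cfg = "[mysqld]\nlog_bin = mysql-bin\ngtid_mode = ON\nenforce_gtid_consistency = ON\n"
print(check_mysqld_config(cfg))
```

Running this against every node's config before enabling GTID-based failover catches the "one replica still has `gtid_mode = OFF`" class of mistake early.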
- Vitess VReplication Issues:
  - Diagnosis: Examine the Vitess component logs (`vttablet`, `vtctld`, and `vtgate`) for errors related to replication lag or GTID mismatches. If you are in a resharding or materialization operation, check the workflow state recorded in the `_vt.vreplication` table on the target tablets, or inspect it with `vtctlclient Workflow <keyspace>.<workflow> show`.
  - Fix: If VReplication jobs are failing, investigate the specific errors in the `vttablet` logs and correct the workflow. Often this involves stopping and restarting the workflow, or correcting the source/destination table mapping:

    ```sh
    # Example: stopping and restarting a VReplication workflow
    vtctlclient Workflow <keyspace>.<workflow> stop
    vtctlclient Workflow <keyspace>.<workflow> start
    ```

    If a tablet's replication position was manually advanced, verify that the target GTID was accurate; recovery may require `RESET REPLICA` and `CHANGE REPLICATION SOURCE TO` on the underlying MySQL replica, guided by Vitess's recorded state.
  - Why it works: Vitess relies on accurate GTID tracking. Correcting its internal replication state or workflows ensures that Vitess's view of replication aligns with the underlying MySQL instances, resolving the discrepancy.
After fixing errant GTIDs, the next error you are likely to encounter is a transaction conflict: if the errant GTID caused data to be written or modified differently on the replica than on the primary, the replication stream may attempt to re-apply a similar but conflicting transaction, typically surfacing as a duplicate-key error (1062).