The primary cause of errant GTIDs in Vitess is a divergence in replication state between the MySQL primary and its replicas. Strictly speaking, an errant GTID is a transaction present in a replica’s Executed_Gtid_Set but absent from the primary’s, most often because a write was applied directly to the replica outside of replication, or because a commit was not correctly replicated or acknowledged, leaving the two servers’ Global Transaction Identifier (GTID) sets mismatched.
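The set arithmetic behind that definition can be sketched in a few lines of Python (a hypothetical helper, mirroring what MySQL's GTID_SUBTRACT() function computes server-side):

```python
def parse_gtid_set(s):
    """Parse a MySQL GTID set string such as
    '3e11fa47-71ca-11e1-9e33-c80aa9429562:1-5:7'
    into a dict of {server_uuid: set of transaction numbers}."""
    result = {}
    for part in s.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *intervals = part.split(":")
        txns = result.setdefault(uuid, set())
        for iv in intervals:
            lo, _, hi = iv.partition("-")
            txns.update(range(int(lo), int(hi or lo) + 1))
    return result

def errant_gtids(replica_executed, primary_executed):
    """Transactions the replica has executed but the primary has not --
    the errant set, as GTID_SUBTRACT(replica, primary) would report."""
    rep = parse_gtid_set(replica_executed)
    pri = parse_gtid_set(primary_executed)
    return {u: sorted(t - pri.get(u, set()))
            for u, t in rep.items() if t - pri.get(u, set())}

primary = "3e11fa47-71ca-11e1-9e33-c80aa9429562:1-100"
replica = ("3e11fa47-71ca-11e1-9e33-c80aa9429562:1-100,"
           "8a2f7b10-aaaa-bbbb-cccc-000000000001:1-2")  # direct local writes
print(errant_gtids(replica, primary))
# → {'8a2f7b10-aaaa-bbbb-cccc-000000000001': [1, 2]}
```

In production you would feed this the Executed_Gtid_Set values from SHOW REPLICA STATUS, or simply run SELECT GTID_SUBTRACT(...) on the server itself.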

Common Causes and Fixes for Errant GTIDs

  1. Network Interruption or Slow Replication:

    • Diagnosis: Check SHOW REPLICA STATUS (or SHOW SLAVE STATUS on older MySQL versions) on the replica. Look at Seconds_Behind_Source (Seconds_Behind_Master before MySQL 8.0.22) and at Last_IO_Errno/Last_SQL_Errno with their corresponding error messages. Consistently high lag indicates a throughput problem, while specific error codes (e.g., 1062 for duplicate key, 1213 for deadlock) mean the SQL thread is failing to apply events.
    • Fix: If the issue is temporary network congestion or a transient MySQL error, simply restarting the replica’s IO and SQL threads might resolve it.
      STOP REPLICA;
      START REPLICA;
      
      If the lag persists and is due to resource constraints on the replica, scale up its resources (CPU, RAM, I/O).
    • Why it works: This restarts the replication threads, allowing them to catch up from the last known consistent position or re-establish the connection and resume replication. Scaling resources directly addresses performance bottlenecks.
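The triage above can be captured as a small decision helper (a sketch; the lag threshold and the transient-error list are illustrative assumptions, not MySQL constants):

```python
# Error numbers the section above treats as transient on the SQL thread;
# an illustrative subset, not an exhaustive or authoritative list.
TRANSIENT_ERRNOS = {1062, 1213}  # duplicate key, deadlock

def triage(seconds_behind, last_errno, lag_threshold=300):
    """Suggest a next step from SHOW REPLICA STATUS fields.
    seconds_behind is Seconds_Behind_Source (None if threads stopped);
    last_errno is Last_SQL_Errno (0 means no error)."""
    if last_errno in TRANSIENT_ERRNOS:
        return "restart-threads"        # STOP REPLICA; START REPLICA;
    if seconds_behind is None:
        return "restart-threads"        # threads not running at all
    if seconds_behind > lag_threshold:
        return "scale-replica"          # persistent lag: resource problem
    return "healthy"

print(triage(5, 0))      # → healthy
print(triage(10, 1213))  # → restart-threads
print(triage(900, 0))    # → scale-replica
```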
  2. Primary MySQL Instance Restart/Crash Without Proper Sync:

    • Diagnosis: Examine the primary MySQL error logs for crash events or unexpected shutdowns. Compare the Executed_Gtid_Set on the primary with the Retrieved_Gtid_Set and Executed_Gtid_Set on the replica. An errant GTID typically arises when a transaction is committed and its GTID recorded on one server, but that server crashes or restarts before the other side has fully received and applied it (or, with semi-synchronous replication, before the acknowledgment made it back to the primary).
    • Fix: This is one of the trickiest cases. If the GTID sets have truly diverged after the restart (each server has executed transactions the other has not, rather than the replica simply lagging behind), you often need to manually reset the replica. This involves:
      • Identifying the GTID set the replica should treat as already applied. This might require manual inspection of the primary’s binary logs with mysqlbinlog, or using SHOW BINLOG EVENTS to find the last safely committed transaction before the primary’s issue.
      • On the replica, stop replication, reset its replication state, and reconfigure it to replicate from the primary’s correct position, skipping the problematic GTID.
      STOP REPLICA;
      -- Identify the GTID set *before* the errant one; this often
      -- requires careful manual analysis or specialized tools.
      -- Once the correct starting point is identified:
      RESET REPLICA ALL; -- Use with extreme caution!
      RESET MASTER; -- Clears gtid_executed so gtid_purged can be set
      SET GLOBAL gtid_purged = '<correct_gtid_set_before_errant>';
      CHANGE REPLICATION SOURCE TO SOURCE_HOST='<primary_host>', SOURCE_PORT=<primary_port>, SOURCE_USER='<replication_user>', SOURCE_PASSWORD='<replication_password>', SOURCE_AUTO_POSITION=1;
      START REPLICA;
      
      • Important Note: RESET REPLICA ALL and RESET MASTER should be used with extreme caution, as they wipe out all replication metadata and the replica’s own binary log history. Setting gtid_purged before reconnecting with SOURCE_AUTO_POSITION=1 is how you manually set the replica’s starting GTID state.
    • Why it works: This forces the replica to re-initialize its replication state: gtid_purged marks the chosen GTID set as already applied, so auto-positioning resumes replication just past the divergence point, effectively bypassing the errant GTID.
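For operators who script this, the statement sequence above can be generated mechanically (a sketch assuming MySQL 8.0-era statement names; recovery_statements is a hypothetical helper, and the rendered SQL must be reviewed by hand before execution):

```python
def recovery_statements(host, port, user, applied_gtid_set):
    """Render the replica-reset sequence as SQL strings (MySQL 8.0-era
    statement names). Credentials are deliberately omitted: supply
    SOURCE_PASSWORD interactively, never in generated text."""
    return [
        "STOP REPLICA;",
        "RESET REPLICA ALL;",   # wipes replication metadata -- destructive
        "RESET MASTER;",        # empties gtid_executed so gtid_purged can be set
        # Mark these GTIDs as already applied so the source won't resend them:
        f"SET GLOBAL gtid_purged = '{applied_gtid_set}';",
        ("CHANGE REPLICATION SOURCE TO "
         f"SOURCE_HOST='{host}', SOURCE_PORT={port}, SOURCE_USER='{user}', "
         "SOURCE_AUTO_POSITION=1;"),
        "START REPLICA;",
    ]

for stmt in recovery_statements("primary.example", 3306, "repl",
                                "3e11fa47-71ca-11e1-9e33-c80aa9429562:1-100"):
    print(stmt)
```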
  3. Manual Intervention or Misconfiguration:

    • Diagnosis: Review audit logs or operational runbooks. Was a manual SET GLOBAL gtid_purged (or a RESET MASTER) performed on either the primary or a replica? Note that gtid_executed is read-only and cannot be set directly. Check SHOW GLOBAL VARIABLES LIKE 'gtid_executed'; and SHOW GLOBAL VARIABLES LIKE 'gtid_purged'; on both instances. A mismatch here, especially if one value has been manually altered, is a strong indicator.
    • Fix: If manual changes have corrupted the GTID state, the fix is similar to the primary crash scenario: identify the correct GTID set, then reset the replica with RESET REPLICA ALL, set gtid_purged to that GTID set, and reconnect with CHANGE REPLICATION SOURCE TO ... SOURCE_AUTO_POSITION=1.
    • Why it works: Correcting gtid_purged on the replica ensures it resumes replication from a known good state, aligned with the primary’s actual committed transaction history.
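A quick first-pass check for this kind of tampering is to diff the GTID globals from both servers (gtid_variable_drift is a hypothetical helper operating on SHOW GLOBAL VARIABLES output; a differing gtid_purged alone can be normal, since servers purge binary logs independently):

```python
GTID_VARS = ("gtid_executed", "gtid_purged")

def gtid_variable_drift(primary_vars, replica_vars):
    """Given dicts of {variable_name: value} from SHOW GLOBAL VARIABLES
    on two servers, return the GTID-related variables whose values
    differ, mapped to the (primary, replica) pair. A pointer for
    closer inspection, not proof of corruption by itself."""
    return {v: (primary_vars.get(v), replica_vars.get(v))
            for v in GTID_VARS
            if primary_vars.get(v) != replica_vars.get(v)}

p = {"gtid_executed": "a:1-100", "gtid_purged": "a:1-10"}
r = {"gtid_executed": "a:1-100,b:1-2", "gtid_purged": "a:1-10"}
print(gtid_variable_drift(p, r))
# → {'gtid_executed': ('a:1-100', 'a:1-100,b:1-2')}
```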
  4. Replication Filter Issues:

    • Diagnosis: If replicate_ignore_db, replicate_wild_ignore_table, or other replication filters are in place, a transaction that should have been applied may be silently skipped by the replica. With GTID-based replication the skipped transaction’s GTID is still recorded on the replica, so the data silently diverges even though the GTID sets appear consistent. Check the Replicate_Ignore_DB and Replicate_Wild_Ignore_Table fields of SHOW REPLICA STATUS, and look for tables or databases that are not being updated on the replica.
    • Fix: Review and adjust replication filters to include all necessary transactions. If a filter was incorrectly applied, remove or modify it. Then, restart replication.
      -- Example: if 'db_to_ignore' was accidentally filtered
      STOP REPLICA;
      -- Remove the filter from my.cnf or my.ini and restart MySQL,
      -- or clear it dynamically (MySQL 5.7+):
      CHANGE REPLICATION FILTER REPLICATE_IGNORE_DB = ();
      START REPLICA;
      
    • Why it works: By ensuring filters are correctly configured, all intended transactions are now processed by the replica, preventing GTID discrepancies caused by ignored writes.
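One way to audit filters before restarting replication is to check every database you expect to replicate against the configured ignore rules (a sketch; filter_conflicts is a hypothetical helper, and its wildcard handling only approximates the server's own '%'/'_' matching):

```python
from fnmatch import fnmatch

def filter_conflicts(required_dbs, ignore_db, wild_ignore_tables):
    """List (database, offending rule) pairs for databases that are
    expected to replicate but that a filter rule would drop or
    partially drop. Only the database part of a db.table pattern is
    matched; '%' and '_' are translated to fnmatch's '*' and '?'."""
    conflicts = []
    for db in required_dbs:
        if db in ignore_db:
            conflicts.append((db, "replicate_ignore_db"))
        for pat in wild_ignore_tables:
            db_pat = pat.split(".", 1)[0].replace("%", "*").replace("_", "?")
            if fnmatch(db, db_pat):
                conflicts.append((db, f"replicate_wild_ignore_table={pat}"))
    return conflicts

print(filter_conflicts(["orders", "users"], ["users"], ["ord%.tmp_%"]))
```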
  5. gtid_mode, enforce_gtid_consistency, and log_bin Interaction:

    • Diagnosis: Ensure log_bin is enabled and gtid_mode=ON with enforce_gtid_consistency=ON on both the primary and replicas (MariaDB’s analogous safety setting is gtid_strict_mode). If log_bin is off on the primary, GTIDs cannot be generated or tracked. If enforce_gtid_consistency is off, statements that cannot be logged transactionally may slip through and lead to GTID inconsistencies under certain failure scenarios.
    • Fix: Enable binary logging and GTID mode in the my.cnf or my.ini configuration file for both primary and replica MySQL instances and restart them.
      [mysqld]
      log_bin = mysql-bin
      gtid_mode = ON
      enforce_gtid_consistency = ON
      
    • Why it works: These settings enforce the use of GTIDs for all binary log events, ensuring robust transaction tracking and preventing replication issues that arise from their absence or inconsistent application.
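A config fragment like the one above can be sanity-checked mechanically (a sketch using Python's configparser; check_gtid_config is a hypothetical helper and covers MySQL variable names only):

```python
import configparser

REQUIRED = {"gtid_mode": "on", "enforce_gtid_consistency": "on"}

def check_gtid_config(mycnf_text):
    """Return a list of problems found in a my.cnf fragment, empty if
    binary logging and GTID tracking look correctly enabled (MySQL
    variable names; MariaDB's gtid_strict_mode is not checked here)."""
    cfg = configparser.ConfigParser(allow_no_value=True)
    cfg.read_string(mycnf_text)
    mysqld = ({k.replace("-", "_"): (v or "").lower()
               for k, v in cfg.items("mysqld")}
              if cfg.has_section("mysqld") else {})
    problems = []
    if "log_bin" not in mysqld:
        problems.append("log_bin is not set: no binary log, so no GTIDs")
    for key, want in REQUIRED.items():
        if mysqld.get(key) != want:
            problems.append(f"{key} should be {want.upper()}, found {mysqld.get(key)!r}")
    return problems

sample = """
[mysqld]
log_bin = mysql-bin
gtid_mode = ON
enforce_gtid_consistency = ON
"""
print(check_gtid_config(sample))  # → []
```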
  6. Vitess VReplication or Reparent Issues:

    • Diagnosis: Examine Vitess component logs (vttablet, vtctld, and vtgate) for errors mentioning replication lag or GTID mismatches. During a resharding or table-move operation, also check the _vt.vreplication sidecar table on the target tablets, which records each VReplication stream’s position and last error.
    • Fix: If a tablet’s replication position was advanced by hand, verify that the target GTID was accurate. If VReplication streams are failing, investigate the specific errors in the vttablet logs and correct the workflow, which often means stopping and restarting it or fixing the source/target table mapping.
      # Example: inspecting and restarting a VReplication workflow
      # (exact syntax varies by Vitess version; check vtctlclient help)
      vtctlclient Workflow <keyspace>.<workflow> show
      vtctlclient Workflow <keyspace>.<workflow> stop
      vtctlclient Workflow <keyspace>.<workflow> start
      
      If the underlying MySQL replica itself has diverged, fall back to the manual procedure from item 2: identify the correct GTID set, then use RESET REPLICA and CHANGE REPLICATION SOURCE TO on that replica, guided by Vitess’s recorded state.
    • Why it works: Vitess relies on accurate GTID tracking. Correcting its internal replication state or workflow ensures that Vitess’s view of replication aligns with the underlying MySQL instances, resolving discrepancies.

After fixing errant GTIDs, the next error you’re likely to encounter is a transaction conflict: if the errant GTID wrote or modified data differently on the replica than on the primary, the replication stream may later try to apply a similar but conflicting transaction (for example, failing with duplicate-key error 1062).

Want structured learning?

Take the full Vitess course →