How to troubleshoot MySQL replication issues?

In MySQL, a large portion of the problems you face are various replication delays, so it's no surprise that this is one of the most common questions in DBA interviews. Yet I still find people who have trouble explaining what they would do in a given situation, which is why I decided to write a bit about the subject.

1. IO

99% of the time the reason is IO. The slave cannot keep up with the amount of writes it gets from the binary logs while, in parallel, it has to serve query results as well. Contrary to common belief, in MySQL it's much easier to saturate the disk subsystem (even a RAID 10 with 8 SAS disks and cache) than the network.
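A quick way to confirm the slave is IO-bound is to watch per-device disk utilization while replication lags (the tool and flags below are a common Linux approach, not something specific to MySQL):

```shell
# Print extended per-device stats every second; a %util column pinned
# near 100 on the data volume means the disks are saturated.
iostat -dx 1
```

If %util stays high while Seconds_Behind_Master keeps growing, IO is almost certainly the bottleneck.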

In this situation all you can do is try to remove some pressure from the slaves.

One way to do this is setting innodb_flush_log_at_trx_commit to 2 if it used to be 1. Usually this is enough.
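The setting is dynamic, so it can be changed without a restart; a minimal sketch (note the trade-off: with a value of 2 the log is flushed to disk only about once a second, so an OS crash or power loss can cost up to a second of transactions):

```sql
-- Relax InnoDB log flushing on the slave to reduce fsync pressure:
SET GLOBAL innodb_flush_log_at_trx_commit = 2;
```

To make it permanent, set `innodb_flush_log_at_trx_commit = 2` in the `[mysqld]` section of my.cnf as well.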

2. Bad query on the slave(s)

A bad query running on the slave can lock up resources, cause the box to do heavy IO, create temporary tables, and maybe even force the server into swapping.

Detection is very easy: show processlist; in the mysql console.

The easiest fix is to kill the query. This works perfectly fine in most cases.
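The sequence looks like this (the connection id below is made up for illustration):

```sql
SHOW PROCESSLIST;
-- Note the Id of the offending query, then:
KILL QUERY 12345;  -- aborts the statement but keeps the connection
-- or, to drop the whole connection:
KILL 12345;
```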

3. Transactions on the master

The second most common problem, often overlooked even by experienced DBAs, is transactions on the master that prevent the slave from progressing through the binlog. How can this happen? On the master, transactions can run next to each other as long as the rows they lock (if any) don't overlap. On the slave, however, replication is single threaded: transactions have to be applied in order. So a long-running transaction on the master can cause the slave to sit idle, waiting for that transaction to finish.

Detection of this is pretty easy. Just run show slave status a couple of times, and if you don't see Exec_Master_Log_Pos changing you know you have this problem.
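In practice (the position value shown is illustrative, not real output):

```sql
SHOW SLAVE STATUS\G
-- Among the output, watch this field:
--   Exec_Master_Log_Pos: 107325441
-- Run the command again after a few seconds; if the number hasn't
-- moved while Read_Master_Log_Pos has, the SQL thread is stuck.
```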

Fixing it is a bit harder, though. If the transaction is still running on the master you can abort it, but be aware that the rollback may take more time than letting it finish would have. If it's already in the binlog, there's not much that can be done without further patching. If, as a last resort, the query was killed on the slave and skipped, you have to make sure your data isn't corrupted. There are several tools for that, for example pt-table-sync.
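If you do end up skipping the offending event on the slave (again, last resort only), the usual sequence is:

```sql
STOP SLAVE;
SET GLOBAL sql_slave_skip_counter = 1;  -- skip exactly one event from the relay log
START SLAVE;
```

Afterwards, verify consistency against the master with pt-table-sync; host, database, and table names below are placeholders: `pt-table-sync --print --sync-to-master h=slave_host,D=mydb,t=mytable` (use --execute instead of --print once you trust the output).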

4. Network

With a lot of slaves, or a very heavily written master, transmitting the binlogs can indeed saturate the network. (Note: if you manage to saturate the network with fewer than ~10-20 slaves, replication probably isn't the best solution for your environment.)

Detecting this is very easy if you have a monitoring/trending system in place (Munin, Cacti, Graphite, Zabbix, etc.).

You can easily fix the problem by promoting one slave to an intermediate master and moving some of the slaves underneath it. This can be automated (and automating it is highly recommended) so the whole procedure can be done in seconds in case of an emergency.
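A sketch of the relay setup; hostnames, log file names, and positions are placeholders you'd obtain from the intermediate master:

```sql
-- On the slave being promoted to intermediate master, my.cnf needs
-- (restart required):
--   log_bin           = mysql-bin
--   log_slave_updates = 1

-- On each leaf slave being moved underneath it:
STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST     = 'intermediate.example.com',
  MASTER_LOG_FILE = 'mysql-bin.000042',  -- from SHOW MASTER STATUS on the intermediate
  MASTER_LOG_POS  = 4;
START SLAVE;
```

The key detail is log_slave_updates: without it, the intermediate master does not write replicated events to its own binlog, so the leaf slaves would receive nothing.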

