In a MySQL replication deployment, the master is a single point of failure. To recover after the failure of this critical component, a common solution is to promote a slave to be the new master. However, when doing so using classic methods, the slaves need to be reconfigured. This is a tedious operation in which many things can go wrong. We found a simpler way to achieve master promotion using Binlog Server. Read on for more details.
When a master fails in a MySQL replication deployment, the classic way to promote a slave to be the new master is the following:
- Find the most up-to-date slave.
- If the most up-to-date slave is not a good candidate master, level a suitable candidate with the most up-to-date slave [1].
- Repoint the remaining slaves to the new master.
The procedure above needs to contact all slaves in step #1, and to reconfigure all slaves in step #3. This becomes increasingly complex in Booking.com environments where we have very wide, and still growing, replication topologies; it is not uncommon to have more than fifty (and sometimes more than a hundred) slaves replicating from the same master. Many things can go wrong when tens of slaves need to be contacted and reconfigured:
- some slaves might be down for maintenance or for taking a backup,
- some slaves could be temporarily unreachable for other reasons,
- and a few slaves could be processing a big backlog of relay logs (including delayed slaves), which will make them hard/unsuitable to reconfigure.
A way to reduce the complexity of master promotion is presented below, but to get there, we must first give some context about Binlog Servers and abstract them into a service.
Reminders about Binlog Servers
In a previous post, I described how to take advantage of the Binlog Server to perform master promotion without GTIDs and without log-slave-updates, while still requiring all slaves to be reconfigured. To do this, the slaves must replicate through a Binlog Server. This gives us the following deployment with a single Binlog Server:
    +---+
    | A |
    +---+
      |
     / \
    / X \
    -----
      |
      +----------+----------+----------+----------+----------+
      |          |          |          |          |          |
    +---+      +---+      +---+      +---+      +---+      +---+
    | B |      | C |      | D |      | E |      | F |      | G |
    +---+      +---+      +---+      +---+      +---+      +---+
or with redundant Binlog Servers:
    +---+
    | A |
    +---+
      |
      +--------------------------------+
      |                                |
     / \                              / \
    / X \                            / Y \
    -----                            -----
      |                                |
      +----------+----------+          +----------+----------+
      |          |          |          |          |          |
    +---+      +---+      +---+      +---+      +---+      +---+
    | B |      | C |      | D |      | E |      | F |      | G |
    +---+      +---+      +---+      +---+      +---+      +---+
or with more than one site with redundant Binlog Servers:
                  +---+
                  | A |
                  +---+
                    |
        +-----------+------------------------+
        |           |                        |
       / \         / \                      / \          / \
      / X \       / Y \                    / Z \------->/ W \
      -----       -----                    -----        -----
        |           |                        |            |
        +-----------+                        +------------+
        |           |                        |            |
      +---+       +---+                    +---+        +---+
      | S1|  ...  | Sn|                    | T1|  ...   | Tm|
      +---+       +---+                    +---+        +---+
These diagrams are becoming increasingly complex, so let's simplify them by abstracting the Binlog Servers.
Binlog Server Abstraction
By hiding the Binlog Servers in an abstracted layer, which I call the Distributed Binlog Serving Server (DBSS), a deployment on three sites becomes the following:
       +---+
       | M |
       +---+
         |
    +----+----------------------------------------------------------+
    |                                                                |
    +----+---------+-----------+---------+-----------+---------+----+
         |         |           |         |           |         |
       +---+     +---+       +---+     +---+       +---+     +---+
       | S1| ... | Sn|       | T1| ... | Tm|       | U1| ... | Uo|
       +---+     +---+       +---+     +---+       +---+     +---+
Of course, the DBSS is built with many Binlog Servers. One way to build this layer, while minimizing the number of connections served directly by the master, is described below. Other ways to build this layer can be imagined [2], but let's stick to this one for now.
    +----|----------------------------------------------------------+
    |    +---------------------+---------------------+              |
    |    |                     |                     |              |
    |   / \                   / \                   / \             |
    |  / X1\----->/ \        / X2\----->/ \        / X3\----->/ \   |
    |  -----     / Y1\       -----     / Y2\       -----     / Y3\  |
    |    |       -----         |       -----         |       -----  |
    +----|---------|-----------|---------|-----------|---------|----+
In the deployment above, using one DNS A record per site resolving to both Xi and Yi, if a Binlog Server fails, its slaves will reconnect to the other one. If the Yi Binlog Server fails, nothing more needs to be done. If the Xi Binlog Server fails, the corresponding Yi must be repointed to the master. This repointing is easy, as, by design, a Binlog Server is identical to its master. Only the destination server must be changed, and the binary log filename and position stay the same.
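To make this concrete, here is a minimal sketch of what the repointing amounts to, expressed as standard replication commands (the hostname and coordinates are purely illustrative, and how the command is actually issued depends on the Binlog Server implementation):

    -- On Y2, after X2 has failed: only the destination host changes,
    -- the binary log file name and position are kept exactly as they were.
    STOP SLAVE;
    CHANGE MASTER TO
        MASTER_HOST     = 'master.example.com',  -- hypothetical address of M, the real master
        MASTER_LOG_FILE = 'binlog.000042',       -- the file Y2 had already reached
        MASTER_LOG_POS  = 123456;                -- and the matching position
    START SLAVE;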
When the Master Fails...
Equipped with the above implementation of the DBSS, when the master fails we end up with the state below, where each site might be at a different position in the binary log stream of the failed master.
    +---------------------------------------------------------------+
    |                                                                |
    |   / \                   / \                   / \             |
    |  / X1\----->/ \        / X2\----->/ \        / X3\----->/ \   |
    |  -----     / Y1\       -----     / Y2\       -----     / Y3\  |
    |    |       -----         |       -----         |       -----  |
    +----|---------|-----------|---------|-----------|---------|----+
The first step of master promotion is to level the Binlog Servers in the DBSS. To do so, the most up-to-date Binlog Server must be found and all the other Binlog Servers must be chained to it. In the deployment above, only three servers must be contacted, which is much easier than tens of slaves. If the most up-to-date Binlog Server is X2, levelling the Binlog Servers results in the temporary replication architecture below.
    +---------------------------------------------------------------+
    |                                                                |
    |   / \ <-----------------/ \-----------------> / \             |
    |  / X1\----->/ \        / X2\----->/ \        / X3\----->/ \   |
    |  -----     / Y1\       -----     / Y2\       -----     / Y3\  |
    |    |       -----         |       -----         |       -----  |
    +----|---------|-----------|---------|-----------|---------|----+
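How the most up-to-date Binlog Server is identified depends on the implementation, but it boils down to comparing binary log positions. A minimal sketch, assuming Binlog Servers that answer SHOW MASTER STATUS (the upcoming Binlog Router described at the end of this post does; otherwise the position can be read from the binary log files on disk):

    -- Run on X1, X2 and X3 (the Yi are chained to their Xi, so they cannot be ahead),
    -- then compare the File/Position pairs: the server with the highest coordinates is
    -- the most up-to-date one, and the other two are chained to it (keeping, as before,
    -- their own binary log file name and position when issuing the repointing).
    SHOW MASTER STATUS;
    -- e.g. X1: binlog.000163 / 1010,  X2: binlog.000163 / 4242,  X3: binlog.000163 / 2021
    --      => X2 is the most up-to-date: X1 and X3 are repointed to X2.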
Levelling should happen very quickly (if it does not, one of the Binlog Servers is lagging, which should not happen). After that, the slaves will quickly follow. Once a slave is up to date, master promotion can be performed (strictly speaking, this does not require levelling: a slave of X2 or Y2 could have been promoted before levelling). Shown below, a slave from the third site on the right has been chosen to be the new master, but any slave on any of the three sites could have been used.
    +------------------------------------------------|--------------+
    |    +---------------------+---------------------+              |
    |    |                     |                     |              |
    |   / \                   / \                   / \             |
    |  / X1\----->/ \        / X2\----->/ \        / X3\----->/ \   |
    |  -----     / Y1\       -----     / Y2\       -----     / Y3\  |
    |    |       -----         |       -----         |       -----  |
    +----|---------|-----------|---------|-----------|---------|----+
Note that the other slaves have not been touched: they are still connected to their Binlog Server. This means that this solution works well even if one of the slaves is unavailable during master promotion. This solution also works very well with delayed or lagging slaves, as those slaves are simply not good candidates for becoming the new master. For some time, the lagging slaves will process the binary logs of the old master that are still stored on the Binlog Servers.
The Trick for Not Reconfiguring Every Slave
Promoting a slave to be the new master in a DBSS deployment requires working some magic on that slave to make its binary log position (SHOW MASTER STATUS) match what is expected by the Binlog Servers. Let's take an example: if the last binary log stored on the levelled Binlog Servers is binlog.000163, we can repoint the Binlog Servers to a new master if the SHOW MASTER STATUS of this new master is at the beginning of binary log file binlog.000164.

When doing that promotion, from the point of view of the Binlog Servers, their master is simply restarted with a different server_id and server_uuid. From the point of view of the slaves, they are processing the binary logs of the old master (up to and including binlog.000163) followed by the binary logs of the new master (starting at binlog.000164).
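From a downstream slave, this continuity can be observed directly in SHOW SLAVE STATUS; a sketch with the file names of the running example:

    -- While the slave is still applying the old master's binary logs:
    SHOW SLAVE STATUS\G
    --   Relay_Master_Log_File: binlog.000163   (last file of the old master)
    -- A little later, without any reconfiguration of the slave:
    SHOW SLAVE STATUS\G
    --   Relay_Master_Log_File: binlog.000164   (first file of the new master)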
So, the trick is to have our candidate master at the right binary log position. This can be made possible by:
- configuring all nodes with binary logging enabled,
- using the same log-bin value everywhere (binlog in the example above),
- and not enabling log-slave-updates.
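As a quick sanity check, this configuration can be verified on every node; a sketch, with the values expected in the example above:

    SHOW VARIABLES LIKE 'log_bin';            -- expected: ON
    SHOW VARIABLES LIKE 'log_bin_basename';   -- expected: a path ending in /binlog on every node
    SHOW VARIABLES LIKE 'log_slave_updates';  -- expected: OFF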
Configuration #3 above allows us to assume that the master will consume binary log filenames much faster than the slaves (without log-slave-updates, the slaves write almost nothing to their own binary logs). This way, the slaves will always be behind the master in their binary log filenames [3]. As such, bringing a slave to the right binary log filename is as simple as running FLUSH BINARY LOGS in a loop until the slave is at the correct position. To prevent this loop from taking too much time, we can run a cron job on our slaves that makes sure they are never too far away from their master (at most ten binary logs away, for example).
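A sketch of the check such a cron job could run on each slave: because log-slave-updates is disabled, the slave's own binary log file number (SHOW MASTER STATUS) can be compared with the file its master is currently writing (Master_Log_File in SHOW SLAVE STATUS), and FLUSH BINARY LOGS is issued when the gap grows too large (the output below is illustrative):

    SHOW MASTER STATUS;   -- e.g. File: binlog.000150  (this slave's own binary log)
    SHOW SLAVE STATUS\G   -- e.g. Master_Log_File: binlog.000163  (the master's current file)
    -- The slave is 13 files behind: issue FLUSH BINARY LOGS a few times to bring the gap
    -- back under ten files, taking care to always stay behind the master.
    FLUSH BINARY LOGS;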
Summary of Master Promotion
In the following replication deployment, with log-bin=binlog and with log-slave-updates disabled:
       +---+
       | M |
       +---+
         |
    +----+----------------------------------------------------------+
    |                                                                |
    +----+---------+-----------+---------+-----------+---------+----+
         |         |           |         |           |         |
       +---+     +---+       +---+     +---+       +---+     +---+
       | S1| ... | Sn|       | T1| ... | Tm|       | U1| ... | Uo|
       +---+     +---+       +---+     +---+       +---+     +---+
If M fails, we first level the Binlog Servers in the DBSS. Once this is done, and taking T1 as our candidate master, we need to perform the following on it:
- FLUSH BINARY LOGS until the binary log filename follows the last one from the levelled DBSS,
- PURGE BINARY LOGS TO <latest binary log file>,
- RESET SLAVE ALL.
Step #2 above drops all binary logs on the new master that could conflict with the ones from the previous master. The binary logs of the old master are stored on the DBSS, and we must avoid having similar, but misleading, data on the new master.
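Put together, and reusing the file names of the running example (the last file on the levelled DBSS being binlog.000163), the three steps on T1 look like this sketch:

    FLUSH BINARY LOGS;                      -- step #1: repeat until SHOW MASTER STATUS shows binlog.000164
    PURGE BINARY LOGS TO 'binlog.000164';   -- step #2: drop T1's own, conflicting, older binary logs
    RESET SLAVE ALL;                        -- step #3: forget the failed master M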
We now have this:
       +\-/+
       | X |
       +/-\+

    +---------------------------------------------------------------+
    |                                                                |
    +----+---------+---------------------+-----------+---------+----+
         |         |                     |           |         |
       +---+     +---+       +---+     +---+       +---+     +---+
       | S1| ... | Sn|       | T1| ... | Tm|       | U1| ... | Uo|
       +---+     +---+       +---+     +---+       +---+     +---+
We then repoint the DBSS to T1 to get the following:
       +\-/+                 +---+
       | X |                 | T1|
       +/-\+                 +---+
                               |
    +--------------------------+------------------------------------+
    |                                                                |
    +----+---------+-----------+---------+-----------+---------+----+
         |         |           |         |           |         |
       +---+     +---+       +---+     +---+       +---+     +---+
       | S1| ... | Sn|       | T2| ... | Tm|       | U1| ... | Uo|
       +---+     +---+       +---+     +---+       +---+     +---+
and we have achieved master promotion without reconfiguring all slaves.
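The repointing of the DBSS itself is conceptually the following; a sketch where the hostname is illustrative and where how the command is issued depends on the Binlog Server version, as discussed below:

    -- On each Binlog Server of the DBSS:
    CHANGE MASTER TO
        MASTER_HOST     = 't1.example.com',   -- hypothetical address of T1, the new master
        MASTER_LOG_FILE = 'binlog.000164',    -- the file following the old master's last one
        MASTER_LOG_POS  = 4;                  -- the very beginning of that file
    START SLAVE;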
A Cleaner Way
The trick above works well, but performing FLUSH BINARY LOGS in a loop is not the cleanest of solutions. It would be much better if there were a way to set the binary log to the desired filename in a single operation. With this idea in mind, we created feature requests for both MySQL and MariaDB. MariaDB 10.1.6 already implements a RESET MASTER TO syntax; let's hope that Oracle will provide something similar in MySQL 5.7.
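With that syntax, the FLUSH BINARY LOGS loop of step #1 collapses into a single statement; a sketch on MariaDB 10.1.6 or later, with the file number of the running example (note that, like RESET MASTER, it also deletes the existing binary logs, which should take care of step #2 at the same time):

    RESET MASTER TO 164;   -- the next binary log created will be binlog.000164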
What about the Software?
This idea and procedure are all well and good, but they are not very useful if you cannot use them yourself. The currently available version of the Binlog Server, the MaxScale Binlog Router plugin, does not yet implement all the configuration hooks needed to make this procedure easy. Booking.com is currently working with MariaDB to implement the missing hooks in a new version of MaxScale. We are in the last testing phase of a Binlog Router plugin that supports the following:
- STOP SLAVE, START SLAVE, SHOW MASTER STATUS, SHOW SLAVE STATUS and CHANGE MASTER TO: these new commands allow easier configuration of the Binlog Server.
- The CHANGE MASTER TO command not only makes it easy to chain Binlog Servers, but also allows bootstrapping a Binlog Server without editing the configuration file. Moreover, this command allows repointing MaxScale to a new master at binary log filename N+1, effectively enabling master promotion.
- Transaction safety: when the master fails, the Binlog Server could have downloaded a partial transaction. If we replace the master with a slave, this transaction should not be sent to the slaves, so this feature of the next version of MaxScale makes sure such partial transactions are not sent downstream.
- DBSS identity: the initial design of the Binlog Server was intended to impersonate the master, and did not consider swapping the master at the top of the hierarchy. In a DBSS deployment, swapping the master should not be visible to the slaves, so the Binlog Servers should present slaves with a server_id and server_uuid different from those of the master. The next version of the MaxScale Binlog Router supports that virtual master feature.
This next version of the MaxScale Binlog Router will be generally available once we are done with the testing. Stay tuned on the MariaDB web site for the announcement and the failover procedure. In the meantime, you can still experiment with master promotion without reconfiguring all slaves by using the current version of MaxScale and following this proof of concept procedure.
If you are interested in this topic and would like to learn more, I am giving a talk about Binlog Servers at Percona Live Amsterdam. Feel free to grab me after the talk, catch me at the Booking.com booth (#205) or share a drink with me at the Community Dinner, to exchange thoughts on this subject. (You can also post a comment below.)
I will also be giving a talk about Binlog Servers at Oracle Open World in San Francisco at the end of October.
One last thing: if you want to know more about other cool things we do at Booking.com, I suggest you come to our other talks at Percona Live Amsterdam in September:
- The Virtues of Boring Technology
- Combining Redis and MySQL to store HTTP cookie data
- Events storage and analysis with Riak at Booking.com
- Encrypted MySQL Backups and instant recoverability on large scale
- Unicode and MySQL
- Managing and Visualizing your replication topologies with Orchestrator
- Your Clone Army: Better scalability through more database servers
- Riding the Binlog: an In-Depth Dissection of the Replication Stream
[1] Slave levelling can be done with MHA, with MySQL 5.6 or MariaDB 10.0 GTIDs, or with Pseudo-GTIDs when using earlier versions of MySQL and MariaDB.
[2] If we were not concerned about WAN bandwidth, all Binlog Servers could be directly connected to the master. Another solution could be to connect all master-local Binlog Servers directly to the master and to use the chained strategy for remote Binlog Servers. (This hybrid deployment could be well-suited to a semi-sync deployment, but I am diverging from the subject of this post.)
[3] The same can be achieved when using log-slave-updates, by using a smaller max_binlog_size on the master than on all the slaves.
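As an illustration of footnote [3], a sketch with arbitrary values (in practice these would be set in the server configuration files rather than at runtime):

    -- On the master: rotate binary logs every 100 MB.
    SET GLOBAL max_binlog_size = 100 * 1024 * 1024;
    -- On every slave: rotate only every 1 GB, so the slaves stay behind in file numbers.
    SET GLOBAL max_binlog_size = 1024 * 1024 * 1024;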