(The previous post, Better Parallel Replication for MySQL, is Part 1 of the series.) Parallel replication is a long-awaited feature of MySQL. It is already available in MariaDB 10.0 and in MySQL 5.7. This post presents a very nice side effect of the MariaDB implementation: Slave Group Commit.
(If you are not familiar with parallel replication and its implementation, read the previous post in this series to better understand this post.)
To allow better parallelization on slaves, MariaDB 10.0 and MySQL 5.7 have parameters that control the commit group sizes on the master. Those parameters are binlog_commit_wait_count / binlog_commit_wait_usec in MariaDB and binlog_group_commit_sync_delay / binlog_group_commit_sync_no_delay_count in MySQL. The purpose of those parameters is to delay the commit of transactions on the master to allow more transactions to join a commit group. This increases the commit group size, and thus the number of transactions that can be run in parallel on slaves. Those parameters can be seen as a way to speed up slaves at the expense of slowing down the master.
When using those parameters on a MariaDB slave with parallel replication enabled (slave_parallel_threads > 1) and when this slave has a MariaDB 10.0 master running a single-threaded workload, transactions will be run sequentially with their commits delayed to try grouping transactions, hence achieving group commit on the slave.
Let's take the previous sentence one element at a time and explain each step in more detail.
In the replication hierarchy shown below:
- X is the master (any of MySQL 5.5, MySQL 5.6, MariaDB 5.5 or MariaDB 10.0)
- Y is an intermediate master running MariaDB 10.0
- Z is a slave running MariaDB 10.0 with binary logs and log-slave-updates enabled

-----     -----     -----
| X | --> | Y | --> | Z |
-----     -----     -----
The Y intermediate master is configured with the following parameters (single-threaded slave):

SET GLOBAL slave_parallel_threads = 0;
SET GLOBAL binlog_commit_wait_count = 0;
SET GLOBAL binlog_commit_wait_usec = 0;

And the Z slave is configured with the following parameters:

SET GLOBAL slave_parallel_threads = "value greater than 1";
SET GLOBAL binlog_commit_wait_count = "value greater than 1";
SET GLOBAL binlog_commit_wait_usec = "value greater than 0";
Let's follow two transactions (T1 and T2) on Y and Z.

On Y, the transactions are executed sequentially as shown below (B for begin and C for commit).

    ------Time----->
T1: B----C
T2:       B----C
Once those are in the binary logs of Y, Z will execute them. As Y is single-threaded, Z cannot run the transactions in parallel (T1 and T2 are each in their own commit group in the binary logs of Y). However, as Z is delaying commit (binlog_commit_wait_count > 1 and binlog_commit_wait_usec > 0), T2 will start executing in another thread once T1 is ready to commit (slave_parallel_threads > 1). If T2 completes before the delay expires, T1 and T2 will group commit as shown below.

    -----Time----->
T1: B---- . . C
T2:      B----C
Group committing is not limited to two transactions; it can extend to the transactions that follow and can result in much larger groups as shown below.

    ----------Time---------->
T1: B-- . . . . . . . . C
T2:    B----- . . . . . C
T3:          B----- . . C
T4:                B----C
Transaction grouping will stop in the following three situations:
1. The group size reaches binlog_commit_wait_count: grouping stops and all transactions get committed together.
2. binlog_commit_wait_usec expires: grouping stops, and T1 and all the following completed transactions commit together.
3. While the commit of T1 is delayed, a subsequent transaction is blocked by a lock held by a previous transaction that got delayed at the commit stage: a transaction dependency has been found and the group commit pipeline will stall until binlog_commit_wait_usec expires (MariaDB 10.0.16).
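To make the stopping rules concrete, here is a minimal Python sketch of the grouping behaviour described above. It is a toy model and not MariaDB's implementation: the function name simulate_groups, the microsecond durations and the blocked set (transactions assumed to hit a lock held by an earlier transaction of the same group) are all assumptions made for illustration. Each transaction is assumed to start when the previous one is ready to commit, and the delay is counted from the moment the first transaction of a group is ready to commit.

```python
def simulate_groups(durations_us, wait_count, wait_usec, blocked=frozenset()):
    """Return the commit groups as lists of transaction indices.

    durations_us: execution time of each transaction, in microseconds.
    wait_count:   group size at which the group commits immediately.
    wait_usec:    maximum delay after the group leader is ready to commit.
    blocked:      indices of transactions assumed to depend on an earlier
                  transaction of the same group (hypothetical input).
    """
    groups = []
    i = 0
    n = len(durations_us)
    while i < n:
        group = [i]   # the leader is ready to commit, the delay starts now
        elapsed = 0   # time spent waiting since the leader became ready
        i += 1
        while i < n and len(group) < wait_count:   # situation 1
            if i in blocked:
                break   # situation 3: dependency found, wait for the commit
            if elapsed + durations_us[i] > wait_usec:
                break   # situation 2: the delay expires first
            elapsed += durations_us[i]
            group.append(i)
            i += 1
        groups.append(group)
    return groups


if __name__ == "__main__":
    # Six transactions of 100 microseconds each, groups capped at 4.
    print(simulate_groups([100] * 6, wait_count=4, wait_usec=1000))
```

In this model, a transaction that does not make it into a group simply becomes the leader of the next one, which is a simplification of what a real slave does.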
Situation 3 above must be understood clearly. Each time a transaction dependency is found, the slave stops applying transactions until binlog_commit_wait_usec expires. This means that if transaction dependencies are frequent and binlog_commit_wait_usec is too big, the slave throughput will suffer. This could result in slave lag as shown below (T3 depends on T1 or T2).

    ----------------Time----------->
T1: B-- . . . . . . . . C
T2:    B----- . . . . . C
T3:                      B-----C
This could be optimized in one of the next versions of MariaDB by detecting dependencies and triggering the commit immediately instead of waiting [1].
(Note: in MariaDB 10.1, slave group commit can be enabled with the minimal slave_parallel_mode.)
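The cost of situation 3 can be estimated with a rough upper bound. The sketch below assumes that every dependency stalls the slave for the full binlog_commit_wait_usec, which is the worst case since the delay is already running when the dependency is found. The helper name max_stall_fraction and the input values are hypothetical.

```python
def max_stall_fraction(dependencies_per_sec, wait_usec):
    """Worst-case fraction of each second the slave spends stalled.

    Assumes each dependency costs a full wait_usec (an upper bound).
    """
    return min(1.0, dependencies_per_sec * wait_usec / 1_000_000)


if __name__ == "__main__":
    # Illustrative values: a 40,000 microsecond delay.
    for deps in (1, 10, 25):
        print(deps, max_stall_fraction(deps, 40_000))
```

Under this assumption, a delay of 40,000 microseconds with 25 dependencies per second could keep the slave stalled all the time, which is why the delay needs to be chosen with the workload in mind.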
Back to group committing on slaves, it allows the following:
1. Identifying transaction parallelism on a slave.
2. Speeding up the slave by reducing the number of disk syncs needed to write the binary logs when strong durability constraints are set (sync_binlog = 1 and innodb_flush_log_at_trx_commit = 1).
With reference to 1 above, this way of identifying parallelism will be the starting point of the next post in this series. Part 3 of the series will present speedups of parallel applying on slaves using a Booking.com production workload. Moreover, you now know how the graphs presented in part 1 of the series were obtained.
With reference to 2 above, this could be a solution to a problem we are facing at Booking.com. To understand this problem, let's look at a typical Booking.com MySQL deployment:
-----
| M |
-----
  |
  +------+- ... -+---------------+
  |      |       |               |
-----  -----   -----           -----
| S1|  | S2|   | Sn|           | M1|
-----  -----   -----           -----
                                 |
                                 +- ... -+
                                 |       |
                               -----   -----
                               | T1|   | Tm|
                               -----   -----

- M is the standard master
- Si are the slaves of this master in the same datacenter
- M1 is an intermediate master in a remote datacenter
- Tj are the slaves of this intermediate master in the remote datacenter
To be able to survive the hardware failure of M1, the database is hosted on shared disks. Such shared disks could be implemented using DRBD or using filer-based storage accessed over fiber channel (SAN) or iSCSI (LAN). To be able to recover the database (and have the Tj slaves recover replication), high durability constraints need to be set on M1 (sync_binlog = 1 and innodb_flush_log_at_trx_commit = 1).
However, high durability constraints on a single-threaded slave mean that, after each transaction, a sync to the binary logs and a sync to the REDO logs are required. When a sync takes 1 millisecond (network round-trip), we cannot run more than 500 (or 1000 [2]) transactions per second.
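The arithmetic behind this ceiling can be sketched in a few lines of Python. This is a back-of-the-envelope model, not a measurement: max_tps is a hypothetical helper, it ignores the time spent executing the transactions, and the group size of 10 in the last call is an arbitrary illustrative value.

```python
def max_tps(sync_ms, syncs_per_commit, trx_per_commit=1):
    """Upper bound on transactions per second when commits are sync-bound.

    sync_ms:          duration of one sync, in milliseconds.
    syncs_per_commit: syncs needed per commit (binary logs, REDO logs).
    trx_per_commit:   transactions sharing one commit (group size).
    """
    commits_per_sec = 1000 / (sync_ms * syncs_per_commit)
    return commits_per_sec * trx_per_commit


if __name__ == "__main__":
    print(max_tps(1, 2))      # two syncs per transaction
    print(max_tps(1, 1))      # one sync per transaction
    print(max_tps(1, 2, 10))  # two syncs shared by a group of 10
```

The last call shows why group commit helps here: the cost of the syncs is shared by all the transactions of a group, so the bound scales with the group size.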
The other leaf slaves (Si and Tj) do not have this limitation as their databases are hosted on the local disks with a battery backed-up write cache on the RAID controller (syncs are in fact just writes in the controller memory, so they are less expensive). Moreover, those slaves do not need binary logs and we can relax durability on them (innodb_flush_log_at_trx_commit = 0 or 2).
So M1 is an embarrassing bottleneck in the replication, and bursts of transactions on M occasionally cause replication delay in the remote datacenter.
Slave group commit could be a solution to alleviate this bottleneck, but would this work in real life? To validate that, we did the following test:

-----     -----     -----
| A | --> | B | --> | C |
-----     -----     -----

- A is our true production master running MySQL 5.6
- B is an intermediate master running MariaDB 10.0.16 without parallel replication enabled (slave_parallel_threads = 0)
- C is a MariaDB 10.0.16 slave with binary log and log-slave-updates enabled, and with high durability enforced (sync_binlog = 1 and innodb_flush_log_at_trx_commit = 1)
The database on C is hosted on local disks with a battery backed-up write cache on the RAID controller. We ran the C slave with the following configuration:
- Until 17:05 in the graphs below, C is running in single-threaded mode (slave_parallel_threads = 0)
- At 17:05, slave group committing is enabled (slave_parallel_threads = 20, binlog_commit_wait_count = 15 and binlog_commit_wait_usec = 40000)
- At 17:10, the write cache of the RAID controller is disabled (the slave is still group committing)
- At 17:15, group committing is disabled (slave_parallel_threads = 0, binlog_commit_wait_count = 0 and binlog_commit_wait_usec = 0)
- Finally at 17:17, the write cache is re-enabled
During those tests, the BINLOG_COMMITS and BINLOG_GROUP_COMMITS global statuses are gathered regularly, giving the following graphs:
As we can see on the graphs above, in single-threaded mode and with the write cache enabled, the slaves are able to keep up with the flow of transactions from the master. When enabling group commit (at 17:05), the slaves are still able to keep up with the flow of transactions (binlog_commit_wait_usec does not cause a delay). We can see small behaviour changes after 17:10 when the write cache is disabled, but three out of four slaves are able to keep up with the flow of transactions (the 3rd graph shows some slowdown because lots of dirty page writes are happening at the same time, but no major delays happen). When the slaves are put back in single-threaded mode and stop group committing (17:15), we can see that their transaction processing capabilities are greatly impaired and that they are not able to keep up with the masters (they are committing far fewer transactions than the master).
This shows that slave group commit is a good way of increasing throughput on slaves that have expensive disk sync (no write cache or SAN/LAN-attached storage) when binary logs, log-slave-updates and high durability settings are needed.
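The two counters gathered during the test can also be turned into an average group size. The sketch below assumes that two samples of SHOW GLOBAL STATUS have been taken some seconds apart and loaded into dictionaries; the helper name group_commit_stats and the sample values are hypothetical, only the status names Binlog_commits and Binlog_group_commits come from MariaDB.

```python
def group_commit_stats(before, after, interval_s):
    """Derive commit rates and average group size from two status samples.

    before, after: dicts holding the Binlog_commits and
                   Binlog_group_commits counters at two points in time.
    interval_s:    seconds elapsed between the two samples.
    """
    commits = after["Binlog_commits"] - before["Binlog_commits"]
    groups = after["Binlog_group_commits"] - before["Binlog_group_commits"]
    return {
        "commits_per_sec": commits / interval_s,
        "group_commits_per_sec": groups / interval_s,
        # One group commit per transaction means no grouping at all.
        "avg_group_size": commits / groups if groups else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical samples taken 60 seconds apart.
    before = {"Binlog_commits": 100, "Binlog_group_commits": 100}
    after = {"Binlog_commits": 1300, "Binlog_group_commits": 400}
    print(group_commit_stats(before, after, 60))
```

An average group size close to 1 means the slave is effectively committing one transaction at a time; larger values mean that several transactions share the same syncs.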
You might have noticed that the B intermediate master does not play a very important role in these tests. It is still needed as the parallel replication code path in MariaDB 10.0.16 is only used when the master is also running MariaDB 10.0. This restriction could be lifted in one of the next versions of MariaDB and then B could be removed.
This concludes this post on slave group commit. The next post will present speedup results of MariaDB 10.0 slaves running parallel replication on a true production workload (with parallelism identified by a group committing intermediate master).
One last thing: if you want to know more about the cool things we do with MySQL at Booking.com and if you are at Percona Live Santa Clara in April, we are giving three talks:
- Booking.com: Evolution of MySQL system design
- Binlog Servers at Booking.com
- Pseudo GTID and easy replication management
You can also come to meet us at booth 315. We are looking for smart people to work with us on MySQL and help us to solve hard problems. See our careers page for more details.
[1] After completing this post, it was brought to our attention that the not yet released MariaDB versions 10.0.18 and 10.1.4 include MDEV-7249, which implements skipping the delay when a transaction dependency is found.
[2] 500 transactions if two syncs are needed per transaction, 1000 transactions if only one is needed: the exact details are not obvious. If you know more about that, please post a comment below.