IO::Socket::Timeout: socket timeout made easy

Without network operations, running a website for booking accommodation online would be nearly impossible. Network operations can be anything from simple actions like talking to a browser with a user at the other end, to providing an API for our affiliates, or even writing the internal services that help maintain our systems.

Network operations are everywhere, and these are only a few examples of where we use them.

What is a network socket

Network communication typically happens through the use of sockets.

A network socket is one of the core software components of all modern operating systems. It is a software interface for network services provided by the operating system, and it gives us a uniform way of opening a connection to such a service and of sending and receiving data.

Network sockets are used everywhere in the IT world, and they allow us to communicate between different hosts or different programs on the same host. Despite having different kinds of network services (TCP and UDP being prominent examples), network sockets provide a common way to interact with them.

Here is an example of interacting with the Google.com website using IO::Socket::INET, the standard Perl socket module. (IO means Input/Output, and INET means Internet.)

# example 1
my $socket = IO::Socket::INET->new('google.com:80');
print {$socket} "GET / \n";
my $html = join '', <$socket>;

IO::Socket::INET is mostly used through its object-oriented interface. The following example performs the same operations as the previous one, but in an object-oriented way:

# example 2
my $socket = IO::Socket::INET->new(
    PeerHost => 'www.booking.com',
    PeerPort => 80,
);
$socket->print("GET / \n");
my $html = join '', $socket->getlines();

Why socket timeouts are important

At Booking.com, handling requests in a timely manner is critical to the user experience, to operations, and ultimately to our business. To achieve speed and low latency, our platform involves many different subsystems, and resources are constantly being requested from them.

It is essential that these systems reply quickly. It is also vital that we detect when one of them isn't replying fast enough, so that it can be flagged and a mitigation strategy found (such as using an alternative subsystem).

We use Redis in a lot of places: as a cache layer, as a queue system, and for storage. Redis offers very low latency, and we make use of this feature. However, with sockets, we don't always know immediately when a connection has been lost. Knowing that a Redis server is unreachable 30 seconds after the fact is 30 seconds too late: our goal is to know this in under a second. For other cases it might be possible (or even mandatory) to allow a longer timeout. It really depends on the subsystems involved as well as their usage.

Most of the time these subsystems are queried over a network socket, so being able to detect that a subsystem is unreachable requires that the sockets provide a way to specify timeouts.

This is why a fast and reliable platform relies on sockets that support timeouts. Using a socket involves three main steps: connecting to the external server, reading and writing data from and to it, and, at some point, closing the connection. A socket timeout implementation should, at the very least, allow setting a timeout on the connection step and on both the reading and writing steps.

Connection timeout

IO::Socket provides a timeout method, and IO::Socket::INET provides a Timeout option. The Timeout option can be used to set a timeout on the connection to the server. For example, this is how we connect to a local HTTP server on port 80 with a connection timeout of 3 seconds:

my $socket = IO::Socket::INET->new(
    PeerHost => '127.0.0.1',
    PeerPort => 80,
    Timeout  => 3,
);

So far so good, but how do we deal with read or write timeouts? What if the server accepts the connection, but then at some point stops communicating? The client socket needs to realize this quickly. We need timeouts for this.

Read/Write timeouts via setsockopt

It is relatively easy to change the options of a socket to set these timeouts. This is an example that works on GNU/Linux, given $timeout in (optionally fractional) seconds:

my $seconds  = int($timeout);
my $useconds = int(1_000_000 * ($timeout - $seconds));
my $timeout  = pack('l!l!', $seconds, $useconds);
$socket->setsockopt(SOL_SOCKET, SO_RCVTIMEO, $timeout);
# then use $socket as usual

The only problem is that this only works on some architectures and operating systems; a generic solution is better. Let's look at the available options on systems that do not support setsockopt.

Read/Write timeouts via select

Another more portable (albeit slower) way to simulate a timeout on a socket is to check if the socket is readable/writable with a timeout in a non-blocking way. select(2) can do this, and the Perl select() function can provide access to it.

Here is a simplified version of a function that returns true if we can read the socket with the given timeout:

sub _can_read {
    my ($file_desc, $timeout) = @_;
    vec(my $fdset = '', $file_desc, 1) = 1;
    my $nfound = select($fdset, undef, undef, $timeout);
    return $nfound;
}
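
For instance, a hypothetical helper could be built on top of it to read from a socket only if it becomes readable in time. This is just a sketch: the name read_with_timeout and the 4096-byte buffer size are arbitrary choices, and error handling is omitted.

sub read_with_timeout {
    my ($socket, $timeout) = @_;
    # give up if the socket is not readable within $timeout seconds
    _can_read(fileno($socket), $timeout)
      or return;
    # the 4096-byte buffer size is an arbitrary choice for this sketch
    sysread($socket, my $buf, 4096)
      or return;
    return $buf;
}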

Using an external library

Yet another way is to use external modules or system calls, like poll (via IO::Poll), epoll, libevent, or libev. To simplify things, it's common to use higher-level event-based modules like AnyEvent and POE. They make it easy to specify a timeout for any IO (Input/Output) operation.

This is an example using AnyEvent, which sets a connection timeout of 0.5 seconds and a read or write timeout of 0.01 seconds:

my $handle = AnyEvent::Handle->new(
    connect    => [ $host, $port ],
    on_prepare => sub { 0.5 },
    # ...
);
$handle->on_timeout(sub { say 'timeout occurred' });
$handle->timeout(0.01);

While completely valid, this approach applies only to programs that use these event-based modules; it is of no help to standard imperative programs. We need a way to provide timeout features through the standard socket API, without requiring the program to adopt an event loop.

Provide a nice API

Let's step back for a moment. We have two ways to setup a timeout on a socket:

  • A one-time setting on the socket with setsockopt.
  • A change to the way we interact with the socket with select.

We need to abstract these two ways of setting timeouts behind a simple and easy-to-use API. Let's consider this example:

my $socket = IO::Socket::INET->new(...);
print {$socket} 'something';

(Please note that we don't use object-oriented notation on the socket.)

What we want is an easier way to set timeouts on the $socket. For example, like this:

my $socket = IO::Socket::INET->new(...);

# set timeouts
$socket->read_timeout(0.5);

# use the socket as before
print {$socket} 'something';

# later, get the timeout value
my $timeout = $socket->read_timeout();

When using setsockopt

If we can use setsockopt, setting the timeout using ->read_timeout(0.5) is easy. It can be implemented as a method added to the IO::Socket::INET class, possibly by using a Role.

This method would just fire setsockopt with the right parameters, and save the timeout value into $socket for later retrieval. Then we can carry on using $socket as before.

One subtlety is that, because the $socket is not a classic hash reference instance, but an anonymous typeglob on a hash reference, instead of doing $socket->{ReadTimeout} = 0.5 we need to do ${*$socket}{ReadTimeout} = 0.5 ... but that's just an implementation detail.
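
To give a rough idea, such a method could look like the simplified sketch below. This is not the actual IO::Socket::Timeout code; it just combines the setsockopt call seen earlier with storage of the value in the hash attached to the typeglob.

use Socket qw(SOL_SOCKET SO_RCVTIMEO);

sub read_timeout {
    my ($socket, $timeout) = @_;
    if (defined $timeout) {
        # fire setsockopt with the timeout packed as a struct timeval
        my $seconds  = int($timeout);
        my $useconds = int(1_000_000 * ($timeout - $seconds));
        $socket->setsockopt(SOL_SOCKET, SO_RCVTIMEO,
                            pack('l!l!', $seconds, $useconds))
          or die "setsockopt: $!";
        # save the value for later retrieval
        ${*$socket}{ReadTimeout} = $timeout;
    }
    return ${*$socket}{ReadTimeout};
}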

When using select

If, however, the program is running in a situation where setsockopt can't be used, we have to resort to the select approach. That poses a problem: because we're not using object-oriented programming, the operation on the socket is not done via a method we could easily override, but directly with the built-in function print.

Overriding a core function is not good practice, for various reasons. Luckily, Perl provides a clean way to implement custom behavior in the IO layer.

PerlIO layers

Perl's input/output mechanism is based on a system of layers, documented in the perliol man page.

What's the PerlIO API? It's a stack of layers that live between the system and the perl generic file-handle API. Perl provides core layers (such as :unix, :perlio, :stdio, and :crlf). It also provides extension layers (such as :encoding and :via).

These layers can be stacked and removed in order to provide more features (when layers are added) or more performance (when layers are removed).

The huge benefit is that no matter which layers are set up on a file handle or socket, the API doesn't change and the read/write operations are the same. Calls to them will go through the layers attached to the handle until they potentially reach the system calls.

Here is an example:

open my $fh, '<', 'filename';

# for direct binary non-buffered access
binmode $fh, ':raw';

# specify that the file is in utf8, and enforce validation
binmode $fh, ':encoding(UTF-8)';

my $line = <$fh>;

The :via layer is a special layer that allows anyone to implement a PerlIO layer in pure Perl. In contrast to implementing a PerlIO layer in C, using the :via layer is rather easy: it is just a Perl class with some specific methods. The name of the class is given when setting the layer:

binmode $fh, ':via(MyOwnLayer)';

Many :via layers already exist. They all start with PerlIO::via:: and are available on CPAN. For instance, PerlIO::via::json will automatically and transparently decode and encode the content of a file or a socket from and to JSON.

Back to the problem. We could implement a :via layer that makes sure that read and write operations on the underlying handle are performed within the given timeout.

Implementing a timeout PerlIO layer

A :via layer is a class whose name should start with PerlIO::via:: and which implements a set of methods, like READ, WRITE, PUSHED, and POPPED (see the PerlIO::via documentation for more details).

Let's take the READ method as an illustration. This is a very simplified version. The real version handles EINTR and other corner cases.

package PerlIO::via::Timeout;

sub READ {
    my ($self, $buf, $len, $fh) = @_;
    my $fd = fileno($fh);
    # we use the same can_read as previously
    can_read($fd, $timeout)
      or return 0;
    return sysread($fh, $buf, $len, 0);
}

The idea is to check if we can read from the file descriptor, using select, within the given timeout. If not, return 0. If yes, call the normal sysread operation. It's simple and it works great.

We've just implemented a new PerlIO layer using the :via mechanism! A PerlIO layer works on any handle, including files and sockets. Let's try it on a file handle:

use PerlIO::via::Timeout;
use Errno qw(ETIMEDOUT);

open my $fh, '<:via(Timeout)', 'foo.html';
my $line = <$fh>;
if (!defined $line && 0 + $! == ETIMEDOUT) {
    # timed out reading
    ...
}
else {
    # we read one line fast enough, success!
    ...
}

I'm sure you can see that there is an issue in the code above: at no point do we set the read timeout value. The :via pseudo-layer doesn't allow us to easily pass a parameter at layer creation. Technically we can, but we would not be able to change the parameter afterwards. If we want to be able to set, change, or remove the timeout on the handle at any time, we need to somehow attach this information to the handle, and we need to be able to change it.

Add properties to a handle using inside-out OO

A handle is not an object. We can't just add a new timeout attribute to a handle and then set or get it.

Luckily, the moment a handle is opened it receives a unique ID: its file descriptor. A file descriptor is not unique forever, because file descriptors are recycled and reused. Yet, if we know when a handle is opened and closed, we can be sure that between those two events its file descriptor uniquely identifies it.

The :via PerlIO layer allows us to implement PUSHED, POPPED, and CLOSE. These functions are called when the layer is added to the handle, when it's removed, and when the handle is closed. We can use these functions to detect if and when to consider the file descriptor as a unique ID for the given handle.

We can create a hash table as a class attribute of our new layer, where the keys are file descriptors and the values are sets of properties on the associated handles. This is essentially a basic implementation of inside-out OO, where the object is not its own data structure but only an ID. Using this hash table, we can associate a set of properties with a file descriptor and set the timeout value when the PerlIO layer is added. Like this:

my %fd_properties;

sub PUSHED {
    my ($class, $mode, $fh) = @_;
    $fd_properties{ fileno($fh) } = { read_timeout => 0.5 };
    # ...
}

By doing the same bookkeeping when the layer is removed, we can also implement a reliable way to associate timeout values with the file handle.
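
As a rough sketch of that idea (the real PerlIO::via::Timeout is more careful about corner cases, and the read_timeout accessor shown here is only illustrative):

# forget the properties when the layer is removed from the handle
sub POPPED {
    my ($self, $fh) = @_;
    delete $fd_properties{ fileno($fh) } if $fh;
}

# get or set the read timeout associated with a handle
sub read_timeout {
    my ($fh, $timeout) = @_;
    my $props = $fd_properties{ fileno($fh) } ||= {};
    $props->{read_timeout} = $timeout if defined $timeout;
    return $props->{read_timeout};
}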

Wrapping up all the bits of code and features, the full package that implements this timeout layer, PerlIO::via::Timeout, is available on GitHub and CPAN.

Implement the API

We now have all the ingredients we need to implement the desired behavior. enable_timeouts_on receives the socket and modifies its class (which should be IO::Socket::INET or a class inheriting from it) to implement these methods (a short usage sketch follows the list):

  • read_timeout: get/set the read timeout
  • write_timeout: get/set the write timeout
  • disable_timeout: switch off timeouts (while remembering their values)
  • enable_timeout: switch back on the timeouts
  • timeout_enabled: returns whether the timeouts are enabled
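
For instance, assuming timeouts have already been enabled on $socket (as shown in the wrap-up example at the end of this post), they could be switched off temporarily around a legitimately slow call. This is just a usage sketch based on the method names listed above:

# temporarily switch timeouts off around a slow operation
$socket->disable_timeout;
my $big_chunk = <$socket>;    # this read may legitimately take a long time
$socket->enable_timeout;
say 'timeouts active again' if $socket->timeout_enabled;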

In order to modify the IO::Socket::INET class in a clean way, let's create a role and apply it to the class. In fact, let's create two roles: one that implements the various methods using setsockopt and another role that uses select (with PerlIO::via::Timeout).

A Role (sometimes known as Trait) provides additional behavior to a class in the form of composition. Roles provide introspection, mutual exclusion capabilities, and horizontal composition instead of the more widely used inheritance model. A class simply consumes a Role, receiving any and all behavior the Role provides, whether these are attributes, methods, method modifiers, or even constraints on the consuming class.

Detailing the implementation of the role mechanism here is a bit out of scope, but it's still interesting to note that, to keep IO::Socket::Timeout lightweight, we don't use Moose::Role or Moo::Role. Instead we apply a stripped-down variant of Role::Tiny, which uses single inheritance from a special class crafted at runtime specifically for the targeted class. The code is short and can be seen here.
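
As a rough illustration of the general idea only, here is how a similar composition could be done with Role::Tiny itself. My::Role::ReadTimeout is a made-up role name, and, again, IO::Socket::Timeout does not actually depend on Role::Tiny.

package My::Role::ReadTimeout;
use Role::Tiny;

sub read_timeout {
    my ($self, $timeout) = @_;
    # ... setsockopt call and bookkeeping as sketched earlier ...
    ${*$self}{ReadTimeout} = $timeout if defined $timeout;
    return ${*$self}{ReadTimeout};
}

package main;
use Role::Tiny ();
use IO::Socket::INET;

my $socket = IO::Socket::INET->new('www.booking.com:80');
# rebless $socket into a class composed of IO::Socket::INET and the role
Role::Tiny->apply_roles_to_object($socket, 'My::Role::ReadTimeout');
$socket->read_timeout(0.5);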

Wrapping it up

Use IO::Socket::Timeout to add read/write timeouts to any network socket created with IO::Socket::INET, on any platform:

# 1. Create a socket as usual
my $socket = IO::Socket::INET->new(...);

# 2. Enable read and write timeouts on the socket
IO::Socket::Timeout->enable_timeouts_on($socket);

# 3. Set up the timeouts
$socket->read_timeout(0.5);
$socket->write_timeout(0.5);

# 4. Use the socket as usual
my $data = <$socket>;

# 5. Profit!

Conclusion

IO::Socket::Timeout provides a lightweight, generic, and portable way of applying timeouts on sockets, and it plays an important role in the stability of the interaction between our subsystems at Booking.com. This wouldn't be possible without Perl's extreme flexibility.

Please be aware that there is a performance penalty associated with implementing IO layers in pure Perl. If you are worried about this, we recommend benchmarking when making the decision on whether to use it.

IO::Socket::Timeout is available on GitHub and CPAN.


Better Parallel Replication for MySQL

Parallel replication is a much-expected feature of MySQL. It’s available in MariaDB 10.0 and in MySQL 5.7. Yet, both lose efficiency when replicating through intermediate masters. In this post, we’ll explain how parallel replication works and why it does not play well with intermediate masters. We’ll also offer a solution (hint: it involves Binlog Servers).

In MySQL 5.5 and MariaDB 5.5, replication is single threaded: on a slave, the previous transaction must complete (commit) before the next transaction can start. This is not the case on a master where many transactions can make progress at the same time. The consequence is that a slave has less transaction processing capabilities than its master (for writes). This means that care must be taken to avoid loading a master with more writes than its slaves can execute. If this is not done well, the slaves will start lagging.

This single threaded implementation is disappointing:

  • Year after year, servers get an increasing number of cores but only one can be used on slaves to apply transactions from the master
  • MySQL (and its storage engines) gets better at using many cores but those improvements do not speed up replication on slaves

On the other hand, running transactions in parallel on a slave is not simple. If done badly, conflicting transactions could be run at the same time, which risks getting different results on master and slave. In order to be able to run transactions in parallel, a way to prevent that from happening must be implemented. For parallel replication in MySQL 5.6, transactions in different schemas are assumed to be non-conflicting and can be run in parallel on slaves. This is a good start but what about transactions in the same schema?

To provide parallel execution of transactions in the same schema, MariaDB 10.0 and MySQL 5.7 take advantage of the binary log group commit optimization. This optimization reduces the number of operations needed to produce the binary logs by grouping transactions: when transactions are committing at the same time, they are written to the binary log in a single operation. But if transactions commit at the same time, then they are not sharing any locks, which means they are not conflicting and can thus be executed in parallel on slaves. So by adding group commit information to the binary logs on the master, the slaves can safely run transactions in parallel.

So far we have:

  • The master identifying potential parallelism with group commit and making this information available to the slaves via the binary logs
  • The slaves executing transactions in parallel using the information provided by the master

What happens when some of those slaves are intermediate masters? What kind of parallelism information will they pass to their slaves? To answer these questions, the first thing that needs to be highlighted is the fact that transactions committing together might have started at different points of time in the past. This situation is illustrated below where T1 and T2 started (B for begin) after T3 and T4, even if they are all committing at the same time (C for commit).

    ----Time---->
T1:     B---C
T2:     B---C
T3: B-------C
T4: B-------C

Because those transactions commit together on the master, they start together on the slaves but do not commit at the same time:

    ----Time---->
T1: B---C
T2: B---C
T3: B-------C
T4: B-------C

This will cause the group of 4 transactions from the master to be split into 2 groups of 2 transactions on slaves. This is not a big problem on leaf slaves but it is a major issue on intermediate masters. As an intermediate master will write those transactions as 2 commit groups in the binary logs, its slaves will not run 4 transactions in parallel but only 2 transactions at a time with a barrier between the groups as shown below.

    ----Time--------->
T1: B---C
T2: B---C
T3:      B-------C
T4:      B-------C

So, when using parallel replication (based on group commit) with transactions of different execution times, intermediate masters show less parallelism to their slaves than what their own master shows to them. This is not just a supposition; it is clearly observed in our test environments.

To test parallel replication at Booking.com, we set up four test environments, each consisting of a master and a slave, where each node runs MariaDB 10.0.16 (we’ve also done similar tests with MySQL 5.7 and found results close to the ones you’ll see below). On each master, we run transactions from production and identify parallelism using group commit (more on the exact methodology in a follow-up post). To monitor parallelism identification, the global statuses BINLOG_COMMITS and BINLOG_GROUP_COMMITS are checked regularly. The graphs below show the collected data (number of commits and group commits) for both master and slave in the four test environments.
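
As an illustration only (this is not our actual monitoring code, and $dsn, $user and $password are placeholders), the average group commit size can be sampled from a slave with a few lines of Perl and DBI:

use DBI;

my $dbh = DBI->connect($dsn, $user, $password, { RaiseError => 1 });

# fetch the two cumulative counters from the slave
my %status = map { $_->[0] => $_->[1] } @{ $dbh->selectall_arrayref(
    "SHOW GLOBAL STATUS WHERE Variable_name IN
       ('Binlog_commits', 'Binlog_group_commits')"
) };

# average group commit size since server start
my $group_size = $status{Binlog_group_commits}
    ? $status{Binlog_commits} / $status{Binlog_group_commits}
    : 0;
printf "average group commit size: %.2f\n", $group_size;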

E1 commits and group commits

E2 commits and group commits

E3 commits and group commits

E4 commits and group commits

For the most part, the master and the slave are executing the same number of transactions (commits). But, the slave is executing more group commits than the master (less is better). If we divide the number of commits by the number of group commits, we get the average group commit size. Those are shown on the following graphs with the number of applier threads on the slave:

E1 group commit sizes

E2 group commit sizes

E3 group commit sizes

E4 group commit sizes

A smaller group size means that, if those slaves were intermediate masters, parallelism would be lost when replicating through them. Looking at the ratios, we can see how bad this is (slave divided by master; a group commit ratio of less than one means that the slave identifies less parallelism than its master). Note: we expect the commit ratio to be 1; if it were not, the slave would be lagging.

E1 commit and group commit ratios

E2 commit and group commit ratios

E3 commit and group commit ratios

E4 commit and group commit ratios

This is very disappointing. In our test environments, using true production workload, intermediate masters divide the parallelism available to their slaves by at least 2 and sometimes by almost 4.

What can be done to improve that? Having all slaves replicating directly from the master could be a solution. But that would increase the load on masters. It would also be impractical for remote site replication because it consumes too much WAN bandwidth as we often have 50 and sometimes more than 100 slaves per site.

People who have read previous posts on this blog might have guessed the solution:

Replace intermediate masters by Binlog Servers.

(Reminder: a Binlog Server sends its slaves exactly the same binary logs it fetched from its master.)

By doing this, all slaves replicate using the same binary logs (the one from the master) and all slaves, either directly connected to the master or replicating from a Binlog Server, see the same parallelism information.

To take full advantage of parallel replication, and as long as intermediate masters are unable to pass this parallelism information on to their slaves, intermediate masters must disappear! The tool that allows us to do that at Booking.com is the Binlog Server.

If you're interested in this topic and would like to learn more, I’ll be giving a talk about Binlog Servers at Percona Live Santa Clara in April. I am looking forward to meeting you there.

Evaluating MySQL Parallel Replication Part 2: Slave Group Commit

(The previous post, Better Parallel Replication for MySQL, is Part 1 of the series.) Parallel replication is a much expected feature of MySQL. It is already available in MariaDB 10.0 and in MySQL 5.7. In this post, a very nice side effect of the MariaDB implementation is presented: Slave Group Commit.

(If you are not familiar with parallel replication and its implementation, read the previous post in this series to better understand this post.)

To allow better parallelization on slaves, MariaDB 10.0 and MySQL 5.7 have parameters that control the commit group sizes on the master. Those parameters are binlog_commit_wait_count / binlog_commit_wait_usec in MariaDB and binlog_group_commit_sync_delay / binlog_group_commit_sync_no_delay_count in MySQL. The purpose of those parameters is to delay the commit of transactions on the master to allow more transactions to join a commit group. This increases the commit group size, and thus the number of transactions that can be run in parallel on slaves. Those parameters can be seen as a way to speed up slaves at the expense of slowing down the master.

When using those parameters on a MariaDB slave with parallel replication enabled (slave_parallel_threads > 1), and when this slave has a MariaDB 10.0 master running a single-threaded workload, transactions will be run sequentially with their commits delayed to try grouping transactions, hence achieving group commit on the slave.

Let's take the previous sentence one element at a time and explain each step in more detail.

In the replication hierarchy shown below:

  • X is the master (any of MySQL 5.5, MySQL 5.6, MariaDB 5.5 or MariaDB 10.0)
  • Y is an intermediate master running MariaDB 10.0
  • Z is a slave running MariaDB 10.0 with binary logs and log-slave-updates enabled
-----     -----     -----
| X | --> | Y | --> | Z |
-----     -----     -----

The Y intermediate master is configured with the following parameters (single threaded slave):

  • SET GLOBAL slave_parallel_threads = 0;
  • SET GLOBAL binlog_commit_wait_count = 0;
  • SET GLOBAL binlog_commit_wait_usec = 0;

And the Z slave is configured with the following parameters:

  • SET GLOBAL slave_parallel_threads = "value greater than 1";
  • SET GLOBAL binlog_commit_wait_count = "value greater than 1";
  • SET GLOBAL binlog_commit_wait_usec = "value greater than 0";

Let's follow two transactions (T1 and T2) on Y and Z. On Y, the transactions are executed sequentially as shown below (B for begin and C for commit).

   ------Time----->
T1:  B----C
T2:        B----C

Once those are in the binary logs of Y, Z will execute them. As Y is single-threaded, Z cannot run the transactions in parallel (T1 and T2 are each in their own commit group in the binary logs of Y). However, as Z is delaying commit (binlog_commit_wait_count > 1 and binlog_commit_wait_usec > 0), T2 will start executing in another thread once T1 is ready to commit (slave_parallel_threads > 1). If T2 completes before the delay expires, T1 and T2 will group commit as shown below.

   -----Time----->
T1:  B---- . . C
T2:       B----C

Group committing is not limited to two transactions, it can extend to the transactions that follow and can result in much larger groups as shown below.

   ----------Time---------->
T1:  B-- . . . . . . . . C
T2:     B----- . . . . . C
T3:           B----- . . C
T4:                 B----C

Transaction grouping will stop in the following three situations:

  1. The group size reaches binlog_commit_wait_count, grouping stops and all transactions get committed together.
  2. binlog_commit_wait_usec expires, grouping stops, T1 and all the following completed transactions commit together.
  3. While the commit of T1 is delayed, a subsequent transaction is blocked by a lock held by one of the delayed transactions; a transaction dependency has been found, and the group commit pipeline stalls until binlog_commit_wait_usec expires (as of MariaDB 10.0.16).

Situation 3 above must be understood clearly. Each time a transaction dependency is found, the slave stops applying transactions until binlog_commit_wait_usec expires. This means that if transaction dependencies are frequent and binlog_commit_wait_usec is too big, the slave throughput will suffer. This could result in slave lag as shown below (T3 depends on T1 or T2).

   ----------------Time----------->
T1:  B-- . . . . . . . . C
T2:     B----- . . . . . C
T3:                       B-----C

This could be optimized in one of the next versions of MariaDB by detecting the dependency and triggering the commit without waiting [1].

(Note: in MariaDB 10.1, slave group commit can be enabled with the minimal slave_parallel_mode.)

Back to group committing on slaves, it allows the following:

  1. Identifying transaction parallelism on a slave.
  2. Speeding up the slave by reducing the number of disk syncs needed to write the binary logs when strong durability constraints are set (sync_binlog = 1 and innodb_flush_log_at_trx_commit = 1).

With reference to 1 above, this way of identifying parallelism will be the starting point of the next post in this series. Part 3 of the series will present speedups of parallel applying on slaves using Booking.com production workload. Moreover, you now know how the graphs presented in part 1 of the series were obtained.

With reference to 2 above, this could be a solution to a problem we are facing at Booking.com. To understand this problem, let's look at a Booking.com typical MySQL deployment:

-----
| M |
-----
  |
  +------+- ... -+---------------+
  |      |       |               |
-----  -----   -----           -----
| S1|  | S2|   | Sn|           | M1|
-----  -----   -----           -----
                                 |
                                 +- ... -+
                                 |       |
                               -----   -----
                               | T1|   | Tm|
                               -----   -----
  • M is the standard master
  • Si are the slaves of this master in the same datacenter
  • M1 is an intermediate master on a remote datacenter
  • Tj are the slaves of this intermediate master in the remote datacenter

To be able to survive the hardware failure of M1, the database is hosted on shared disks. Such shared disks could be implemented using DRBD or using filer-based storage accessed over fiber channel (SAN) or iSCSI (LAN). To be able to recover the database (and have the Tj slaves recover replication), high durability constraints need to be set on M1 (sync_binlog = 1 and innodb_flush_log_at_trx_commit = 1).

However, high durability constraints on a single-threaded slave mean that, after each transaction, a sync to the binary logs and a sync to the REDO logs are required. When a sync takes 1 millisecond (network round-trip), we cannot run more than 500 (or 1000 [2]) transactions per second.

The other leaf slaves (Si and Tj) do not have this limitation as their databases are hosted on the local disks with a battery backed-up write cache on the RAID controller (syncs are in fact just writes in the controller memory, so they are less expensive). Moreover, those slaves do not need binary logs and we can relax durability on them (innodb_flush_log_at_trx_commit = 0 or 2).

So M1 is an embarrassing bottleneck in the replication, and bursts of transactions on M occasionally cause replication delay in the remote datacenter.

Slave group commit could be a solution to alleviate this bottleneck, but would this work in the real life? To validate that, we did the following test:

-----     -----     -----
| A | --> | B | --> | C |
-----     -----     -----
  • A is our true production master running MySQL 5.6
  • B is an intermediate master running MariaDB 10.0.16 without parallel replication enabled (slave_parallel_threads = 0)
  • C is a MariaDB 10.0.16 slave with binary log and log-slave-updates enabled, and with high durability enforced (sync_binlog = 1 and innodb_flush_log_at_trx_commit = 1)

The database on C is hosted on local disks with a battery backed-up write cache on the RAID controller. We ran the C slave with the following configuration:

  • Until 17:05 in the graphs below, C is running in single-threaded mode (slave_parallel_threads = 0)
  • At 17:05, slave group committing is enabled (slave_parallel_threads = 20, binlog_commit_wait_count = 15 and binlog_commit_wait_usec = 40.000)
  • At 17:10, the write cache of the RAID controller is disabled (the slave is still group committing)
  • At 17:15, group committing is disabled (slave_parallel_threads = 0, binlog_commit_wait_count = 0 and binlog_commit_wait_usec = 0)
  • Finally at 17:17, the write cache is re-enabled

During those tests, the BINLOG_COMMITS and BINLOG_GROUP_COMMITS global statuses are gathered regularly giving the following graphs:

E1 commits and group commits

E2 commits and group commits

E3 commits and group commits

E4 commits and group commits

As we can see on the graphs above, in single-threaded mode and with the write cache enabled, the slaves are able to keep up with the flow of transactions from the master. When enabling group commit (at 17:05), the slaves are still able to keep up with the flow of transactions (binlog_commit_wait_usec does not cause a delay). We can see small behaviour changes after 17:10 when the write cache is disabled but three out of four slaves are able to keep up with the flow of transactions (the 3rd graph shows some slow down because lots of dirty page writes are happening at the same time, but no major delays happen). When the slaves are put back in single-threaded mode and stop group committing (17:15), we can see that their transaction processing capabilities are greatly impaired and that they are not able to keep up with the masters (they are committing much less than the master).

This shows that slave group commit is a good way of increasing throughput on slaves that have expensive disk sync (no write cache or SAN/LAN-attached storage) when binary logs, log-slave-updates and high durability settings are needed.

You might have noticed that the B intermediate master does not play a very important role in these tests. It is still needed as the parallel replication code path in MariaDB 10.0.16 is only used when the master is also running MariaDB 10.0. This restriction could be lifted in one of the next versions of MariaDB and then B could be removed.

This concludes this post on slave group commit. The next post will present speedup results of MariaDB 10.0 slaves running parallel replication on true production workload (with parallelism identified by a group committing intermediate master).

One last thing, if you want to know more about the cool things we do with MySQL at Booking.com and if you are at Percona Live Santa Clara in April, we are giving three talks:

You can also come to meet us at booth 315. We are looking for smart people to work with us on MySQL and help us to solve hard problems. See our careers page for more details.

[1] After completing this post, it was brought to our attention that the not yet released MariaDB versions 10.0.18 and 10.1.4 include MDEV-7249, which implements skipping the delay when a transaction dependency is found.

[2] 500 transactions if two syncs are needed per transaction, 1000 transactions if only one is needed: the exact details are not obvious, so if you know more about this, please post a comment below.

About opening doors and asking questions: An interview with Janne Jul Jensen

With more than 80 designer brains at Booking.com we “think” we know quite a lot. However, being a data-loving company, we have figured out that 986 UXers know even more – 986 is the number of members in the Amsterdam UX Meet-up Group!

At Booking.com, professional growth and development take place outside as well as within the office walls. Of course we have training courses and workshops, but in this constantly changing world of the web, training is often out of date almost as soon as it’s set up and ready to be conducted. At Booking.com, learning also takes place when we get out and talk to people, attend talks and presentations given by others in the industry, and open our minds to new ideas.

And opening our minds and our doors is what we did on 7 January when we hosted the Amsterdam UX Meet-up Group. We have worked a lot with this group in the past and are proud to be a well-integrated part of the Netherlands’ Design and UX community.

After a successful 2014 spent networking, sponsoring, speaking, and attending meet-ups and conferences, the Booking.com design team wanted to ensure a great start to 2015. So we invited more than 120 Amsterdam UXers to our head office, along with Janne Jul Jensen, who came all the way from Denmark to speak about “UI in an Agile Process”.

Janne Jul Jensen specializes in usability and interaction design, and applies her expertise primarily to the mobile app projects within Trifork. These include apps for Danske Bank, Bilka, the Roskilde Festival, KMD and DSB. She is a much sought-after speaker by educational institutions, conferences, the public sector and private companies. She also gives courses on her areas of expertise, and is the founder of the Design & Usability user group (in Danish), where peers can meet and learn from each other.

After Janne shared her knowledge with the UX community our designers still had some questions for her. Here is our interview with Janne.

How did you get into design?

I was originally being educated to be a Software Engineer at university, but was beginning to question whether that was the right choice as I thought it was lacking purpose for me. I asked some of my fellow male students why they liked the education, and often the answer was “I think it is fun to tinker with software”. But to me it wasn’t enough of a reason that I myself was having fun. I wanted a bigger purpose with my education than it just being for my own sake. I wanted an education that would have an impact on peoples' lives. So I started looking into how I could combine the two, and discovered that the software engineering degree had a specialization in Human Computer Interaction (HCI), and the rest, as they say, is history. I chose that specialization and have never looked back since.

Compared to a lot of other designers, you have quite an academic background. What was the most important thing that you learned for the practice of design?

Through my academic background I got to equip my UX toolbox with a lot more tools than most other UX’ers, as I also got to try out some of the more specialized and perhaps less applicable tools within UX. This has its advantages when I’m faced with a project that has an unusual problem or a different profile. I also find that because I have done UX research, which has had to be very rigid to achieve the academic validity required, I have a good understanding of where I can scale the various methods back without the results being affected – and if I scale back in a way that affects the results, I know HOW it affects the results.

What is the most stimulating and challenging project you have accomplished?

I have worked on many different projects, all which have had their own challenges. But one of the most stimulating projects I have worked on, was the Danske Bank Mobile Bank project. It was very rewarding to work with a client who puts quality first! Most clients are very preoccupied with cost, and usually cuts and compromises are made within UX, when the budget is under pressure. This has never been the case with Danske Bank, as they value the quality of the app as the primary objective. This allows designers to dream big and to do things the way they should be done, rather than the way they can be done within a tight budget.

Since this was your first experience creating a mobile app, how did you evaluate your agile process to ensure it was effective in this project context, and is there anything you would change?

We were using SCRUM in the project and part of the SCRUM process is the retrospective. We used this part very actively in that we ended each sprint with a retrospective, during which we looked at what had worked well in the process. Based on this, we decided which parts of the process we should keep and which we should change in the next sprint. For those things that didn’t work well, we discussed why, and came up with a new way of dealing with them that we would then try out in the next sprint.

In addition to starting the UX process a sprint ahead of the rest of the development, are there any other tips for embedding UX into the SCRUM more easily?

Another important aspect of embedding UX into SCRUM is to make it explicitly a part of the backlog. You do this by making UX points on the list for each feature (and prioritizing the feature and the UX together)! It is also important to have a product owner who explicitly values and prioritizes UX. Finally, independent of the development process, it is important to have UX as an integral part of your company culture; to have developers who value and respect UX and who will actively seek UX advice; and to have leadership who prioritize UX and find it important.

Do you think that the Agile methodology is better suited to the development of web-based or native apps?

I think it fits both! The world in general is characterized by never being static and being in a state of constant change. Humans learn constantly and are always accumulating experience. Neither of these givens is taken into account by, for example, the waterfall model, as that is based on an assumption that once a requirement specification is written, the scope of the project will never change (and no one will want it to), and that we will not learn anything new in the development process that will change our idea of what should be developed and how. Rarely in a development project is either of these assumptions true.

At Booking.com we love A/B testing, did you do A/B testing for the Danske Bank app?

I am very much in favor of A/B testing as a way to decide between two possible solutions. However, because we don’t do releases very often in mobile banking, we could potentially have two different versions of the app out for quite a long time. We feel that this would create other challenges, so we opted out of A/B tests in this project. Instead we carry out other activities to address any doubts between potential solutions.

How much do you trust user surveys?

User surveys can never tell you the full picture, and it is always important to be aware of the limitations of a survey. That said, if a survey has been well produced and correctly distributed, it can give you very valuable insight, especially as long as you keep in mind what it cannot tell you.

For the Danske Bank app, you used Facebook and other social media to get feedback and ideas from the users. What other ways do you use to find out which new features users want in their apps?

Facebook is not the only way we listen to users. We use the reviews in the App Store and Google Play (yes, someone actually reads those), input from Danske Bank support, all the customer feedback the bank gets regarding the app, surveys and questionnaires we formulate, and we keep an eye on competitors and follow general trends in the field. In general, we keep our eyes and ears open in any relevant context that will inform our knowledge of our users’ needs and wants.

Because the Booking.com app is translated into 40 languages, we deal with big translation challenges at Booking.com. Your initial app was developed for Danske Bank in Denmark, and in the meantime you were building apps for other countries. Do they all share the same code base or are they completely different apps? How did you manage translations?

The Danske Bank Mobile Bank app exists in six versions (Danish, Swedish, Norwegian, Finnish, Northern Irish, and English) and the code base is largely the same for all six apps, with a few exceptions because of a variation in regulations from country to country. Translation is handled using a large spreadsheet that contains translations for every piece of text in the app. Because of this, it is very important that absolutely none of the graphics contains any text whatsoever. One of the challenges that still remain is if, for example, a label on a button will fit that button in any language. The key here is to find the language that tends to have the longest words. In our case, if it will fit in Finnish, the other languages will almost always fit too. That said, there is no shortcut for checking this, you have to go over every screen to be sure.

And as a last question, you are teaching UX – what book would you recommend as a “Must read” to a junior designer?

There are general books and specialized books, and when it comes to someone fairly new in the trade, I would recommend the more general books. These could include:

  • Interaction Design: Beyond Human-Computer Interaction by Jenny Preece and Helen Sharp (Coming out February 2015)
  • Handbook of Usability Testing: How to Plan, Design, and Conduct Effective Tests by Jeffrey Rubin
  • The inmates are running the asylum by Alan Cooper
  • Don’t make me think by Steve Krug
  • Rocket Surgery Made Easy by Steve Krug

Booking.com is getting ready for a UX conference in April, a front-end hackathon in May and another design meet-up in June. These will happen both in and outside our office walls, but everyone’s always welcome. We can only truly learn if we open our doors for more knowledge and experience.

How We Spent Two Days Making Perl Faster

This is a story about a significant new optimization to the Perl interpreter. It is a story about battling code complexity. And it is a story about wanting to have our cake and eat it too.

A recent Booking.com hackathon provided us the opportunity to investigate speeding up integer allocation in the Perl interpreter. If successful, this could optimize nearly every program we run. We discovered that a naive implementation could work, but would make the code a lot more difficult to maintain. Our path led us to attempt to leverage the C preprocessor to improve code clarity while opening doors to real gains in program execution speed.

First, some background

As described in perlguts and PerlGuts Illustrated, representations of variables in Perl are usually composed of two parts: a head struct and (optionally) a body struct. The head contains the essential “internal book-keeping” common to all variables (regardless of type) including a pointer to the (optional) body struct. Below is an image of the layout of fields in a head struct as depicted in PerlGuts Illustrated:

SV Head (PerlGuts Illustrated)

The body struct can be quite different, depending upon the type of the variable. The most simple type of variable is the SvNULL, which represents undef and does not need a body struct at all.

A string (called PV for “pointer value”) has a body type of XPV:

SvPV (PerlGuts Illustrated)

But the body struct of a PV is different from the body struct of a PVNV. A PVNV can hold a floating point number and a string representation of that same value:

SvPVNV (PerlGuts Illustrated)

One benefit of this type of design is that all references to the value point to the memory location of the head. As such, Perl is free to change which memory is being used to represent a value by changing the body struct, without having to update any pointer except the pointer contained within the head.

Changing types

Naturally, Perl has an internal function to convert between variable types. This function is sv_upgrade (“scalar value upgrade”). In essence, whenever we have a variable of some type in Perl (for example, a simple integer) and we need to access it as a different type (for example, a string), sv_upgrade will convert the type of the variable (for example, into a type that contains both integer and string representations of a value). This change may involve replacing the body struct with a larger struct.

To see how sv_upgrade is implemented, we can look at the Perl_sv_upgrade function in sv.c. We see that this function encapsulates a lot of complexity. There are many comments, which describe subtle corner cases. Since it can take a scalar value of essentially any type and convert it to something capable of representing any other type, perhaps it’s unsurprising that there’s a lot of code here.

Without going line-by-line, an overview of how this function is implemented may be useful. First, there is a switch based on the current type of variable, to determine what it needs for the new type. Shortly thereafter, there is a second switch based on the new type. Inside of the second switch block, there are numerous if blocks for doing different things depending on the old type. Finally, after the new body struct has been set up, and the head struct contains all the correct flags, the memory used by the old body is freed.

Still with me? Good.

A naive optimization

The sv_upgrade function is called from a number of places. Not only is it called in places such as printing integers as strings, but it’s also called when assigning an integer value to a previously cleared variable.

A previously cleared variable is always an undef with no body struct. The reason sv_upgrade is called is to allow the correct setup for the body to occur. This was a very sensible design decision, as it centralizes a lot of “internal book-keeping” behavior, rather than duplicating it. The cost of this centralization is performance: some generic (and in this case superfluous) code is executed. For example, when creating a new integer, sv_upgrade will run an unnecessary check for a conversion of a larger type to a smaller type.

Integer assignment to a cleared variable occurs so frequently that one would imagine it might be worth duplicating the required complexity to achieve the performance improvement. Thus we decided to evaluate that trade-off for integer allocation. After a review of the nearly 300 lines of sv_upgrade internals, we saw that we could remove the call entirely if we could “hoist out” just the essential two lines of code! However, there is a very good reason this had not been done before. Let's look at the two lines.

The first line (since we know the new type) is easy:

SvFLAGS(sv) |= new_type;

But the other line is quite complex:

SvANY(sv) = (XPVIV*)((char*)&(sv->sv_u.svu_iv) - STRUCT_OFFSET(XPVIV, xiv_iv));

If your head is swimming after looking at that last line of code, don’t worry: you’re in good company. Granted, this is described in Illustrated perlguts:

Since 5.10 for a raw IV (without PV) the IVX slot is in the HEAD, there is no xpviv struct ("body") allocated. The SvIVX macro abuses SvANY pointer arithmethic to point to a compile-time calculated negative offset from HEAD-1 to sv_u.svu_iv, so that PVIV and IV can use the same SvIVX macro.

But, even once I thought I maybe understood what that complex line of code supposedly did, I made Steffen patiently sit for more than 15 minutes as I convinced myself with paper and pencil that it actually did what he had described. After that, the drawing from Illustrated Perl Guts made more sense to me:

Bodyless SVIV (PerlGuts Illustrated)

What’s more, I finally understood that this complexity exists in order to avoid executing an if statement that would otherwise be called every time the value is retrieved!

Thus, the previous design trade-offs seemed like the right ones. Yes, we could make Perl faster, but at the cost of leaking a lot of complexity into another part of the code. That complexity would make any future development much more difficult.

Having our cake and eating it too

We wanted to encapsulate that very complex bit of code, but without any added runtime performance cost. Since this is C, we looked at using the preprocessor to push the complexity behind a macro, in much the same way that with other languages we might move some hairy lines of code behind a well named function or method:

#define SET_SVANY_FOR_BODYLESS_IV(sv) \
    SvANY(sv) = (XPVIV*)((char*)&(sv->sv_u.svu_iv) - STRUCT_OFFSET(XPVIV, xiv_iv))

And, of course, the advantage of using a macro (rather than a function) is that the cost is paid entirely at compile time, and thus there is zero additional runtime performance lost.

By introducing the macro, the readability of the code would be improved in several places.

So, how did this change our situation? With the macro, the two lines we wanted to “hoist out” looked a lot less complex. All we would need would be a patch which replaces this one computationally heavy function call:

sv_upgrade(dstr, SVt_IV);

with just these two lines:

SET_SVANY_FOR_BODYLESS_IV(dstr);
SvFLAGS(dstr) |= SVt_IV;

By factoring the complexity differently, we opened the doors to a different set of performance/complexity cost trade-offs. We would leak only a small amount of complexity relative to the gain in performance. But is it worth it? With these changes, the cost would be low, but what would the actual benefit be?

Measuring the gain

The micro-benchmark we used is very heavy on exercising one particular code path, however it is a common code path.

$ dumbbench -i 50 --pin-frequency -- \
    ./perl -Ilib -e \
    'for my $x (1..1000) { my @a = (1..2000); }'

Here are the results of the micro-benchmark before the optimization:

  Rounded run time per iteration: 2.4311e-01 +/- 1.4e-04

And here are the results after the optimization:

  Rounded run time per iteration: 1.99354e-01 +/- 5.5e-05

That’s 18% faster.

Results

Overall, this hackathon project was a success.

With these measurements, we demonstrated that the optimization was worthwhile. The complexity cost is pretty close to zero: in some places the Perl core is slightly more complex, but the internals of sv_upgrade are certainly more clear. Additionally, we found there were other similar potential optimizations that could leverage the same technique. In the end, five related patches were pushed to Perl:

When Perl 5.22 ships, in part because of this work, many real-world programs will be faster.

How we designed Booking.com for Business

It’s no secret that Booking.com has a strong data-driven culture. We validate our work through A/B experimentation, allowing millions of customers to have their say in what works best. But quantitative research is not our answer to everything; we adjust our toolset to the problem at hand.

When we set out to build Booking.com for Business, it immediately felt like we were in a startup. Suddenly, the wealth of experimentation data at our disposal wasn’t enough to start building the application for businesses. To kick-off the design work, we needed to know more about our business users’ needs, motivations, and current frustrations. We needed to get out of the building and talk to them.

Initial research

We performed a series of user interviews in several countries with a significant share of business travel. We met with business travellers of all types (from interns to CEOs), along with the people who organise the business trips for them.

We learned that business users have a unique set of needs. Booking a holiday can be a fun pastime on its own, but booking a business trip is part of a job: it needs to be as efficient as possible. There’s also more to business travel than making the booking itself. Companies need an overview of who is going where and how budgets are spent. Existing business travel solutions either don’t satisfy such needs, or they’re expensive and complex.

These and many other insights became the foundation for our work. They were synthesized into a set of user personas that shaped the design of the product as we went on.

Product vision

Armed with knowledge about potential users, we started brainstorming. Our aim was to create a vision of the product that would help our personas accomplish their goals. The outcome of these brainstorms was a list of high-level requirements that were later visualised as a set of wireframes. This tangible representation of our ideas enabled us to have fruitful conversations with stakeholders. The wireframes were high-level enough that we could temporarily set aside details of technical implementation and visual design, but also detailed enough to convey the purpose of each screen.

Minimal viable product (MVP)

The product vision was exciting, but it was just a hypothesis. We didn’t want to spend months implementing something ultimately not useful for customers, and not beneficial to our business. It was important to get the product out there as soon as possible, and to start learning from real world usage. We knew that we already had a great product—Booking.com itself—and we could build on its strengths. With this in mind, we defined the minimal scope that would be sufficient to validate our ideas, and started mapping out the user journey.

Filling the user journey with designs felt like finishing a puzzle. As we progressed, we could see how complete the whole picture was from the design perspective.

However, soon after we started implementation, we noticed a problem. Separate design mockups didn’t provide a true feeling of the user experience. They were static. It wasn’t always clear how the application would respond to user actions, and how one page would transition into another. We found ourselves figuring out these details along the way.

It was also important to get user feedback on design decisions we had made so far. Unfortunately, the actual product was still at the early stage of development, and putting mockups in front of users gave us limited feedback.

Prototype

As a response to these issues, we created an interactive prototype that simulated the end-to-end user flow. For example, it was possible to land on the product page, go through the sign-up process, view transactional emails, experience key application features, sign-out, and sign back in again. In this way, we solved two problems at once: we had created a tool that would better guide product development and procure high-quality user feedback.

We kept it lean and didn’t spend much time creating the prototype. We simply placed mockups in HTML files and connected them with hyperlinks. In-page interactions were triggered by bits of basic JavaScript code that showed images of various UI states on click. For example, when the user clicked on an area of the mockup that had a button, the image changed, simulating the interface's response.

Some design solutions that previously looked good as static mockups didn’t work so well when presented in the dynamic prototype. Acknowledging this helped us to fix design issues before they reached the product. Participants in user testing sessions were also more engaged with the prototype because it felt like a real product.

But prototypes are not without limitations. It’s hard to do full-blown usability tests with them. They may look real, but not every possible scenario is supported, so facilitators need to carefully steer participants. Maintaining the prototype also becomes tedious over time. We tried to make it easier by separating reusable parts like header, footer, navigation, etc. into include files. We accomplished this by using Jekyll, a static website generator.

The good news is that we didn’t have to rely solely on the prototype for long. The product quickly took shape and it soon became possible to put the real thing in front of users.

User feedback

After the product reached the MVP state, it became easier to get user feedback. Although the product wasn’t yet ready to be publicly announced, we were able to start gathering usage data and feedback from early adopters. We also continued usability testing, because the usage data told us what was happening, but often left us wondering why.

Even with the working product at our disposal, we continued using prototypes to fill the gaps during usability tests. We seamlessly integrated feature prototypes into the live product and switched them on specifically for usability session participants. This helped us to establish whether mocked-up ideas were worth implementing, and also to test parts of the application that were still in development.

Usability labs were not our only test environment. We visited company offices and observed how the product was performing in the real world. These office visits were highly valuable, providing us with insight from observation of users in their natural environment. We had a chance to see what tools they used, what workarounds they developed, and how our product would fit their work process. This was absolute gold. Some things that performed well in the lab set-up, failed during office visits.

We saw that our users were working in a very busy environment and were constantly distracted. The time they spent making a decision was very short. Plus, they were sceptical about introducing new tools into their work. All this posed a particular challenge for a crucial step in the user journey: the product page. Our users needed tangible proof that the product would do as it promised, and that it was reliable. They wanted to explore the product before making any commitments.

Product page

The research findings informed new product page designs. But we needed confirmation that they would actually solve the problems we had observed. We opted for remote surveys as a method to gather feedback quickly and on a large scale. This enabled us to cover several markets and bring a quantitative component into the research.

Survey participants were presented with various versions of the page and were asked to click and comment on page elements that stood out to them. Afterwards they were asked a series of questions that helped us gauge how well they understood our offering, and how likely they were to sign up.

It took us a few survey iterations to arrive at a version that proved to work well for our users. Had we gone with one of the new designs without testing, we would have ended up with a sub-optimal page upon product launch.

Now that we had the fully-tested user journey in place, we could reveal the product to the world.

Final thoughts

Fast forward to today. The product is up and running. We can now make decisions through A/B experimentation as there is an established base and a sufficient number of users that continues to grow. If we look back on the process that brought us here, this is what comes to mind:

  • In an environment of uncertainty, it was important to remain open to change. We had to think creatively not just about the product itself, but also about how to get there.
  • We were focused on the user from day one and all the way through. Even when we didn’t yet have the complete product, we used the prototype to get user feedback.
  • By combining quantitative and qualitative research methods, we got the best of both worlds. This was and will remain our recipe to continuously improve the user experience.

Evaluating MySQL Parallel Replication Part 3, Annex: Under the Hood


This is the annex to Evaluating MySQL Parallel Replication Part 3: Benchmarks in Production.

There is no introduction or conclusion to this post, only landing sections: reading this post without its context will probably be very hard. You should start with the main post and come back here for more details.

Environments

As in the two previous posts (Part 1 and Part 2), we are using the same four environments. Each environment is composed of four servers:

-----     -----     -----     -----
| A | --> | B | --> | C | --> | D |
-----     -----     -----     -----
  • A is a true production master running MySQL 5.6,
  • B is an intermediate master running MariaDB 10.0.16 without parallel replication enabled (slave_parallel_threads = 0),
  • C is an intermediate master running MariaDB 10.0.16 with slave group commit enabled (slave_parallel_threads > 1, binlog_commit_wait_count > 1 and binlog_commit_wait_usec > 1),
  • D is the slave where parallel replication tests are run with different parameters.

The slave group commit parameters on C are the following:

Parameters                | E1      | E2      | E3     | E4
slave_parallel_threads    | 40      | 40      | 40     | 40
binlog_commit_wait_count  | 35      | 35      | 35     | 35
binlog_commit_wait_usec   | 100,000 | 100,000 | 50,000 | 150,000

To monitor parallelism identification (group committing) on C, the binlog_commits and binlog_group_commits global statuses are gathered regularly.
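
As an illustration, here is a minimal Perl sketch of how those two statuses could be sampled (a hedged example: the host, credentials, and sampling interval are placeholders, and our actual monitoring scripts are not shown here). Dividing the growth of Binlog_commits by the growth of Binlog_group_commits between two samples gives the average group commit size over the interval.

  use strict;
  use warnings;
  use DBI;

  # Hypothetical connection to the C intermediate master.
  my $dbh = DBI->connect('DBI:mysql:host=server-c', 'monitor', 'secret',
                         { RaiseError => 1 });

  sub binlog_counters {
      my %status = map { lc($_->[0]) => $_->[1] } @{ $dbh->selectall_arrayref(
          "SHOW GLOBAL STATUS WHERE Variable_name IN
             ('Binlog_commits', 'Binlog_group_commits')"
      ) };
      return @status{qw(binlog_commits binlog_group_commits)};
  }

  my ($commits_before, $groups_before) = binlog_counters();
  sleep 60;                                  # illustrative sampling interval
  my ($commits_after, $groups_after) = binlog_counters();

  die "no group commit during the interval\n" if $groups_after == $groups_before;
  printf "average group commit size: %.1f\n",
      ($commits_after - $commits_before) / ($groups_after - $groups_before);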

The D slave was stopped on a Thursday between 17:20 and 17:30 and a backup of the database was taken. Every test starts by recreating the filesystem that hosts the database and restoring the backup on this empty filesystem.

When stopping the database (for backup), the buffer pool was dumped using innodb_buffer_pool_dump_at_shutdown. When starting MariaDB (after backup restoration), innodb_buffer_pool_load_at_startup was used to load the buffer pool. No test was started before the buffer pool was fully loaded.
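
The check that the buffer pool is fully loaded can be done by polling the Innodb_buffer_pool_load_status status variable, for example with a small Perl sketch like the one below (the host, credentials, and polling interval are illustrative; this is not the exact tooling used for the tests).

  use strict;
  use warnings;
  use DBI;

  # Hypothetical connection to the D slave under test.
  my $dbh = DBI->connect('DBI:mysql:host=server-d', 'monitor', 'secret',
                         { RaiseError => 1 });

  # innodb_buffer_pool_dump_at_shutdown and innodb_buffer_pool_load_at_startup
  # are set in the server configuration; here we only wait for the load to finish.
  while (1) {
      my (undef, $load_status) = $dbh->selectrow_array(
          "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_load_status'");
      last if $load_status =~ /completed/i;
      print "buffer pool load in progress: $load_status\n";
      sleep 30;
  }
  print "buffer pool fully loaded, the test can start\n";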

Every test then started by running the D slave until Friday at 06:00 (a little more than 12 hours). This allowed further warming up of InnoDB, including giving work to purge and write threads.

Then, the D slave ran 24 hours of transactions where parallelism was identified using slave group commit on the C intermediate master. The number of commits and the group commit sizes on C for that period are shown in the graphs below for the four test environments.

Graphs # 0: Commits and Group Commit Sizes in the four Environments (E1 to E4)

The D servers have the following properties:

(Values that span several environments appear only once.)

Property          | E1 | E2 | E3 | E4
CPU               | 2x Intel E5-2640 | 2x Intel X5675
RAM               | 96 GB | 192 GB | 96 GB
Disk Controller   | HP P220i | HP P410i
Controller Cache  | 512 MB (25% Read / 75% Write)
Logical Disk      | 838.3 GB (RAID 1) | 1.6 TB (RAID 1+0)
Physical Disks    | 2x SAS 900.1 GB 10K RPM | 6+1x SAS 600 GB 10K RPM
Database Size     | 382 GB | 234 GB | 586 GB | 491 GB
Buffer Pool Size  | 71 GB | 162 GB | 76 GB
Binlog Format     | ROW | STATEMENT | ROW

To monitor the commit rate on D, the COM_COMMIT global status is gathered regularly.

To prevent D from filling up its disk, relay_log_space_limit is set to 4 GB (some of the D servers do not have enough disk space to store all the binary logs of C).

Group Commit: Slave vs. Master

It is briefly mentioned in the main post that slave group commit identifies less parallelism than a true master would. More details are given here.

Let's consider this execution timeline which shows a master where T5 is conflicting with one of the four previous transactions:

   ------Time----->
T1:      B--C
T2:   B-----C
T3:   B-----C
T4:    B----C
T5:  B-- . . --C
T6:       B----C

Using slave group commit and from the binary logs of the timeline above, parallelism can be identified in two groups as shown below.

   ---------------Time---------------->
T1:  B-- . . . . . . . . C
T2:     B----- . . . . . C
T3:           B----- . . C
T4:                 B----C
T5:                       B---- . . C
T6:                            B----C

But if we run the master with delayed commit, we could get the execution timeline below, in which T6 passed T5 to join the first commit group.

   ------Time----->
T1:      B-- . C
T2:   B----- . C
T3:   B----- . C
T4:    B---- . C
T5:  B-- . . . .--C
T6:       B----C

So on a master, a transaction that is blocked in its execution does not prevent other non-conflicting transactions from joining a commit group, but such a blocked transaction does limit parallelism identification with slave group commit. This is why slave group commit is not as efficient at maximizing group size as a parallel execution on the master would be.

Results

As outlined in the main post, our tests are run in the following binary log configurations:

  • Intermediary Master (IM): binary logs and log-slave-updates enabled.
  • Slave with Binary Logs (SB): binary logs enabled and log-slave-updates disabled.
  • Standard Slave (SS): binary logs and log-slave-updates disabled.

And in the following durability configurations:

  • High Durability (HD): sync_binlog = 1 and innodb_flush_log_at_trx_commit = 1.
  • No Durability (ND): sync_binlog = 0 and innodb_flush_log_at_trx_commit = 2 (also described/known as relaxed durability).
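
For reference, each of the six configurations maps to a handful of server variables, and a MariaDB 10.0 slave (like D) can be labelled with a short sketch such as the one below (hedged: the connection details are placeholders, and the ND check is simplified to "anything that is not HD").

  use strict;
  use warnings;
  use DBI;

  my $dbh = DBI->connect('DBI:mysql:host=server-d', 'monitor', 'secret',
                         { RaiseError => 1 });

  my %var = map { lc($_->[0]) => $_->[1] } @{ $dbh->selectall_arrayref(
      "SHOW GLOBAL VARIABLES WHERE Variable_name IN
         ('log_bin', 'log_slave_updates', 'sync_binlog',
          'innodb_flush_log_at_trx_commit', 'slave_parallel_threads')"
  ) };

  # Binary log configuration: SS (no binary logs), IM (log-slave-updates) or SB.
  my $binlog_conf = $var{log_bin} eq 'OFF'           ? 'SS'
                  : $var{log_slave_updates} eq 'ON'  ? 'IM'
                  :                                    'SB';

  # Durability configuration: HD needs both settings at 1; everything else is ND here.
  my $durability = ($var{sync_binlog} == 1 && $var{innodb_flush_log_at_trx_commit} == 1)
                 ? 'HD' : 'ND';

  print "configuration: $binlog_conf-$durability, SPT=$var{slave_parallel_threads}\n";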

For each of those configurations (6 in total: IM-HD, IM-ND, SB-HD, SB-ND, SS-HD, and SS-ND), the tests are run with different values of slave_parallel_threads (SPT). The full results are presented in the following four tables. The times presented are in the format hours:minutes.seconds. Next to the time taken to process 24 hours of transactions, each cell shows in parentheses the speedup achieved over the single-threaded run and, when applicable, the speedup over the previous run.

Normally, the tests are run at least twice and the shorter time is kept. This is relaxed for E1, E3, and E4 in the SS-HD and SS-ND configurations where only one run is performed: the results from E2 show that SS-HD and SS-ND are very similar to SB-HD and SB-ND respectively, and the limited results with E1, E3, and E4 are consistent with this observation. Moreover, again for E1, E3, and E4 in SS-HD and SS-ND and for the same reasons, some SPT values are skipped (the skipped values are shown as N/A in the tables below).

Table # 1: E1 Execution Times and Speedups
SPT | IM-HD                 | IM-ND                 | SB-HD                 | SB-ND                 | SS-HD           | SS-ND
0   | 10:36.02              | 5:33.17               | 9:11.11               | 5:22.29               | 9:11.41         | 5:23.03
5   | 7:23.35 (1.43)        | 5:22.43 (1.03)        | 6:47.38 (1.35)        | 5:15.03 (1.02)        | N/A             | N/A
10  | 6:36.31 (1.60 / 1.12) | 5:18.34 (1.05 / 1.01) | 6:16.49 (1.46 / 1.08) | 5:10.02 (1.04 / 1.02) | 6:16.54 (1.46)  | 5:09.01 (1.05)
20  | 6:14.40 (1.70 / 1.06) | 5:15.23 (1.06 / 1.01) | 6:00.22 (1.53 / 1.05) | 5:07.16 (1.05 / 1.01) | N/A             | N/A
40  | 6:07.04 (1.73 / 1.02) | 5:14.12 (1.06 / 1.00) | 5:53.42 (1.56 / 1.02) | 5:05.36 (1.06 / 1.01) | 5:53.07 (1.56)  | 5:06.10 (1.06)

Table # 2: E2 Execution Times and Speedups
SPT | IM-HD                 | IM-ND                 | SB-HD                 | SB-ND                 | SS-HD                 | SS-ND
0   | 3:01.00               | 1:16.24               | 2:50.58               | 1:11.30               | 2:51.22               | 1:11.28
5   | 1:55.12 (1.57)        | 1:14.07 (1.03)        | 1:40.32 (1.70)        | 1:11.19 (1.00)        | 1:39.47 (1.72)        | 1:10.41 (1.01)
10  | 1:41.58 (1.78 / 1.13) | 1:10.40 (1.08 / 1.05) | 1:28.48 (1.93 / 1.13) | 1:09.40 (1.03 / 1.02) | 1:28.49 (1.93 / 1.12) | 1:08.47 (1.04 / 1.03)
15  | 1:36.12 (1.88 / 1.06) | 1:10.20 (1.09 / 1.00) | 1:25.55 (1.99 / 1.03) | 1:10.01 (1.02 / 1.00) | 1:25.39 (2.00 / 1.04) | 1:09.10 (1.03 / 0.99)
20  | 1:34.15 (1.92 / 1.02) | 1:10.07 (1.09 / 1.00) | 1:24.52 (2.01 / 1.01) | 1:09.03 (1.04 / 1.01) | 1:24.19 (2.03 / 1.02) | 1:08.00 (1.05 / 1.02)
25  | 1:32.34 (1.96 / 1.02) | 1:10.14 (1.09 / 1.00) | 1:23.02 (2.06 / 1.02) | 1:08.53 (1.04 / 1.00) | 1:23.35 (2.05 / 1.01) | 1:09.00 (1.04 / 0.99)
30  | 1:31.27 (1.98 / 1.01) | 1:10.06 (1.09 / 1.00) | 1:23.10 (2.06 / 1.00) | 1:10.03 (1.02 / 0.98) | 1:21.56 (2.09 / 1.02) | 1:08.25 (1.04 / 1.01)
40  | 1:30.15 (2.01 / 1.01) | 1:09.35 (1.10 / 1.01) | 1:22.12 (2.08 / 1.01) | 1:08.32 (1.04 / 1.02) | 1:22.13 (2.08 / 1.00) | 1:08.25 (1.04 / 1.00)
80  | 1:29.49 (2.02 / 1.00) | 1:09.53 (1.09 / 1.00) | 1:21.16 (2.10 / 1.01) | 1:08.39 (1.04 / 1.00) | 1:21.52 (2.09 / 1.00) | 1:08.50 (1.04 / 0.99)

Table # 3: E3 Execution Times and Speedups
SPT | IM-HD                 | IM-ND                 | SB-HD                 | SB-ND                 | SS-HD           | SS-ND
0   | 10:07.42              | 9:19.50               | 9:58.58               | 9:06.20               | 9:58.06         | 9:05.08
5   | 8:45.11 (1.16)        | 8:31.04 (1.10)        | 8:41.47 (1.15)        | 8:24.40 (1.08)        | N/A             | N/A
10  | 8:24.25 (1.20 / 1.04) | 8:20.14 (1.12 / 1.02) | 8:23.07 (1.19 / 1.04) | 8:16.23 (1.10 / 1.02) | 8:22.49 (1.19)  | 8:15.59 (1.10)
20  | 8:12.57 (1.23 / 1.02) | 8:11.05 (1.14 / 1.02) | 8:06.12 (1.23 / 1.03) | 8:05.13 (1.13 / 1.02) | N/A             | N/A
40  | 8:06.17 (1.25 / 1.01) | 8:11.05 (1.14 / 1.02) | 8:07.26 (1.23 / 1.00) | 8:04.31 (1.13 / 1.00) | 8:09.47 (1.22)  | 8:04.30 (1.13)

Table # 4: E4 Execution Times and Speedups
SPT | IM-HD                 | IM-ND                 | SB-HD                 | SB-ND                 | SS-HD           | SS-ND
0   | 7:40.57               | 7:31.37               | 7:43.06               | 7:26.24               | 7:41.10         | 7:23.34
5   | 6:32.18 (1.17)        | 6:32.42 (1.15)        | 6:36.06 (1.17)        | 6:29.52 (1.15)        | N/A             | N/A
10  | 6:15.19 (1.23 / 1.05) | 6:15.16 (1.20 / 1.06) | 6:17.59 (1.23 / 1.05) | 6:13.48 (1.19 / 1.04) | 6:15.36 (1.23)  | 6:11.27 (1.19)
20  | 6:05.46 (1.26 / 1.03) | 6:06.57 (1.23 / 1.02) | 6:05.40 (1.27 / 1.03) | 6:05.48 (1.22 / 1.02) | N/A             | N/A
40  | 6:02.08 (1.27 / 1.01) | 6:00.28 (1.25 / 1.02) | 5:59.30 (1.29 / 1.02) | 5:59.57 (1.24 / 1.02) | 6:02.21 (1.27)  | 5:55.19 (1.25)

Graphs during Tests

If you spot something we might have missed in the graphs below, please post a comment. Those graphs include the number of commits per second, CPU stats and Read IOPS for all environments, for the Slave with Binary Logs configuration (log-slave-updates disabled), in both durability settings (high and no/relaxed).

Graphs # 1a: E1 Stats - Slave with Binary Logs - High Durability (Commits, CPU, Read IOPS)

Graphs # 1b: E1 Stats - Slave with Binary Logs - Relaxed Durability (Commits, CPU, Read IOPS)

Graphs # 2a: E2 Stats - Slave with Binary Logs - High Durability (Commits, CPU, Read IOPS)

Graphs # 2b: E2 Stats - Slave with Binary Logs - Relaxed Durability (Commits, CPU, Read IOPS)

Graphs # 3a: E3 Stats - Slave with Binary Logs - High Durability (Commits, CPU, Read IOPS)

Graphs # 3b: E3 Stats - Slave with Binary Logs - Relaxed Durability (Commits, CPU, Read IOPS)

Graphs # 4a: E4 Stats - Slave with Binary Logs - High Durability (Commits, CPU, Read IOPS)

Graphs # 4b: E4 Stats - Slave with Binary Logs - Relaxed Durability (Commits, CPU, Read IOPS)

Workloads

It is mentioned briefly in the main post that the four test environments have different workloads:

  • E2 is a CPU-bound workload (the dataset fits in RAM).
  • E1 is also mostly CPU-bound but with some cache misses in the InnoDB buffer pool, needing a page fetch from disk before doing a write.
  • E3 is a mixed CPU and IO workload (more cache misses in the InnoDB buffer pool but still with enough cache hit to get a good commit throughput).
  • E4 is an IO-bound workload (mostly cache misses).

We can confirm that E2 does not do much IO by looking at IOWait and Read IOPS in the graphs below (commit throughput is also good).

Graphs # 5: E2 is CPU-bound (good commit throughput, low IOWait and few Read IOPS).

E1 has a little more IOWait and Read IOPS, but still not that much, as you can see in the graphs below. Its commit throughput is lower than that of E2, but still good (the transaction sizes are bigger).

Graphs # 6: E1 is also mostly CPU-bound (good commit throughput, low IOWait and few Read IOPS).

E4 is clearly an IO-bound workload: commit throughput is low, IOWait is high, and Read IOPS are high as shown in the graphs below.

Graphs # 7: E4 is IO-bound (low commit throughput, high IOWait and high Read IOPS).

E3 has a much higher commit throughput than E4 but with lots of Read IOPS as shown in the graphs below. This makes us put this workload in a mixed CPU and IO-bound category (confirmed by a high IOWait but still lower than E4).

Graphs # 8: E3 is mixed CPU and IO-bound (respectable commit throughput, high IOWait and high Read IOPS).

Note: the Write IOPS are not as important because they are buffered in the RAID controller write cache. The Read IOPS are important because a write operation to the database that needs an InnoDB page fetch from disk is bounded by the disk seek time.

Additional Discussions

Another Problem with Long-Running Transactions

In the main post, we discussed the impact of long-running transactions on the parallel replication pipeline (idle threads while the long transaction is running). When the number of slave threads is not smaller than the group commit size, the time taken to execute the group is bound by the time taken to run the longest transaction, as shown below.

   --------Time-------->
T1:  B-----------C
T2:           B--C
T3:           B--C
T4:           B--C
T5:              B--C
T6:              B--C

The transactions above on the master are executed as below on the slave with SPT=4.

   --------Time-------->
T1:  B-----------C
T2:  B-- . . . . C
T3:  B-- . . . . C
T4:  B-- . . . . C
T5:               B--C
T6:               B--C
              1
     12345678901234567

But if the number of slave threads is smaller than the group commit size, we get a longer run, as shown below (SPT=2):

   ---------Time-------->
T1:  B-----------C
T2:  B-- . . . . C
T3:               B--C
T4:               B--C
T5:                   B--C
T6:                   B--C
              1         2
     123456789012345678901

So we lost 4 units of time in the global run by decreasing SPT. What is interesting is that there is enough time for executing both T3 and T4 while T1 executes (and while the thread running T2 is idle), so maybe it is possible to avoid this situation.

Note that MySQL 5.7 solves this problem by allowing out-of-order commit. With slave-preserve-commit-order set to 0 (the default), MySQL will commit T2 without waiting for T1 to commit. This will allow T3 to start executing while T1 is still running. However, this changes the behavior of the slave: T2 (and eventually T3 and T4) becomes visible before T1, which is not possible in single-threaded replication. Some people (including the author of this post) might be nervous about this and would prefer to set slave-preserve-commit-order to 1.
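
For readers who want to experiment with this on MySQL 5.7, the relevant settings are sketched below (this is not part of the benchmarks above; the host and values are illustrative, the SQL thread must be stopped while these variables are changed, and slave_preserve_commit_order additionally requires log-bin and log-slave-updates in 5.7).

  use strict;
  use warnings;
  use DBI;

  # Hypothetical MySQL 5.7 slave.
  my $dbh = DBI->connect('DBI:mysql:host=mysql57-slave', 'admin', 'secret',
                         { RaiseError => 1 });

  $dbh->do('STOP SLAVE SQL_THREAD');

  # Use the commit-order (group commit) information from the master to parallelize.
  $dbh->do("SET GLOBAL slave_parallel_type = 'LOGICAL_CLOCK'");
  $dbh->do('SET GLOBAL slave_parallel_workers = 40');

  # 1 keeps the single-threaded commit order (the conservative choice discussed above);
  # the default 0 allows transactions to become visible out of order.
  $dbh->do('SET GLOBAL slave_preserve_commit_order = 1');

  $dbh->do('START SLAVE SQL_THREAD');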

A future solution to manage long-running transactions and group sizes larger than SPT could be to delegate committing transactions. If the thread running T2 delegates its commit to another thread, it can start running T3 even if T2 is not yet committed. This could be an interesting optimization to implement in MariaDB or in MySQL.

(Note that the problem described above - group size larger than SPT - cannot explain the modest speedup observed in our tests as we identify parallelism with binlog_commit_wait_count = 35 and we run tests with SPT up to 40.)

Relaxed Durability Burns CPU for E1 and E2

As presented in the main post, E1 and E2 have respective speedups of ~1.06 and ~1.04 for SB-ND. However, as shown in the graphs below, while increasing SPT, CPU consumption also increases noticeably without significantly improving the commit throughput.

Graphs # 9: E1 Stats - Slave with Binary Logs - Relaxed Durability (Commits, CPU)

Graphs # 10: E2 Stats - Slave with Binary Logs - Relaxed Durability (Commits, CPU)

Either we hit a synchronization bottleneck in the parallel applier or there is room for optimization: there is probably interesting work to be done here.

Evaluating MySQL Parallel Replication Part 3: Benchmarks in Production


Parallel replication is a much-expected feature of MySQL. It is available in MariaDB 10.0 and in MySQL 5.7. In this 3rd post of the series, we present benchmark results from Booking.com production environments.

Note: this post has an annex: Under the Hood. Benchmarking is a complex art and reporting results accurately is even harder. If all the details were put in a single article, it would make a very long post. The links to the annex should satisfy readers eager for more details.

Parallel replication is on its way and with it comes the hope that more transactions can be run on a master without introducing slave lag. But it remains to be seen whether this dream will come true - will parallel replication hold its promise or will there be surprises after its deployment? We would like to know whether or not we can count on that feature in the future.

To get answers, nothing is better than experimenting with the technology. Benchmark results have already been published (MariaDB 10.0: 10 times improvement; and MySQL 5.7: 3 to 6 times improvement) but results might be different in our production environments. The best test would be to run parallel replication on our real workloads but this is not trivial. To be able to run transactions in parallel, a slave needs parallelism information from the master (for more details, see Part 1). With a master in an older version (MySQL 5.6 in our case), slaves do not have this information.

Luckily, we can use slave group commit on an intermediate master to identify transactions that can be run in parallel. The trick is to execute transactions sequentially on a slave, but to delay their commit. While the commit is delayed, the next transaction is started. If the two transactions are non-conflicting, the second could complete and the two transactions would commit together. In this scenario, grouping has succeeded (the two transactions committed in a single group) and parallelism is identified (as they commit together, the transactions are non-conflicting, thus can be run in parallel on slaves). For more details on slave group commit, see Part 2.

Our benchmarks are established within four production environments: E1, E2, E3, and E4. Each of these environments is composed of one production master, with some intermediate masters, and a leaf slave. A complete description of those environments can be found in the annex.

Slave group commit identifies less parallelism than a parallel execution on a master would identify (details in the Group Commit: Slave vs. Master section of the annex). Even so, decent group sizes are obtained, as shown in the group commit size graphs in the annex. Usually we have group sizes of at least 5, most of the time they are larger than 10, and sometimes they are as big as 15 or even 20. These will allow us to test parallel replication with real production workload.

Before closing this long introduction, let's talk a little about our expectations. InnoDB is getting better at using many cores (RO and RW benchmark results) but single-threaded replication gets in the way of pushing more writes in a replicating environment. Without parallel replication, a single core can be used on a master to perform writes without incurring slave lag. This is disappointing as servers come with 12 and more cores (only one can be used for writes). Ideally, we would like to use a significant percentage of the cores of a server for writes (25% could be a good start). So speedups of 3 would be good results at this point (12 cores), and speedups of 6 and 10 would be needed in the near future (24 and 40 cores).

The Test: Catching Up with 24 Hours of Transactions

Our benchmark scenario is as follows: after restoring a backup of the database, starting MariaDB, waiting for the buffer pool to be loaded, and running the slave for at least 12 hours, we measure the length of time it has taken to process 24 hours of production transactions.

The tests are run in the following binary log configurations:

  • Intermediary Master (IM): both binary logs and log-slave-updates are enabled.
  • Slave with Binary Logs (SB): binary logs are enabled but log-slave-updates is disabled.
  • Standard Slave (SS): both binary logs and log-slave-updates are disabled.

And in the following durability configurations:

  • High Durability (HD): sync_binlog = 1 and innodb_flush_log_at_trx_commit = 1.
  • No Durability (ND): sync_binlog = 0 and innodb_flush_log_at_trx_commit = 2 (also described/known as relaxed durability).

For each of those configurations (6 in total: IM-HD, IM-ND, SB-HD, SB-ND, SS-HD, and SS-ND), the tests are run with different values of slave_parallel_threads (SPT). The full results are presented in the annex and the most interesting results are presented below (SB-HD and SB-ND). The times presented are in the format hours:minutes.seconds. Next to the time taken to process 24 hours of transactions, each cell shows in parentheses the speedup achieved over the single-threaded run.

Execution Times and Speedups for Slave with Binary Logs
SPT | E1 SB-HD       | E1 SB-ND       | E2 SB-HD       | E2 SB-ND       | E3 SB-HD       | E3 SB-ND       | E4 SB-HD       | E4 SB-ND
0   | 9:11.11        | 5:22.29        | 2:50.58        | 1:11.30        | 9:58.58        | 9:06.20        | 7:43.06        | 7:26.24
5   | 6:47.38 (1.35) | 5:15.03 (1.02) | 1:40.32 (1.70) | 1:11.19 (1.00) | 8:41.47 (1.15) | 8:24.40 (1.08) | 6:36.06 (1.17) | 6:29.52 (1.15)
10  | 6:16.49 (1.46) | 5:10.02 (1.04) | 1:28.48 (1.93) | 1:09.40 (1.03) | 8:23.07 (1.19) | 8:16.23 (1.10) | 6:17.59 (1.23) | 6:13.48 (1.19)
20  | 6:00.22 (1.53) | 5:07.16 (1.05) | 1:24.52 (2.01) | 1:09.03 (1.04) | 8:06.12 (1.23) | 8:05.13 (1.13) | 6:05.40 (1.27) | 6:05.48 (1.22)
40  | 5:53.42 (1.56) | 5:05.36 (1.06) | 1:22.12 (2.08) | 1:08.32 (1.04) | 8:07.26 (1.23) | 8:04.31 (1.13) | 5:59.30 (1.29) | 5:59.57 (1.24)

In the Graphs during Tests section of the annex, you can find many details about the different test runs.

Discussion

There are lots of things to say about those results. Let's start with observations that are not related to parallel replication (SPT=0):

  • Obs1: Standard Slave results (without binary logs) are very close to the results of Slave with Binary Logs (without log-slave-updates) (details in the annex).
  • Obs2: log-slave-updates has a visible cost for E1 (the time difference between IM-HD and SB-HD), a less obvious but still noticeable cost for E2 and E3, and it is a win for E4 (that last one is disturbing; the numbers are in the annex).
  • Obs3: relaxing durability is a huge win for E1 and E2, a more limited win for E3, and a much smaller one for E4.

With reference to Obs1 above, this shows that binary logs should probably not be disabled on slaves: the cost is almost nonexistent and the wins are big (tracing errant transactions and being a candidate for master promotion). However, slaves with log-slave-updates are slower than slaves with only binary logs enabled (Obs2 above), so log-slave-updates should be avoided when possible. Binlog Servers can be used to replace log-slave-updates for intermediate masters, see MySQL Slave Scaling for more details (see also Better Parallel Replication for MySQL for an explanation of why log-slave-updates is bad on intermediate masters for parallel replication).

With reference to Obs3 above, this can be explained by the different workload of the four environments (more details about the workloads can be found in the annex):

  • E2 is a CPU-bound workload (the dataset fits in RAM).
  • E1 is also mostly CPU-bound but with some cache misses in the InnoDB buffer pool, so it needs a page fetch from disk before doing a write.
  • E3 is a mixed CPU and IO workload (more cache misses in the InnoDB buffer pool but still with enough cache hit to get a good commit throughput).
  • E4 is an IO-bound workload (mostly cache misses).

Relaxing durability on CPU-bound workloads achieves good throughput improvements, but this does not happen on IO-bound workloads.

Now, let's focus on parallel replication. The Standard Slave results (SS-HD and SS-ND) are not worth discussing as they are very close to the Slave with Binary Logs results (Obs1 above). We will also not discuss Intermediate Master results (IM-HD and IM-ND) as they should be replaced by Binlog Servers. So all observations below are made on the results of Slave with Binary Logs (SB-HD and SB-ND):

  • Obs4: the best speedup (~2.10) is in E2 with high durability. E1 follows with a speedup of ~1.56 (always with high durability).
  • Obs5: the speedups for E4 are modest (~1.29) and the results are almost identical for both durability settings.
  • Obs6: for E1 and E2, the speedups with no durability are almost non-existent (less than 1.10).
  • Obs7: for both E1 and E2, relaxing durability with single-threaded replication leads to faster execution than enabling parallel replication.
  • Obs8: the results for E3 are halfway between E1/E2 and E4: both SB-HD and SB-ND get some modest speedups from parallel replication (like E4 and opposite to E1/E2) and relaxing durability makes things run a little faster (like E1/E2 and opposite to E4), but not to the point where single-threaded low durability is faster than multi-threaded high durability.

All those observations point to the importance of the workload in parallel replication speedups:

  • CPU-bound workloads seem to get modest speedups in high-durability configurations.
  • Relaxing durability for CPU-bound workloads looks like a better option than enabling parallel replication on a high-durability configuration.
  • IO-bound workloads get more limited speedups.

Our first reaction is disappointment: the speedups are not as high as expected. Don't get us wrong: faster is always better, especially when the only thing to do is to upgrade the software, which we will do anyway eventually. However, having only 25% more writes on a master (or 110% depending on which environment we look at) will not help us in the long term. Parallel replication is not the solution (at least not the only solution) that will allow us to stop/avoid sharding.

Ideas and Future Work

We have a hypothesis explaining the modest speedups: long-running transactions. In the presence of long-running transactions, the parallel replication pipeline on the slave stalls. Let's take the following six transactions committing on the master in two commit groups (B for begin and C for commit):

   --------Time-------->
T1:  B-----------C
T2:           B--C
T3:           B--C
T4:           B--C
T5:              B--C
T6:              B--C

Running those transactions on a single-threaded slave takes 33 units of time (time scale is at the bottom):

   ----------------Time---------------->
T1:  B-----------C
T2:               B--C
T3:                   B--C
T4:                       B--C
T5:                           B--C
T6:                               B--C
              1         2         3
     123456789012345678901234567890123

Running those transactions on a multi-threaded slave with SPT=4 takes 17 units of time:

   ---------Time-------->
T1:  B-----------C
T2:  B-- . . . . C
T3:  B-- . . . . C
T4:  B-- . . . . C
T5:               B--C
T6:               B--C
              1
     12345678901234567

So we barely achieve a speedup of 2 (and the second commit group does not even contain a large transaction). The low speedup is explained by T1 being much bigger than the other transactions in the group. So our intuition is that to get better speedup with parallel replication, all transactions should be of similar size, and bigger transactions should be broken down into smaller ones (when possible).

We have many of those big transactions in our workload at Booking.com. Most of our design choices predate MySQL 5.6, where a commit was expensive. Reducing the number of commits was a good optimization at that time, so doing many changes in a single transaction was a good thing. Now, with binary log group commit, this optimization is less useful but does no harm. However, it is very bad for parallel replication.

There are at least two other things to discuss from those results, but this post is already too long, so you will have to go to the annex to read the Additional Discussions.

Conclusion

It is possible to test parallel replication with true production workload, even if the master is running an old version of MySQL. Thanks to slave group commit in MariaDB 10.0, we can identify parallelism on intermediate master and enable parallel replication on a slave. Even if this parallelism identification is not as good as it would be on a master, we get decent group sizes.

Our CPU-bound workloads are getting speedups of ~1.56 to ~2.10 with high-durability constraints. This is a little disappointing: we would like to have more than two cores busy applying writes on slaves. Our guess is that better speedup could be obtained by hunting down large transactions, but that still needs to be verified. At this point, and for this type of workload, our tests show that relaxing durability is a better optimization than enabling parallel replication. Finally, with relaxed durability, parallel replication shows almost no improvement (4% to 6% improvement), and it is still unknown if hunting down large transactions and splitting them would result in better speedups.

Our IO-bound workloads are getting speedups of ~1.23 to ~1.29, which is also disappointing but expected, because it is hard to fight against the seek time of magnetic disks. For this type of workload, even the relaxed durability setting benefits from parallel replication. However, at high enough parallelism on the slave, relaxing durability is not very beneficial for this type of workload. It is hard to tell what improvement would come from hunting down large transactions here.

The next step on parallel replication evaluation would be to try optimistic parallel replication. This will make a good fourth part in the series.


Abstracting Binlog Servers and MySQL Master Promotion without Reconfiguring all Slaves


In a MySQL replication deployment, the master is a single point of failure. To recover after the failure of this critical component, a common solution is to promote a slave to be the new master. However, when doing so using classic methods, the slaves need to be reconfigured. This is a tedious operation in which many things can go wrong. We found a simpler way to achieve master promotion using Binlog Servers. Read on for more details.

When a master fails in a MySQL replication deployment, the classic way to promote a slave to be the new master is the following:

  1. Find the most up-to-date slave.
  2. If the most up-to-date slave is not a good candidate master, level a suitable candidate with the most up-to-date slave [1].
  3. Repoint the remaining slaves to the new master.

The procedure above needs to contact all slaves in step #1, and to reconfigure all slaves in step #3. This becomes increasingly complex in Booking.com environments where we have very wide, and still growing, replication topologies; it is not uncommon to have more than fifty (and sometimes more than a hundred) slaves replicating from the same master. Many things can go wrong when tens of slaves need to be contacted and reconfigured:

  • some slaves might be down for maintenance or for taking a backup,
  • some slaves could be temporarily unreachable for other reasons,
  • and a few slaves could be processing a big backlog of relay logs (including delayed slaves), which will make them hard/unsuitable to reconfigure.

A way to reduce the complexity of master promotion is presented below, but to get there, we must first give some context about Binlog Servers and abstract them into a service.

Reminders about Binlog Servers

In a previous post, I described how to take advantage of Binlog Servers to perform master promotion without GTIDs and without log-slave-updates, while still requiring all slaves to be reconfigured. To do this, the slaves must replicate through a Binlog Server. This gives us the following deployment with a single Binlog Server:

+---+
| A |
+---+
  |
 / \
/ X \
-----
  |
  +----------+----------+----------+----------+----------+
  |          |          |          |          |          |
+---+      +---+      +---+      +---+      +---+      +---+
| B |      | C |      | D |      | E |      | F |      | G |
+---+      +---+      +---+      +---+      +---+      +---+

or with redundant Binlog Servers:

+---+
| A |
+---+
  |
  +--------------------------------+
  |                                |
 / \                              / \
/ X \                            / Y \
-----                            -----
  |                                |
  +----------+----------+          +----------+----------+
  |          |          |          |          |          |
+---+      +---+      +---+      +---+      +---+      +---+
| B |      | C |      | D |      | E |      | F |      | G |
+---+      +---+      +---+      +---+      +---+      +---+

or with more than one site with redundant Binlog Servers.

  +---+
  | A |
  +---+
    |
    +-----------+------------------------+
    |           |                        |
   / \         / \                      / \         / \
  / X \       / Y \                    / Z \------>/ W \
  -----       -----                    -----       -----
    |           |                        |           |
  +-+-----------+-+                    +-+-----------+-+
  |               |                    |               |
+---+           +---+                +---+           +---+
| S1|    ...    | Sn|                | T1|    ...    | Tm|
+---+           +---+                +---+           +---+

These diagrams are becoming increasingly complex, so let's simplify them by abstracting away the Binlog Servers.

Binlog Server Abstraction

By hiding the Binlog Servers in an abstracted layer, which I call the Distributed Binlog Serving Server (DBSS), a deployment on three sites becomes the following:

   +---+
   | M |
   +---+
     |
+----+----------------------------------------------------------+
|                                                               |
+----+---------+-----------+---------+-----------+---------+----+
     |         |           |         |           |         |
   +---+     +---+       +---+     +---+       +---+     +---+
   | S1| ... | Sn|       | T1| ... | Tm|       | U1| ... | Uo|
   +---+     +---+       +---+     +---+       +---+     +---+

Of course, the DBSS is built with many Binlog Servers. One way to build the layer above, while minimizing the number of slaves served by the master, is described below. Other ways to build this layer can be imagined [2], but let's stick to this one for now.

+----|----------------------------------------------------------+
|    +---------------------+---------------------+              |
|    |                     |                     |              |
|   / \                   / \                   / \             |
|  / X1\----->/ \        / X2\----->/ \        / X3\----->/ \   |
|  -----     / Y1\       -----     / Y2\       -----     / Y3\  |
|    |       -----         |       -----         |       -----  |
+----|---------|-----------|---------|-----------|---------|----+

In the deployment above, using one DNS A record per site resolving to both Xi and Yi, if a Binlog Server fails, its slaves will reconnect to the other one. If the Yi Binlog Server fails, nothing more needs to be done. If the Xi Binlog Server fails, the corresponding Yi must be repointed to the master. This repointing is easy, as, by design, a Binlog Server is identical to its master. Only the destination server must be changed, and the binary log filename and position stay the same.

When the Master Fails...

Equipped with the above implementation of the DBSS, in a situation when the master fails, we end up with the state below; each site might be at a different position in the binary log stream of the failed master.

+---------------------------------------------------------------+
|                                                               |
|   / \                   / \                   / \             |
|  / X1\----->/ \        / X2\----->/ \        / X3\----->/ \   |
|  -----     / Y1\       -----     / Y2\       -----     / Y3\  |
|    |       -----         |       -----         |       -----  |
+----|---------|-----------|---------|-----------|---------|----+

The first step of master promotion is to level the Binlog Servers in the DBSS. To do so, the most up-to-date Binlog Server must be found and all other Binlog Servers must be chained to it. In the deployment above, only three servers must be contacted, which is much easier than contacting tens of slaves. If the most up-to-date Binlog Server is X2, levelling the Binlog Servers results in the temporary replication architecture below.

+---------------------------------------------------------------+
|                                                               |
|   / \ <-----------------/ \-----------------> / \             |
|  / X1\----->/ \        / X2\----->/ \        / X3\----->/ \   |
|  -----     / Y1\       -----     / Y2\       -----     / Y3\  |
|    |       -----         |       -----         |       -----  |
+----|---------|-----------|---------|-----------|---------|----+

Levelling should happen very quickly (if it does not, one of the Binlog Servers is lagging, which should not happen). After that, the slaves will quickly follow. Once a slave is up to date, master promotion can be performed (strictly speaking, promotion does not need levelling: a slave of X2 or Y2 could have been promoted before levelling). In the diagram below, a slave from the third site on the right has been chosen to be the new master, but any slave on any of the three sites could have been used.

+------------------------------------------------|--------------+
|    +---------------------+---------------------+              |
|    |                     |                     |              |
|   / \                   / \                   / \             |
|  / X1\----->/ \        / X2\----->/ \        / X3\----->/ \   |
|  -----     / Y1\       -----     / Y2\       -----     / Y3\  |
|    |       -----         |       -----         |       -----  |
+----|---------|-----------|---------|-----------|---------|----+

Note that the other slaves have not been touched: they are still connected to their Binlog Server. This means that this solution works well even if one of the slaves is unavailable during master promotion. This solution also works very well with delayed or lagging slaves, as those slaves are simply not good candidates for becoming the new master. For some time, the lagging slaves will process the binary logs of the old master that are still stored on the Binlog Servers.

The Trick for not Reconfiguring every Slave

Promoting a slave to be the new master in a DBSS deployment requires working some magic on the slave to make its binary log position (SHOW MASTER STATUS) match what is expected by the Binlog Servers. Let's take an example: if the last binary log stored on the levelled Binlog Servers is binlog.000163, we could repoint the Binlog Servers to a new master if the SHOW MASTER STATUS of this new master is at the beginning of binary log filename binlog.000164.

When doing that promotion, from the point of view of the Binlog Servers, their master is simply restarted with a different server_id and server_uuid. From the point of view of the slaves, they are processing the binary logs of the old master (up to and including binlog.000163) followed by the binary logs of the new master (starting at binlog.000164).

So, the trick is to have our candidate master at the right binary log position. This can be made possible by:

  1. configuring all nodes with binary logging enabled,
  2. with all identical log-bin value (binlog in the example above),
  3. and without enabling log-slave-updates.

Configuration #3 above allows us to assume that the master will consume binary log filenames much faster than the slaves. This way, the slaves will always be behind the master in their binary log filenames [3]. As such, bringing a slave to the right binary log filename is as simple as running FLUSH BINARY LOGS in a loop until the slave is at the correct position. To prevent this loop from taking too much time, we can run a cron job on our slaves that makes sure they are never too far away from their master (at most ten binary logs behind, for example).
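
A minimal sketch of that loop is shown below (assuming a DBI connection to the candidate master; the host, credentials, and target filename are placeholders, and the real promotion tooling is not shown here). After the loop, PURGE BINARY LOGS TO and RESET SLAVE ALL complete the promotion, as summarised in the next section.

  use strict;
  use warnings;
  use DBI;

  # Hypothetical candidate master; the target is the binary log filename that
  # follows the last one stored on the levelled Binlog Servers.
  my $dbh = DBI->connect('DBI:mysql:host=candidate-master', 'admin', 'secret',
                         { RaiseError => 1 });
  my $target_file = 'binlog.000164';

  while (1) {
      my ($current_file) = $dbh->selectrow_array('SHOW MASTER STATUS');
      last if $current_file ge $target_file;   # same prefix and zero-padded numbers,
                                               # so a string comparison is enough
      $dbh->do('FLUSH BINARY LOGS');
  }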

Summary of Master Promotion

In the following replication deployment, with log-bin=binlog and with log-slave-updates disabled:

   +---+
   | M |
   +---+
     |
+----+----------------------------------------------------------+
|                                                               |
+----+---------+-----------+---------+-----------+---------+----+
     |         |           |         |           |         |
   +---+     +---+       +---+     +---+       +---+     +---+
   | S1| ... | Sn|       | T1| ... | Tm|       | U1| ... | Uo|
   +---+     +---+       +---+     +---+       +---+     +---+

If M fails, we first level the Binlog Servers in the DBSS.

Once this is done, and let's take T1 as our candidate master, we need to perform the following on it:

  1. FLUSH BINARY LOGS until the binary log filename follows the last one from the levelled DBSS,
  2. PURGE BINARY LOGS TO <latest binary log file>,
  3. RESET SLAVE ALL.

Step #2 above drops all the binary logs on the new master that could conflict with those of the previous master. The binary logs of the old master are stored on the DBSS, and we must be sure to avoid having similar, but misleading, data on the new master.

We now have this:

   +\-/+
   | X |
   +/-\+

+---------------------------------------------------------------+
|                                                               |
+----+---------+---------------------+-----------+---------+----+
     |         |                     |           |         |
   +---+     +---+       +---+     +---+       +---+     +---+
   | S1| ... | Sn|       | T1| ... | Tm|       | U1| ... | Uo|
   +---+     +---+       +---+     +---+       +---+     +---+

where we repoint the DBSS to T1 to get the following:

   +\-/+                 +---+
   | X |                 | T1|
   +/-\+                 +---+
                           |
+--------------------------+------------------------------------+
|                                                               |
+----+---------+-----------+---------+-----------+---------+----+
     |         |           |         |           |         |
   +---+     +---+       +---+     +---+       +---+     +---+
   | S1| ... | Sn|       | T2| ... | Tm|       | U1| ... | Uo|
   +---+     +---+       +---+     +---+       +---+     +---+

and we have achieved master promotion without reconfiguring all slaves.

A Cleaner Way

The trick above works well, but performing FLUSH BINARY LOGS in a loop is not the cleanest of solutions. It would be much better if there was a way to set the binary log to the desired filename in a single operation. With this idea in mind, we created the following two feature requests:

MariaDB 10.1.6 is already implementing a RESET MASTER TO syntax. Let's hope that Oracle will provide something similar in MySQL 5.7.

What about the Software?

This idea and procedure are all well and good, but they are not very useful if you cannot apply them yourself. The currently available version of the Binlog Server, the MaxScale Binlog Router plugin, does not yet implement all the configuration hooks needed to make this procedure easy. Booking.com is currently working with MariaDB to implement the missing hooks in a new version of MaxScale. We are in the last testing phase of a Binlog Router plugin that supports the following:

  • STOP SLAVE, START SLAVE, SHOW MASTER STATUS, SHOW SLAVE STATUS, CHANGE MASTER TO: these new commands allow easier configuration of the Binlog Server.
  • The CHANGE MASTER TO command not only makes it easy to chain Binlog Servers, but also to bootstrap a Binlog Server without editing the configuration file. Moreover, this command allows repointing MaxScale to a new master at binary log filename N+1, effectively making master promotion possible.
  • Transaction safety: when the master fails, the Binlog Server could have downloaded a partial transaction. If we replace the master with a slave, this transaction should not be sent to slaves. So this feature of the next version of MaxScale will make sure such partial transactions are not sent downstream.
  • DBSS identity: the initial design of the Binlog Server was intended to impersonate the master, and did not consider swapping the master at the top of the hierarchy. In a DBSS deployment, swapping the master should not be made visible to slaves, so the Binlog Servers should present the slave with a different server_id and server_uuid to those of the master. The next version of the MaxScale Binlog Router supports that virtual master feature.

This next version of the MaxScale Binlog Router will be generally available once we are done with the testing. Stay tuned on the MariaDB web site for the announcement and the failover procedure. In the meantime, you can still experiment with master promotion without reconfiguring all slaves by using the current version of MaxScale and following this proof of concept procedure.

If you are interested in this topic and would like to learn more, I am giving a talk about Binlog Servers at Percona Live Amsterdam. Feel free to grab me after the talk, catch me at the Booking.com booth (#205) or share a drink with me at the Community Dinner, to exchange thoughts on this subject. (You can also post a comment below.)

I will also be giving a talk about Binlog Servers at Oracle Open World in San Francisco at the end of October.

One last thing: if you want to know more about other cool things we do at Booking.com, I suggest you come to our other talks at Percona Live Amsterdam in September:

[1] Slave levelling can be done with MHA, with MySQL 5.6 or MariaDB 10.0 GTIDs, or with Pseudo-GTIDs when using earlier versions of MySQL and MariaDB.

[2] If we were not concerned about WAN bandwidth, all Binlog Servers could be directly connected to the master. Another solution could be to connect all master-local Binlog Servers directly to the master and to use the chained strategy for remote Binlog Servers. (This hybrid deployment could be well-suited to a semi-sync deployment, but I am diverging from the subject of this post.)

[3] The same can be achieved when using log-slave-updates, by using smaller max_binlog_size on the master than on all the slaves.

Changing our attitude towards failure


Are you afraid to have your designs critiqued? If so you’re not alone, but it’s worth considering how this affects your ability to iterate on your designs and improve the product for your customers.

As designers, we sometimes struggle to ask for feedback during the design process. We get attached to our ideas because of the long hours we spend perfecting our designs. We may also be afraid to show our work in progress, because we don’t want to be judged too harshly on these early attempts. Society reinforces a fear of failure throughout our lives, highlighting achievements but rarely discussing the failures it took to reach them.

There are steps we can take to overcome these mental blocks that inhibit us from delivering the best experience to our customers. The first step is changing your own attitude towards failure and criticism. The second is to build a culture within your company which recognises failure and iteration as necessary to create a great product for your customers, which is something we value at Booking.com.

“You are not your idea, and if you identify too closely with your ideas, you will take offense when they are challenged.” Catmull, E. and Wallace, A. (2014) Creativity, Inc.: Overcoming the Unseen Forces That Stand in the Way of True Inspiration

While you may feel that your initial design is a reflection of your skill, it’s easier to view it as a concept to be shaped through several iterations. Then you can see it as being constantly in flux and you will continue to adapt your ideas based on new information, data or feedback. At Booking.com we encourage this, along with a healthy dose of humility, because we must let our designs be part of the whole experience and not something to be exclusively owned.

When we are young, we learn by doing; children will test multiple possibilities to solve a problem. No one tells them that it’s wrong to experiment or that they must find the correct answer on the first try. Taking a cue from our younger selves, we can be open to experimentation, allowing us to iterate through various concepts and continually improve.

Experimentation is the key to validating our ideas at Booking.com. Only a small percentage of experiments are successful, but if we aren’t willing to fail, we also won’t be able to learn how to improve the product for our customers.

“This is one of the most important lessons of the scientific method: if you cannot fail, you cannot learn.” Ries, E. (2011). The Lean Startup: How Today's Entrepreneurs Use Continuous Innovation to Create Radically Successful Businesses.

We can experiment on everything: from the designs on our website to new internal processes. If someone has an idea, they are empowered to implement it, get feedback and iterate. Experimentation can include: A/B testing, creating a prototype during a Hackathon, qualitative user testing or trying out a new process on your team members. Being in a constant state of change means we must be agile and take on new challenges, both individually and in our teams.

We are constantly receiving feedback from our users. So when an idea fails, we can’t take it personally; we must rethink our hypothesis or implementation and try again. We see all user feedback as good feedback, because it takes us closer to delivering what the customer wants.

A work-in-progress design is often not beautiful. You may feel that sharing a design in this embryonic stage could be damaging to your credibility. That people may believe you’re an inferior designer from your rough sketches, wireframes or unpolished layouts. But the upside to getting feedback early is that you can respond to problems and change your idea, interaction pattern or layout quickly, which reduces rework time.

“The trick to having good ideas is not to sit around in glorious isolation and try to think big thoughts. The trick is to get more parts on the table.” Johnson, S. (2010). Where Good Ideas Come From: The Natural History of Innovation

We often have the image of a creative genius on their own, conjuring amazing designs. However, most of us know this is rarely the case. By asking for feedback on your idea or design from your peers, you access different perspectives. If you find holes in your own logic because you can’t justify your ideas, this could be a red flag that your concept needs to be reworked.

With the spirit of experimentation, it becomes important for all of us to support each other on the journey. At Booking.com there are many ways for designers to get feedback and encouragement from their peers. These can include: face to face chats with peers, an internal chatroom, a designer email list, meeting with designers in your track for feedback and a design sharing tool akin to Dribbble (which was produced during several Hackathons). All of these produce a similar result - quick feedback that can be used to improve your idea or design. This decreases the time between getting our ideas in front of customers, collecting their feedback through experimentation and continuing to iterate on the design.

“Failure isn’t a necessary evil. In fact, it isn’t evil at all. It is a necessary consequence of doing something new.” Catmull, E. and Wallace, A. (2014). Creativity, Inc.: Overcoming the Unseen Forces That Stand in the Way of True Inspiration

By accepting that failing is part of learning, we decrease our fear of failure and become more willing to experiment with new ideas. As we experiment and seek feedback, we will see how this benefits our customers, by creating a great product that is built on data and not opinions. Innovation won’t happen without failure. We must embrace it to continue learning and grow!

Using Riak as Events Storage - Part 1


Booking.com constantly monitors, inspects, and analyzes our systems in order to make decisions. We capture and channel events from our various subsystems, then perform real-time, medium and long-term computation and analysis.

This is a critical operational process, since our daily work always gives precedence to data. Relying on data removes the guesswork in making sound decisions.

In this series of blog posts, we will outline details of our data pipeline, and take a closer look at the short and medium-term storage layer that was implemented using Riak.

Introduction to Events Storage

Booking.com receives, creates, and sends an enormous amount of data. Usual business-related data is handled by traditional databases, caching systems, etc. We define events as data that is generated by all the subsystems on Booking.com.

In essence, events are free-form documents that contain a variety of metrics. The generated data does not contain any direct operational information. Instead, it is used to report status, states, secondary information, logs, messages, errors and warnings, health, and so on. The data flow represents a detailed status of the platform and contains crucial information that will be harvested and used further down the stream.

To put this in numerical terms: we handle billions of events per day, streaming at more than 100 MB per second and adding up to more than 6 TB per day.

Here are some examples of how we use the events stream:

  • Visualisation: Wherever possible, we use graphs to express data. To create them, we use a heavily-modified version of Graphite.
  • Looking for anomalies: When something goes wrong, we need to be notified. We use threshold-based notification systems (like seyren) as well as a custom anomaly detection software, which creates statistical metrics (e.g. change in standard deviation) and alerts if those metrics look suspicious.
  • Gathering errors: We use our data pipeline to pass stack traces from all our production servers into ElasticSearch. Doing it this way (as opposed to straight from the web application log files) allows us to correlate errors with the wealth of the information we store in the events.

These typical use cases are served less than one minute after the related event has been generated.

High Level overview

This is a very simplified diagram of the data flow:

 -----------------------------
 | box1 | box2 | ... | boxn  |        senders (e.g. webapps)
 -----------------------------
               |
 -----------------------------
 | collect1 | ... | collectm |        aggregators
 -----------------------------
               |
 ------------------------------
 |   Riak   | ======> | Hadoop |      storage
 ------------------------------
      |                   |
 ---------------   ---------------
 |  short term |   |  long term  |    data crunching
 |  consumers  |   |  consumers  |
 ---------------   ---------------

We can generate events by using literally any piece of code that exists on our servers. We pass a HashMap to a function, which packages the provided document into a UDP packet and sends it to a collection layer. This layer aggregates all the events together into "blobs", which are split by seconds (also called epochs) and other variables. These event blobs are then sent to the storage layer running Riak. Finally, Riak sends them on to Hadoop. The Riak cluster is meant to safely store around ten days of data. It is used for near real-time analysis (something that happened seconds or minutes ago), and medium-term analysis of relatively small amounts of data. We use Hadoop for older data analysis or analysis of a larger volume of data.

The above diagram is a simplified version of our data flow. In practical application, it's spread across multiple datacenters (DC), and includes an additional aggregation layer.

Individual Events

An event is a small schema-less[1] piece of data sent by our systems. That means that the data can be in any structure with any level of depth, as long as the top level is a HashTable. This is crucial to Booking.com - the goal is to give as much flexibility as possible for the sender, so that it's easy to add or modify the structure, or the type and number of events.

Events are also tagged in four different ways:

  • the epoch at which they were created
  • the DC where they originated
  • the type of event
  • the subtype.

Some common types are:

  • WEB events (events produced by code running under a web server)
  • CRON events (output of cron jobs)
  • LB events (load balancer events)

The subtypes are there for further specification and can answer questions like: "Which one of our web server systems are we talking about?".

Events are compressed Sereal blobs. Sereal is possibly the best schema-less serialisation format currently available. It was also written at Booking.com.

An individual event is not very big, but a huge number of them are sent every second.

We use UDP as transport because it provides a fast and simple way to send data. There is a small risk of data loss, but sending an event never blocks or slows down the sender. We are experimenting with a UDP-to-TCP relay that will be local to the senders.
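To make this concrete, here is a minimal sketch of what a sender-side helper could look like, assuming a hypothetical collector host and port and using the standard IO::Socket::INET and Sereal::Encoder modules (the real producer API and exact packet layout differ):

use strict;
use warnings;
use IO::Socket::INET;
use Sereal::Encoder;

# Hypothetical collector endpoint; the real aggregation layer address differs.
my $socket = IO::Socket::INET->new(
    Proto    => 'udp',
    PeerHost => 'collector.example.com',
    PeerPort => 6000,
) or die "cannot create UDP socket: $!";

my $encoder = Sereal::Encoder->new;

# Takes a HashMap (hashref) of free-form event data and sends it as one UDP packet.
sub send_event {
    my ($event) = @_;
    $socket->send( $encoder->encode($event) );
}

send_event( { type => 'WEB', subtype => 'app', cpu => 5 } );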

Aggregated Events

Every second, the events received during that second (called an epoch) are merged together, per DC number, type, and subtype, into an Array of events on the aggregation layer. At this point, it's important to get the smallest size possible, so the events of a given epoch are re-serialised as a Sereal blob, using these options:

compress       => Sereal::Encoder::SRL_ZLIB,
dedupe_strings => 1

dedupe_strings increases the serialisation time slightly, but it removes duplicated strings, which occur a lot since events are usually quite similar to one another. The compress option adds zlib (gzip) compression on top.
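As an illustration, here is a minimal sketch of how a batch of events for one epoch/DC/type/subtype could be re-serialised with those options (the sample events are made up; only the encoder options are taken from our setup):

use strict;
use warnings;
use Sereal::Encoder;

# events gathered during one second for one DC/type/subtype (made-up sample)
my @events = ( { cpu => 5 }, { cpu => 99 } );

my $encoder = Sereal::Encoder->new({
    compress       => Sereal::Encoder::SRL_ZLIB,  # zlib (gzip) compression
    dedupe_strings => 1,                          # store repeated strings only once
});

my $blob = $encoder->encode( \@events );
# the "=chk" + SHA1 integrity suffix described below is then appended to $blob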

We also add the checksum of the blob as a postfix, to be able to ensure data integrity later on. The following diagram shows what an aggregated blob of events looks like for a given epoch, DC, type, and subtype. You can get more information about the Sereal encoding in the Sereal Specification.

This is the general structure of an events blob:

<----------------------- Sereal part -----------------------><--------- suffix --------->

From left to right:

  • the Sereal magic string
  • the version-type byte: version 3, gzip compression
  • the header-suffix-size: 0
  • the uncompressed payload size, as a varint (24118 in this example)
  • the compressed payload size, as a varint (32031 in this example)
  • the gzip-compressed payload itself
  • the string "=chk" followed by the SHA1 of the Sereal part, used to check data integrity

The compressed payload contains the events themselves: an Array of HashMaps, serialised in a Sereal structure and gzip-compressed. Here is an example of a trivial payload of two events:

[ { cpu => 5 }, { cpu => 99 } ]

And the gzipped payload would be the compressed version of this binary string:

42 28 2A 01 63 63 70 75 05 28 2A 01 2F 05 20 63 0A

Reading the bytes in order: ARRAYREF_2, an Array of length 2; then REFN, a reference on a HASH of size 1 (POS_1); the first key is a SHORT_BINARY_3 holding the string "cpu", and its value is POS_5, the value 5. The second element starts the same way (REFN, HASH of size 1), but its key is a COPY of the earlier "cpu" string (pointing back at the 6th byte), and its value is a VARINT holding 99.

It can be hard to follow these hex digits [2], yet it's a nice illustration of why the Sereal format helps us reduce the size of serialised data. The second array element is encoded in far fewer bytes than the first one, since the key has already been seen. The resulting binary is then re-compressed. The Sereal implementation offers multiple compression algorithms, including Snappy and gzip.

A typical blob of events for one second/DC/type/subtype can weigh anywhere from several kilobytes to several megabytes, which translates into a (current) average of around 250 gigabytes per hour.

Side note: smaller subtypes on this level of aggregation aren't always used, because we want to minimise the data we transmit over our network by having good compression ratios. Therefore we split types into subtypes only when the blobs are big enough. The downside to this approach is that consumers have to fetch data for the whole type, then filter out only subtypes they want. We're looking at ways to find more balance here.

Data flow size and properties

Data flow properties are important, since they're used to decide how data should be stored:

  • The data is timed and all the events blobs are associated with an epoch. It’s important to bear in mind that events are schema-less, so the data is not a traditional time series.
  • Data can be considered read-only; the aggregated events blobs are written every second and almost never modified (history rewriting happens very rarely).
  • Once sent to the storage, the data must be available as soon as possible.

Data is used in different ways on the client side. A lot of consumers are actually daemons that will consume the fresh data as soon as possible - usually seconds after an event was emitted. A large number of clients read the last few hours of data in a chronological sequence. On rare occasions, consumers access random data that is over a few days old. Finally, consumers that want to work on larger amounts of older data would have to create Hadoop jobs.

There is a large volume of data to be moved and stored. In numerical terms:

  • Once serialized and compressed into blobs, it is usually larger than 50 MB/s
  • That's around 250 GB per hour and more than 6 TB per day
  • There is a daily peak hour, but the variance of the data size is not huge: there are no quiet periods
  • Yearly peak season stresses all our systems, including events transportation and storage, so we need to provision capacity for that

Why Riak

In order to find the best storage solution for our needs, we tested and benchmarked several different products and solutions.

The solutions had to reach the right balance of multiple features:

  • Read performance had to be high as a lot of external processes will use the data.
  • Write security was important, as we had to ensure that the continuous flow of data could be stored. Write performance should not be impacted by reads.
  • Horizontal scalability was of utmost importance, as our business and traffic continuously grows.
  • Data resilience was key: we didn't want to lose portions of our data because of a hardware problem.
  • It had to be possible for a small team to administer the storage and make it evolve.
  • The storage shouldn't require the data to have a specific schema or structure.
  • If possible, it should allow bringing code to the data, performing computation on the storage itself instead of having to pull data out of it.

After exploring a number of distributed file systems and databases, we chose Riak from among the distributed key-value stores. Riak had good performance and predictable behaviour when nodes fail and when scaling up. It also had the advantage of being easy to grasp and implement within a small team. Extending it was very easy (as we'll see in the next part of this series of blog posts), and we found the system very robust - we never had to face dramatic issues or data loss.

Disclaimer: This is not an endorsement of Riak. We compared it carefully to other solutions over a long period of time and it seemed to be the best product to suit our needs. As an example, we thoroughly tested Cassandra as an alternative: it had a larger community and similar performance, but was less robust and predictable, and it also lacked some advanced features. The choice is ultimately a question of priorities. The fact that our events are schema-less made it almost impossible for us to use solutions that require knowledge of the data structures. Also, we needed a small team to be able to operate the storage, and a way to process data on the cluster itself, using MapReduce or similar mechanisms.

Riak 101

The Riak cluster is a collection of nodes (in our case, physical servers), each of which claims ownership of a portion of the key space. Depending on the chosen replication factor, each key might be owned by multiple nodes. You can ask any node for a key and your request will be redirected to one of the owners; the same goes for writes.

On closer inspection of Riak, we see that keys are grouped into virtual nodes. Each physical node can own multiple virtual nodes. This simplifies data rebalancing when growing a cluster. Riak does not need to recalculate the owner for each individual key; it will only do it per virtual node.

We won't cover the Riak architecture in great detail in this post, but we recommend reading the following article for further information.

Riak clusters configuration

The primary goal of this storage is to keep the data safe. We went with the standard replication factor of three: even if two nodes owning the same data go down, we won't lose it.

Riak offers multiple back-ends for actual data storage. The main three are Memory, LevelDB, and Bitcask. We chose Bitcask, since it was suitable for our particular needs. Bitcask uses log-structured hash tables that provide very fast access. As data gets written to the storage, Bitcask simply appends data to a number of opened files. Even if a key is modified or deleted, the information will be written at the end of these storage files. An in-memory HashTable maps the keys with the position of their (latest) value in files. That way, at most one seek is needed to fetch data from the file system.

Data files are then periodically compacted, and Bitcask provides very good expiration flexibility. Since Riak is a temporary storage solution for us, we set it up with automatic expiration. Our expiration period varies depending on the current cluster shape, but usually falls between 8 and 11 days.

Bitcask keeps all of the keys of a node in memory, so keeping large numbers of individual events as key value pairs isn't trivial. We sidestep any issues by using aggregations of events (blobs), which drastically reduce the number of needed keys.

More information about Bitcask can be found here.

For our conflict resolution strategy, we use Last Write Wins. The nature of our data (which is immutable as we described before) allows us to avoid the need for conflict resolution.

The last important part of our setup is load balancing. It is crucial in an environment with a high level of reads and only a 1-gigabit network. We use our own solution for that, based on Zookeeper: zooanimal daemons run on the Riak nodes and collect information about system health. The information is then aggregated into simple text files containing an ordered list of the IP addresses of up-and-running Riak nodes that clients can connect to. All our Riak clients simply choose a random node to send their requests to.
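On the client side, the node selection then boils down to something like this sketch (the path of the node-list file is a made-up example):

use strict;
use warnings;

# Hypothetical location of the node list maintained by the zooanimal daemons.
my $nodes_file = '/var/run/riak-nodes.txt';

open my $fh, '<', $nodes_file or die "cannot read $nodes_file: $!";
chomp( my @nodes = <$fh> );
close $fh;

# Every client simply picks a random healthy node for its requests.
my $node = $nodes[ rand @nodes ];
print "sending requests to $node\n";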

We currently have two Riak clusters in different geographical locations, each of which has more than 30 nodes. More nodes mean more storage space, CPU power, RAM, and network bandwidth available.

Data Design

Riak is primarily a key-value store. Although it provides advanced features (secondary indexes, MapReduce, CRDTs), the simplest and most efficient way to store and retrieve data is to use the key-value model.

Riak has three concepts: a bucket is a namespace, in which a key is unique; a key is the identifier of the data and has to be stored in a bucket; a value is the data itself, with an associated MIME type which can make Riak aware of its type.

Riak doesn't provide efficient ways to retrieve the list of buckets or the list of keys by default [3]. When using Riak, it's important to know the bucket and key to access. This is usually resolved by using self-explanatory identifiers.

In our case, events are stored as Sereal-encoded blobs, and for each blob we know the datacenter, the type, the subtype, and of course the epoch at which it was created.

When we need to retrieve data, we always know the time we want. We are also confident in the list of our datacenters: it doesn't change unexpectedly, so we can make it static for our applications. We are not always sure, however, about which types or subtypes will appear in a given epoch for a given datacenter: in some seconds, events of certain types may simply not arrive.

We came up with this simple data design:

  • events blobs are stored in the events bucket, keys being <epoch>:<dc>:<type>:<subtype>:<chunk>
  • metadata are stored in the epochs bucket, keys being <epoch>:<dc> and values being the list of events keys for this epoch and DC combination

The chunk is an integer, starting at zero, used to split big events blobs into pieces smaller than 500 kilobytes each, so that Riak can function more efficiently.
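As a small illustration, here is a hypothetical helper (not our production code) that builds an events key from its components:

use strict;
use warnings;

# Hypothetical helper: build an events-bucket key from its components.
sub event_key {
    my ($epoch, $dc, $type, $subtype, $chunk) = @_;
    # the subtype may be empty, which gives keys like "1413813813:1:type2::0"
    return join ':', $epoch, $dc, $type, $subtype // '', $chunk;
}

print event_key( 1413813813, 1, 'type1', 'subtype1', 0 ), "\n";
# prints: 1413813813:1:type1:subtype1:0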

We'll see this data design in action when pushing data to Riak, in the next blog post of this series.

Next post: data processing outside of Riak

The next part of this blog post series will explain how we push data to and fetch data from Riak, in order to do real-time processing and batch processing.

Stay tuned!

Notes

[1] It is not strictly true that our events are schema-less. They obey the structure that the producers found the most useful and natural. But there are so many producers, each sending events with a different schema, that it's almost equivalent to considering them schema-less. Our events can be seen as structured, yet with so many schemas that they can't be traced. There is also complete technical freedom to change the structure of an event, if a producer sees it as useful.

[2] After spending some time looking at and decoding Sereal blobs, the human eye easily recognizes common data structures like small HashMaps, small Arrays, small Integers and VarInts, and of course, Strings, since their content is untouched. That makes Sereal an almost human readable serialisation format, especially after a hexdump.

[3] This can be worked around by using secondary indexes (2i) if the backend is eleveldb or Riak Search, to create additional indexes on keys, thus enabling listing them in various ways.

Using Riak as Events Storage - Part 2


In this post, we'll see how to push data to Riak, how to read it later on, and how to perform data processing out of Riak. As a reminder, we saw in the previous part that the events are schema-less HashTable data structures, grouped by epochs, data centers, types and subtypes, then serialised using Sereal and highly compressed.

If you missed part 1

We strongly recommend that you read part one of this blog series. The previous part explains how Booking.com collects and stores events from its backend into a central storage, and why Riak was chosen to do so.

Pushing to Riak

Pushing data to Riak is done by a number of relocators: daemons running on the aggregation layer that send the events blobs to Riak.

Side note: it's not recommended to store values larger than 1-2 MB in Riak (see this FAQ). Since our blobs can be 5-10 MB in size, we shard them into chunks of 500 KB each. Chunks are valid Sereal documents, which means we do not have to stitch chunks back together in order to retrieve the data.

This means that we have quite a lot of blobs to send to Riak, so to maximise our usage of network, I/O, and CPU, it's best to send data in a massively parallel way. To do so, we maintain a number of forked processes (20 per host is a good start), each of which pushes data to Riak.

Pushing data to Riak can be done using the HTTP API or the Protocol Buffers Client (PBC) API. PBC has slightly better performance.

Whatever protocol is used, it's important to maximise I/O utilisation. One way is to use an HTTP library that parallelises the requests in terms of I/O (YAHC is an example). Another method is to use an asynchronous Riak client like AnyEvent::Riak.

We use an in-house library to create and maintain a pool of forks, but there is more than one existing library on CPAN, such as Parallel::ForkManager.
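Here is a minimal sketch of that forking pattern using Parallel::ForkManager, with a hypothetical push_blob_to_riak() helper standing in for the actual HTTP or PBC call:

use strict;
use warnings;
use Parallel::ForkManager;

# Hypothetical helper standing in for the real HTTP/PBC call.
sub push_blob_to_riak {
    my ($key, $blob) = @_;
    # ... PUT $blob under $key in the events bucket ...
}

# Made-up sample: Riak key => serialised events chunk.
my %blobs = ( '1413813813:1:type1:subtype1:0' => '...binary sereal chunk...' );

my $pm = Parallel::ForkManager->new(20);   # 20 workers per host is a good start

for my $key ( keys %blobs ) {
    $pm->start and next;                   # parent: move on to the next key
    push_blob_to_riak( $key, $blobs{$key} );
    $pm->finish;                           # child exits
}
$pm->wait_all_children;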

PUT to Riak

Writing data to Riak is rather simple. For a given epoch, we have the list of events blobs, each of them having a different DC/type/subtype combination (remember, DC is short for Data Center). For example:

1413813813:1:type1:subtype1
1413813813:1:type1:subtype2
1413813813:1:type2:
1413813813:2:type1:subtype1
1413813813:2:type3:
`----+---' | `-+-' `---+--'
     |     |   |       `-> optional subtype
     |     |   `---------> type
     |     `-------------> DC number
     `-------------------> epoch

The first task is to slice the blobs into 500 KB chunks and add a postfix index number to their name. That gives:

1413813813:1:type1:subtype1:0
1413813813:1:type1:subtype1:1
1413813813:1:type1:subtype1:2
1413813813:1:type1:subtype2:0
1413813813:1:type2::0
1413813813:1:type2::1
1413813813:2:type1:subtype1:0
1413813813:2:type1:subtype1:1
1413813813:2:type3::0

Next, we can store all the event blobs in Riak in the events bucket. We can simulate it with curl:

curl -d <data> -XPUT "http://node:8098/buckets/events/keys/1413813813:1:type1:subtype1:0"
# ...
curl -d <data> -XPUT "http://node:8098/buckets/events/keys/1413813813:2:type3::0"

Side note: we store all events in each of the available Riak clusters. In other words, all events from all DCs will be stored in the Riak cluster which is in DC 1, as well as in the Riak cluster which is in DC 2. We do not use cross DC replication to achieve that - instead we simply push data to all our clusters from the relocators.

Once all the events blobs are stored, we can store the metadata, which is the list of the event keys, in the epochs bucket. This metadata is stored in one key per epoch and DC. So for the current example, we will have 2 keys: 1413813813-1 and 1413813813-2. We have chosen to store the list of events blobs names as pipe separated values. Here is a simulation with curl for DC 2:

curl-d"type1:subtype1:0|type1:subtype1:1|type3::0"-XPUT"http://riak_host:8098/buckets/epochs/keys/1413813813-2"

Because the epoch and DC are already in the key name, it's not necessary to repeat them in the content. It's important to push the metadata only after pushing the data, so that a consumer never sees a metadata key that references blobs which are not yet stored.
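The same two steps can be sketched in Perl with HTTP::Tiny over the HTTP API; the node address and chunk contents below are illustrative, and the real relocators parallelise these requests as described above:

use strict;
use warnings;
use HTTP::Tiny;

my $http  = HTTP::Tiny->new;
my $base  = 'http://riak_host:8098/buckets';   # any node of the cluster
my $epoch = 1413813813;
my $dc    = 2;

# Illustrative chunk names and contents for this epoch/DC.
my %chunks = (
    'type1:subtype1:0' => '...sereal chunk...',
    'type1:subtype1:1' => '...sereal chunk...',
    'type3::0'         => '...sereal chunk...',
);

# 1. store every data chunk in the "events" bucket ...
for my $name ( sort keys %chunks ) {
    $http->put( "$base/events/keys/$epoch:$dc:$name",
                { content => $chunks{$name} } );
}

# 2. ... and only then the metadata key in the "epochs" bucket
$http->put( "$base/epochs/keys/$epoch-$dc",
            { content => join '|', sort keys %chunks } );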

PUT options

When pushing data to the Riak cluster, we can use different attributes to change the way data is written - either by specifying them per request when using the PBC API, or by setting the bucket defaults.

Riak's documentation provides a comprehensive list of the parameters and their meaning. We have set these parameters as follows:

"n_val":3,"allow_mult":false,"last_write_wins":true,"w":3,"dw":0,"pw":0,

Here is a brief explanation of these parameters:

  • n_val:3 means that the data is replicated three times
  • allow_mult and last_write_wins prohibit siblings values; conflicts are resolved right away by using the last value written
  • w:3 means that when writing data to a node, we get a success response only when the data has been written to all the three replica nodes
  • dw:0 instructs Riak to wait only for the data to have reached the node, not the backend on the node, before returning success
  • pw:0 specifies that it's OK if the nodes that store the replicas are not the primary nodes (i.e. the ones that are supposed to hold the data) but replacement nodes, in case the primary ones were unavailable

In a nutshell, we have a reasonably robust way of writing data. Because our data is immutable and never modified, we don't want to have siblings or conflict resolution on the application level. Data loss could, in theory, happen if a major network issue happened just after having acknowledged a write, but before the data reached the backend. However, in the worst case we would lose a fraction of one second of events, which is acceptable for us.

Reading from Riak

This is how the data and metadata for a given epoch is laid out in Riak:

bucket: epochs
key:    1428415043-1
value:  1:cell0:WEB:app:chunk0|1:cell0:EMK::chunk0

bucket: events
key:    1428415043:1:cell0:WEB:app:chunk0
value:  <binary sereal blob>

bucket: events
key:    1428415043:1:cell0:EMK::chunk0
value:  <binary sereal blob>

Fetching one second of data from Riak is quite simple. Given a DC and an epoch, the process is as follows (a minimal sketch in Perl is shown after the list):

  • Read the metadata by fetching the key <epoch>-<dc> from the bucket "epochs"
  • Parse the metadata value, split on the pipe character to get data keys, and prepend the epoch to them
  • Reject data keys that we are not interested in by filtering on type/subtype
  • Fetch the data keys in parallel
  • Deserialise the data
  • Data is now ready for processing
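Here is a minimal sketch of these steps in Perl, using HTTP::Tiny and Sereal::Decoder; the node address is illustrative, the type filter is an example, and the handling of the "=chk" integrity suffix is simplified:

use strict;
use warnings;
use HTTP::Tiny;
use Sereal::Decoder;

my $http    = HTTP::Tiny->new;
my $base    = 'http://riak_host:8098/buckets';   # any node of the cluster
my $decoder = Sereal::Decoder->new;

my ($epoch, $dc) = (1428415043, 1);

# 1. fetch and parse the metadata for this epoch/DC
my $meta = $http->get("$base/epochs/keys/$epoch-$dc");
die "no metadata for $epoch-$dc" unless $meta->{success};
my @data_keys = map { "$epoch:$_" } split /\|/, $meta->{content};

# 2. keep only the types we care about (here: WEB events, as an example)
@data_keys = grep { /:WEB:/ } @data_keys;

# 3. fetch and deserialise each chunk (the real consumers parallelise this)
for my $key (@data_keys) {
    my $res = $http->get("$base/events/keys/$key");
    next unless $res->{success};

    # strip the "=chk" + SHA1 integrity suffix before decoding (simplified)
    my $pos    = rindex $res->{content}, '=chk';
    my $sereal = $pos >= 0 ? substr( $res->{content}, 0, $pos ) : $res->{content};

    my $events = $decoder->decode($sereal);   # arrayref of event HashMaps
    # ... process @$events ...
}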

Reading a time range of data is done the same way. Fetching ten minutes of data from Wed, 01 Jul 2015 11:00:00 GMT would be done by enumerating all the epochs, in this case:

1435748400
1435748401
1435748402
...
1435749000

Then, for each epoch, fetch the data as previously mentioned. It should be noted that Riak is specifically tailored for this kind of workload, where multiple parallel processes perform a huge number of small requests on different keys. This is where distributed systems shine.
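Enumerating those epochs is straightforward; for instance, with the core Time::Piece module:

use strict;
use warnings;
use Time::Piece;

# 2015-07-01 11:00:00 GMT as a Unix epoch (the class-method strptime assumes UTC)
my $start = Time::Piece->strptime( '2015-07-01 11:00:00', '%Y-%m-%d %H:%M:%S' )->epoch;

# ten minutes worth of epochs, one per second
my @epochs = map { $start + $_ } 0 .. 599;
# each epoch is then fetched exactly as shown above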

GET options

The events bucket (where the event data is stored) has the following properties:

"r":1,"pr":0,"rw":"quorum","basic_quorum":true,"notfound_ok":true,

Again, let's look at these parameters in detail:

  • r:1 means that when fetching data, as soon as we have a reply from one replica node, Riak considers it a valid reply and won't compare it with the other replicas
  • pr:0 removes the requirement that the data comes from a primary node
  • notfound_ok:true makes it so that as soon as one node can't find a key, Riak considers that the key doesn't exist

These parameter values allow fetches to be as fast as possible. In theory, such values don't protect against conflicts or data corruption. However, as seen in the "Aggregated Events" section of the first post, every events blob carries a checksum suffix, which lets the consumer verify that there is no data corruption when fetching it from Riak. The fact that the events are never modified ensures that no version conflict can occur. This is why having such "careless" parameter values is not an issue for this use case.

Real time data processing outside of Riak

After the events are properly stored in Riak, it's time to use it. The first usage is quite simple: extract data out of it and process it on dedicated machines, usually grouped in clusters or aggregations of machines that perform the same kind of analysis. These machines are called consumers, and they usually run daemons that fetch data from Riak, either continuously or on demand. Most of the continuous consumers are actually small clusters of machines spreading the load of fetching data.

              ------------------
              |      Riak      |
              | events storage |
              ------------------
              /        |       \
    70MB/s  /   70MB/s |  . . .  \ 70MB/s
           /           |          \
   ----------     ----------     ----------
   |        |     |        |     |        |
   |fetchers|     |fetchers|     |fetchers|
   |        |     |        |     |        |
   ----------     ----------     ----------
    consumer       consumer       consumer
    cluster        cluster        cluster

Some data processing is required at near real-time. This is the case for monitoring, and building graphs. Booking.com heavily uses graphs at every layer of its technical stack. A big portion of graphs are generated from Events. Data is fetched every second from the Riak storage, processed, and dedicated graphing data is sent to an in-house Graphite cluster.

Other forms of monitoring also consume the events stream: it is fetched continuously and aggregated into per-second, per-minute, and daily aggregations in external databases, which are then provided to multiple departments via internal tools.

These kinds of processes try to be as close as possible to real-time. Currently there are 10 to 15 seconds of lag. This lag could be shorter: a portion of it is due to the collection part of the pipeline, and an even bigger part of it is due to the re-serialisation of the events as they are grouped together to reduce their size. A good deal of optimisation could be done there to reduce the lag down to a couple of seconds [1]. However, there was no operational requirement to reduce it, and 15 seconds is small enough for our current needs.

Another way of using the data is to stick to real-time, but accumulate seconds in periods. One example is our Anomaly Detector, which continuously fetches events from the Riak clusters. However, instead of using the data right away, it accumulates it on short moving windows of time (every few minutes) and applies statistical algorithms on it. The goal is to detect anomalous patterns in our data stream and provide the first alert that prompts further action. Needless to say, this client is critical.

Another similar usage is done when gathering data related to A/B testing. A large number of machines harvest data from the events' flow before processing it and storing the results in dedicated databases for use in experimentation-related tooling.

There are a number of other usages of the data outside of Riak, including manually looking at events to check the behaviour of new features or analysing past issues and outages.

Limitations of data processing outside of Riak

Fetching data outside of the Riak clusters raises some issues that are difficult to work around without changing the processing mechanism.

First of all, there is a clear network bandwidth limitation to the design: the more consumer clusters there are, the more network bandwidth is used. Even with large clusters (more than 30 nodes), it's relatively easy to exhaust the network capacity of all the nodes as more and more fetchers try to get data from them.

Secondly, each consumer cluster tends to use only a small part of the events flow. Even though consumers can filter out types, subtypes, and DCs, the resulting events blobs still contain a large quantity of data that is useless to the consumer. For storage efficiency, events need to be stored as large compressed serialised blobs, so splitting them more by allowing more subtyping is not possible [2].

Additionally, statically splitting the events content is too rigid, since the use of the data changes over time and we do not want to be a bottleneck to change for our downstream consumers. Part of an event of a given type that was critical two years ago might be used for minor monitoring now. A subtype that was heavily used for six months may now be rarely used because of a technical change in the producers.

Finally, the amount of CPU time needed to uncompress, load, and filter the big events blobs is not tiny. It usually takes around five seconds to fetch, uncompress, and filter one second's worth of events. This means that any real-time data crunching requires multiple threads and likely multiple hosts - usually a small cluster. It would be much simpler if Riak could provide a real-time stream of data exactly tailored to the consumer's needs.

Next post: data filtering and processing inside Riak

What if we could remove the CPU limitations by doing processing on the Riak cluster itself? What if we could work around the network bandwidth issue by generating sub-streams on the fly and in real-time on the Riak cluster?

This is exactly what we implemented, using simple concepts, and leveraging the ease of use and hackability of Riak. These concepts and implementations will be described in the next part of this blog posts series!

Notes

[1] Some optimisation has been done; the main action was to implement a module to split a Sereal blob without deserialising it, thus speeding up the process greatly. This module can be found here: Sereal::Splitter. Most of the time spent splitting Sereal blobs is now spent decompressing them. The next optimisation step would be to use a compression that decompresses faster than the currently used gzip, for instance LZ4_HC.

[2] At that point, the attentive reader may jump in the air and proclaim "LevelDB and Snappy compression!". It is indeed possible to use LevelDB as the Riak storage backend, which provides an option to use Snappy compression on the blocks of data stored. However, this compression algorithm is not good enough for our needs (using gzip reduced the size by a factor of almost 2). Also, LevelDB (or at least the eleveldb implementation used in Riak) doesn't provide automatic expiration, which is critical to us, and had issues with reclaiming free space after key deletions in versions below 2.x.

Keeping Control of the Front-end


At Booking.com, hundreds of developers and designers contribute daily to our codebase, which leads to potential complications with code discoverability, maintenance, and reuse. In this post, we're going to focus on the client-side aspect of these challenges, and introduce some of the techniques we use to tackle them.

Prefixing and Namespacing

Because of the size of our codebase and the number of people introducing changes to it daily, it can be difficult to maintain unique identifiers for all of our pages' components.

Both JavaScript and Cascading Style Sheets make use of identifiers to work properly, by means of variable names and selectors respectively, both on a global execution scope.

Let's start with JavaScript. We make use of namespaces, which are object properties of a global variable. (We also implement modules, which will be discussed later in the article.)

// The only global variable
var B = {};

// The search namespace
B.search = {};

// Check-in date of the search
B.search.checkIn = new Date(2015, 3, 1);

// Travel purpose of the search
B.search.travelPurpose = 'business';

In the example above, B.search is the namespace we're using to visually identify our code. Notice how travelPurpose has its scope within search, clarifying its context and meaning.

CSS is different. Because CSS doesn't provide a scalable way of grouping selectors, we make use of prefixes instead. We also make sure all selectors are as specific as possible - to prevent collisions. For example, in our files we already have about 4,000 class selectors containing the term item in their declaration.

Imagine the following simple case: a list of facilities on the hotel page.

<ulclass="facilities"><liclass="item">Wi-Fi</li><liclass="item">Breakfast</li></ul>

That might interfere with another team's work that is adding a universal menu on the website's header.

<ulclass="menu"><liclass="item">Home</li></ul>

In both cases, .item will have specific CSS rules that could be overridden, thus generating unexpected results. Sometimes these interactions happen on a specific page that was beyond the scope of the developer's tests.

To prevent these conflicts we often use prefixing:

<ulclass="hp-facilities"><liclass="hp-facilites__facility">Wi-Fi</li><liclass="hp-facilites__facility">Breakfast</li></ul>

Since we invest so much into experimentation through A/B testing, a considerable amount of code becomes irrelevant when its related experiment expires.

Because we want to avoid code rot in our codebase, we want to keep only the parts that we actually need, and those irrelevant pieces of code must be removed periodically. Being able to quickly search the code for a specific token, such as a CSS class name, is a key requirement for this clean up.

Control of Execution

It is very important that our JavaScript code runs in a controlled fashion. Our code needs to be precise by only executing when it is necessary for a certain feature, page, or event. It also needs to be robust, preventing interference between unrelated features.

Suppose we have three script files concatenated, and the resulting file is added to every page on our website.

// from tooltip.js
$('.tooltip').addTooltip();

// from available_rooms.js
var prices = $('#prices .price');
prices[0].scrollTop = 0;

// from sticky_user_bar.js
$(window).scroll(function() {
  $('.user_bar').css('top', document.body.scrollTop);
});

Any part that fails will prevent the next part from executing. For instance, if there is no element on the page that matches #prices .price there will be an error stopping everything else from executing, breaking the user bar behavior.

Also, there might be code that shouldn't be running, which is undesirable. In the example, if no element matches .user_bar on the page, the scrolling event is still monitored, wasting CPU cycles.

To mitigate this, we chose to develop an internal flow controller that provides an execution control API based on AMD, built on top of Almond. It uses setTimeout to provide a separate context, which then isolates failures and prevents the blockage of the main thread.

As a bonus, because the code is wrapped, we can easily label those calls for profiling and find features that might be wasting resources.

This also helps isolate the experimental part of our A/B testing, making it easy to find and clean up failed attempts, as mentioned in the previous section.

Restricting Your Execution

While CSS code avoids clashes by using namespaces, JavaScript should not leak behavior to DOM elements or to other unrelated components in the same page.

Part of that problem can be prevented by following well-established coding principles, such as avoiding global variables (enforced by using the strict mode), modularizing the specific parts of your code, and so forth.

We also encourage our teams to develop context specific software to avoid side-effects.

$('form').on('submit', function() {
  var destination = $('.destination');
});

Here, every form element will have a submit event handler attached to it. Also, it searches the entire document DOM tree for the .destination selector, which might stop working as soon as someone inadvertently adds an element that matches the same selector.

An even better approach stresses specifying targets in more detail, aiming to only affect what needs to be affected.

$('.js-searchbox-form').on('submit', function(event) {
  var form = $(event.currentTarget);
  var destination = form.find('.js-destination-input');
});

In this scenario, the class names are clear and specific, and the code will only look for elements inside of its own form, preventing possible leaking.

Modularization

Having multiple teams working independently at the same time allows different development styles across the codebase. One developer might like wrapping her own code in an IIFE, and another might prefer the prototype pattern.

While this is not a problem if the code is achieving a simple task, more complex solutions might become too big to understand, to manipulate, or to maintain.

function showTheGallery(hotelId) {
  /* Here goes 31415 lines of code involving all event handlers, the overlay
     behind the gallery, capturing the keyboard events to navigate and close
     the gallery, logic to preload the images, logic to center the main image
     relative to another element, scrolling of the thumbnails, obtaining the
     images from a specific variable in the code, showing the gallery, etc. */
}

showTheGallery(42);

As you can see, parts of the code can become too complicated and isolated, making it difficult to understand and debug. It also prevents any kind of reusability.

However, we can break the code into smaller blocks that serve a specific purpose, as described in the DRY principle as "Every piece of knowledge must have a single, unambiguous, authoritative representation within a system".

define('photo-gallery', [
  'component-maker',
  'dom-position',
  'dom-scroll',
  'env-variables',
  'image-preload',
  'keyboard-events',
  'overlay'
], function(...) {
  // Tying them all together nicely, exporting an extensible component
});

Here, every dependency is self-contained, specific enough, and totally reusable by others, and the resulting object allows quick extension and behavior changing, so the original code can be adapted.

Components

Following the principles of restricting the behavior of your code to exactly where you want it to run, and the fact that we want to build a modularized and reusable codebase, we developed a simple solution called B.components.

The principle behind it is to add behavior to one or more DOM nodes. It only executes the code when the element exists, and allows one component to extend the features of another, facilitating reusability.

<buttontype="button"data-component="alert">Alert</button>

In this example, we add behavior to a specific button in the code. The JavaScript doesn't need to know which exact element to target in the document, since it's the button that requests a behavior, not the other way around.

The code receives a pointer to the DOM node, and can perform the necessary action, such as listening to click events on this reference and triggering an alert window.

The benefit of this approach is its DOM-based flexibility. We might change every aspect of the HTML, and even add more elements, while preserving the same behavior.

<a data-component="alert">Alert Anchor</a>
<span data-component="alert">Alert Span</span>

We use AMD as the foundation to store the component definitions as modules, the same setTimeout technique mentioned before for containing the execution, and to create a new instance for each component definition found in the DOM or any specified node.

This solution frees us from knowing exactly what to do when rendering dynamic HTML on the page.

var templateCode = 'Complex HTML structure';
$('body').append(templateCode);

// We added tooltips, so initialize them
$('.tooltip').tooltip();

// We also added a lightbox, but that uses another API
LightboxFactory.addLightbox('#lightbox-a', { lightbox: 'options' });

// Did we forget something? Did some API change?

This is all replaced by a one-liner:

$(templateCode).appendTo('body').loadComponents();

The method $.fn.loadComponents will take care of finding the necessary components, and each component will be given the opportunity to initialize itself, all under the same predictable API.

The Big Picture

Because we are a big company with hundreds of developers, we exercise care so that our contributions leave our code better than we found it, keeping it more maintainable in the long run.

Organizing and namespacing our CSS and JavaScript blocks helps to make our code easily findable and robust. Controlling our executed code scope and only running it when really necessary makes it predictable and fast. Thinking about modules and UI components ensures we are able to quickly understand and reuse our solutions. All of this is done while keeping a good sense of community and respect for each other.

These are just some of the techniques we adopted and created at Booking.com to help us deal with the challenges of growth.

The important lesson is to always look at the big picture, and never assume you are in a perfectly controlled environment.

Code must be resilient.

Using multivariant tests to determine performance impact


When we introduce new features on our website, sometimes it’s not simply the behaviour of our users that changes. The behaviour of our own systems can change, too.

For example: A new feature might improve conversion (changing a user’s behaviour) while also slowing down our site rendering (changing the behaviour of our systems). This becomes interesting when you realize that the second effect might influence the first – a rendering slowdown might decrease conversion, for instance.

Sometimes, the opposing effects that turn up in our results make for interesting investigation and developments. At Booking.com, we’ve found a way to use multivariant testing to quantify these opposing effects.

Moving forward from traditional A/B testing

We don't change anything on our website without first validating it (traditionally through A/B split testing) and all but a select few of our experiments run on 100% of our eligible traffic. Small CSS changes, the introduction of new features, and even infrastructure changes like a new availability search engine, a new translation infrastructure, or a backend Perl version upgrade all must first go through the same testing.

It shouldn't therefore be any surprise that we run in excess of a thousand experiments in parallel. Our in-house experimentation platform is very well integrated into the foundations of our stack to make this possible. And many of the new features that get put through the experimentation process come with changes to a few layers.

Suppose we want to A/B test the effects of adding new useful data to our country landing page (the page where we hope you land after searching your favourite search engine for "Hotels in Italy"). We think that our visitors might be interested in having some Italian visa information available on that page. Such a feature needs a new data store for the visa information, extra effort to collect that data, UX design, and it won’t even be available for all our visitors as it probably won't be fully-formed in the beginning.

For the sake of argument, let's say that getting the Italian visa information is a kind-of-expensive operation. Let's also say that we don't have a whole lot of this content created yet – maybe we only have visa information written up for 10% of the countries our visitors want.

So, before investing effort into optimizing the data store or extending the data set, we'd like to run an A/B test to find out whether our visitors actually even like the feature. Here's the first way we could choose to implement this experiment (let's call it 'Experiment 1'):

Experiment 1: The simple A/B test

# data store query, introduces some performance overhead
have_data = get_visa_data()

# track_experiment() returns False if this user is in the control group, and
# True for the treatment group
if have_data and track_experiment("show_visa_data"):
    render_visa_data()

What's good about this setup is that the difference in metrics like the number of bookings made or customer service interactions between the unchanged original and the experiment variant can tell us whether the users liked this feature.

Unfortunately, it doesn't tell us anything about the business impact of having this feature: the performance impact of ‘get_visa_data()’ may be driving some users away from our website, and we’re not measuring that in the way we’ve decided to set up Experiment 1.

You might then choose to address this issue by implementing the experiment in a different way.

Experiment 2: The A/B test refined

if track_experiment("show_visa_data"):
    have_data = get_visa_data()
    if have_data:
        render_visa_data()

This experiment accounts for and measures business impact, which Experiment 1 couldn’t do. But there's a downside: if we have visa data for only 10% of our visitors, we would expose only 5% of the visitors in this experiment to the visible change that might entice them to convert better. It’s likely that this dilutes the effect so much as to be unmeasurable.

In other words, Experiment 2 is very likely to come out inconclusive. It might even be negative, due to the performance cost incurred for all users in the variant. If it is inconclusive (or negative), that might mean one of these two cases:

  1. Our users don't care about visa information at all;
  2. Our users love visa information, but that effect was diluted by the low availability of the data and/or negated by the negative impact of the performance cost.

It would be very valuable if we could tell (1) from (2) as it would influence our decision-making process and point at what we should do next. In case (1), we would abandon the idea and decide to better focus our effort on the next idea that comes along. In case (2) however, we can invest time in extending the data set and/or optimizing the data store.

We found a solution to that which is now quite commonplace in our organization. Here's what we do: instead of running an A/B test, we run a multivariant test (an "A/B/C" test):

Experiment 3: multivariant test

  1. No change;
  2. Get the visa data. Even if there's data, don't render it;
  3. Get the visa data. If there's data for this visitor, render it.

The comparison between (A) and (C) now gives the same data as Experiment 2 from the example above and tells you about the complete business impact (the one drawback of this three-way split is that it comes with a slight loss of statistical power compared to a 50/50 A/B split).

The comparison between (A) and (B) tells us about the impact of doing the data lookup. In other words, it tells us whether there may be something to gain by improving performance there.

The comparison between (B) and (C) tells us whether users for whom we had data liked the feature. As it is, we are quite unlikely to detect this difference, as the visual change is only available for 10% of visitors in both groups. Fortunately, our platform makes it very easy to "zoom in" on this 10% subgroup and have a full experiment report that only includes them, giving us much better odds of detecting a difference. In fact, it gives us the same comparison as Experiment 1 yields, without any of the downsides.

This Experiment 3 multivariant setup adds another exciting possibility. While normal A/B tests can only give us a yes/no decision about enabling the feature for all our users, the new setup has an additional possible outcome akin to saying that a feature is promising but needs improvement.

As a company, we take pride in taking small steps towards optimizing our website, measuring along the way and learning from every result. This multivariant experimentation setup has proved to be a great resource in our toolbox with which to do just that.

Using Riak as Events Storage - Part 3


In this post, we'll see how to apply transformations to the events data stored in Riak without the data leaving the cluster. We saw in the previous parts how to gather, aggregate and store events in Riak, and how to fetch them for external processing. We'll see now how to reduce bandwidth usage by applying data transformation without moving the events outside of the cluster.

If you missed Part 2

We strongly recommend that you read part 2 of this blog series. The previous parts explains how Booking.com collects and stores events from its backend into a central storage, and how we use it to do events analysis.

The Theory

The reasoning is actually very simple. The final goal is to perform real-time processing of the events blobs that are stored in Riak. Data processing usually produces a very small result, so it is a waste of network bandwidth to ship the full events blobs out of Riak and run the analysis on consumer clusters, only to end up with a tiny aggregate.

So instead of bringing the data to the processing code, let's bring the code to the data.

This is a typical use case for MapReduce. We're going to see how to use MapReduce on our dataset in Riak, and also why it's not a usable solution.

For the rest of this post, we need a term for all the events that are stored for a time period of exactly one second. Since we already store our events per second (and call that second an "epoch"), we'll refer to one second's worth of stored events as epoch-data.

A first attempt: MapReduce

MapReduce is a very well known (if somewhat outdated) way of bringing the code near the data and distributing data processing. There are excellent papers explaining this approach for further background study.

Riak has a very good MapReduce implementation. MapReduce jobs can be written in Javascript or Erlang. We highly recommend using Erlang for better performance.

To perform events processing of an epoch-data on Riak, the MapReduce job would look like the following. The metadata and data keys concepts are explained in part 2 of this blog series, and a sketch of how such a job can be submitted is shown after the list. Here are the MapReduce phases:

  • Given a list of epochs and DCs, the input is the list of metadata keys, and as additional parameter, the processing code to apply to the data.
  • A first Map phase reads the metadata values and returns a list of data keys.
  • A second Map phase reads the data values, deserialises it, applies the processing code and returns the list of results.
  • A Reduce phase aggregates the results together.
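For illustration only, this is roughly what submitting such a job over Riak's HTTP MapReduce endpoint could look like from Perl. The Erlang module and function names are hypothetical placeholders for our processing code, and the real jobs carry more inputs and arguments:

use strict;
use warnings;
use HTTP::Tiny;
use JSON::PP;

my $job = {
    # inputs: the metadata keys for the epochs/DCs we are interested in
    inputs => [ [ 'epochs', '1428415043-1' ], [ 'epochs', '1428415043-2' ] ],
    query  => [
        # map 1: read metadata values, return the list of data keys
        { map => { language => 'erlang',
                   module   => 'events_mr',          # hypothetical module
                   function => 'metadata_to_keys' } },
        # map 2: read data values, deserialise, apply the processing code
        { map => { language => 'erlang',
                   module   => 'events_mr',
                   function => 'process_events' } },
        # reduce: aggregate the results together and return them
        { reduce => { language => 'erlang',
                      module   => 'events_mr',
                      function => 'aggregate',
                      keep     => JSON::PP::true() } },
    ],
};

my $res = HTTP::Tiny->new->post(
    'http://riak_host:8098/mapred',
    {
        headers => { 'Content-Type' => 'application/json' },
        content => encode_json($job),
    },
);
print $res->{content}, "\n";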


This works just fine. For one epoch-data, one data processing code is properly mapped to the events, the data deserialised and processed in around 0.1 second (on our initial 12 nodes cluster). This is by itself an important result: it's taking less than one second to fully process one second worth of events. Riak makes it possible to implement a real-time MapReduce processing system [1].

Should we just use MapReduce and be done with it? Not really, because our use case involves multiple consumers doing different data processing at the same time. Let's see why this is an issue.

The metrics

To be able to test the MapReduce solution, we need a use case and some metrics to measure.

The use case is the following: every second, multiple consumers (say 20) need the result of one of the data processing (say 10) of the previous second.

We'll consider that an epoch-data is roughly 70 MB and that data processing results are around 10 KB each. Also, we'll consider that the Riak cluster is a 30-node ring with 10 real CPUs available for data processing on each node.

The first metric we can measure is the external network bandwidth usage. This is the first factor that encouraged us to move away from fetching the events out of Riak to do external processing. External bandwidth usage is the bandwidth used to transfer data between the cluster as a whole, and the outside world.

The second metric is the internal network bandwidth usage. This represents the network used between the nodes, inside of the Riak cluster.

Another metric is the time (more precisely, the CPU time) it takes to deserialise the data. Because of the heavily compressed nature of our data, decompressing and deserialising one epoch-data takes roughly 5 seconds.

The fourth metric is the CPU time it takes to process the deserialised data, analyse it, and produce a result. This is very fast (compared to deserialisation); let's assume 0.01 sec at most.

Note: we are not taking into account the impact of storing the data in the cluster (remember that events blobs are being stored every second) because it’s impacting the system the same way in both external processing and MapReduce.

Metrics when doing external processing

When doing standard data processing as seen in the previous part of this blog series, one epoch-data is fetched out from Riak, and deserialised and processed outside of Riak.

External bandwidth usage

The external bandwidth usage is high. For each query, the epoch-data is transferred, so that's 20 queries times 70MB/s = 1400 MB/s. Of course, this number is properly spread across all the nodes, but that's still roughly 1400 / 30 = 47 MB/s. That, however, is just for the data processing. There is a small overhead that comes from the clusterised nature of the system and from gossiping, so let's round that number to 50 MB/s per node, in external output network bandwidth usage.

Internal bandwidth usage

The internal bandwidth usage is very high. Each time a key value is requested, Riak will check its 3 replicas, and return the value. So 3 x 20 x 70MB/s = 4200 MB/s. Per node, it's 4200 MB/s / 30 = 140 MB/s

Deserialise time

Deserialise time is zero: the data is deserialised outside of Riak.

Processing time

Processing time is zero: the data is processed outside of Riak.

Metrics when using MapReduce

When using MapReduce, the data processing code is sent to Riak, included in an ad hoc MapReduce job, and executed on the Riak cluster by sending the orders to the nodes where the epoch-data related data chunks are stored.

External bandwidth usage

When using MapReduce to perform data processing jobs, there is certainly a huge gain in network bandwidth usage. For each query, only the results are transferred, so 20 x 10KB/s = 200 KB/s.

Internal bandwidth usage

The internal usage is also very low: it's only used to spread the MapReduce jobs, transfer the results, and do bookkeeping. It's hard to put a proper number on it because of the way jobs and data are spread on the cluster, but overall it's using a couple of MB/s at most.

Deserialise time

Deserialise time is high: for each query, the data is deserialised, so 20 x 5 = 100 sec of CPU time for the whole cluster. Each node has 10 CPUs available for deserialisation, which gives 300 CPUs across the 30 nodes, so the time needed to deserialise one second worth of data is 100 / 300 = 0.33 sec. We can easily see that this is an issue, because already one third of all our CPU power is used for deserialising the same data in each MapReduce instance. It's a big waste of CPU time.

Processing time

Processing time is 20 x 0.01 = 0.2s for the whole cluster. This is really low compared to the deserialise time.

Limitations of MapReduce

As we've seen, using MapReduce has its advantages: it's a well-known standard, and it allows us to create real-time processing jobs. However, it doesn't scale: because MapReduce jobs are isolated, they can't share the deserialised data, and CPU time is wasted, so it's not possible to have more than one or two dozen real-time data processing jobs running at the same time.

It's possible to overcome this difficulty by caching the deserialised data in memory, within the Erlang VM, on each node. CPU time would still be 3 times higher than needed (because a map job can run on any of the 3 replicas that contains the targeted data) but at least it wouldn't be tied to the number of parallel jobs.

Another issue is that writing MapReduce jobs is not that easy, especially because, in this case, knowing Erlang is a prerequisite.

Last but not least, it's possible to create very heavy MapReduce jobs, easily consuming all the CPU time. This directly impacts the performance and reliability of the cluster, and in extreme cases the cluster may be unable to store incoming events at a sufficient pace. It's not trivial to fully protect the cluster against MapReduce misuse.

A better solution: post-commit hooks

We explored a different approach to real-time data processing on the cluster: one that scales properly by deserialising the data only once, lets us cap its CPU usage, and lets us write the processing jobs in any language, while still bringing the code to the data and removing most of the internal and external network usage.

This technical solution is what is currently in production at Booking.com on our Riak events storage clusters, and it uses post-commit hooks and a companion service on the cluster nodes.

We'll explore this solution in detail in the next blog post, so stay tuned!

Notes

[1] Using MapReduce on Riak is usually somewhat discouraged because most of the time it's used in the wrong way, for instance to perform bulk fetches or inserts, or to traverse a bucket. The MapReduce implementation in Riak is very powerful and efficient, but it must be used properly. It works best when used on a small number of keys, even if the size of the data processed is very large. The fewer the keys, the less bookkeeping and the better the performance. In our case, there are only a couple of hundred keys for one second worth of data (but somewhat large values, around 400K), which is not a lot. Hence the great performance of MapReduce we've witnessed. YMMV.


The one where Booking.com sent travellers to Mars


Have you ever dreamt of embarking on a voyage through space with the USS Enterprise?

The date is October 1st, 2015. For the last 3 hours I’ve been researching as much as I can about Mars occasionally stopping to drool over those beautiful red-tinted landscapes. I’m working on an email campaign titled Add Mars to your travel bucket-list — funny for a designer at Booking.com, “Planet Earth’s #1 Accommodation Site”.

First, let me tell you how I got myself into this project. This morning I was riding the metro to work. I’m in my own world, head tilted downwards and browsing on Flipboard 1 to get my daily dose of news. That’s when it found me: NASA Confirms Evidence That Liquid Water Flows on Today’s Mars2. Water! On Mars?! Mind blown. I devoured the entire article, washing it down with anything else I could find about Mars — hydrated salts, perchlorate, and various other terms that sounded cool (but were too scientific for me to grasp). I suddenly had this longing desire to visit that place, to explore, to get my boots covered in red dust.

At Booking.com, ideas flow freely and come, mostly, from the ground up. Everyone is encouraged to contribute ideas and to follow through with them. Even so, when I stole a few minutes from the Product Owner of Email Marketing to pitch a travel campaign for Mars, her reaction was better than I could have anticipated. Not only did she agree to change our Sunday campaign, but she rearranged her whole day so that we could start searching for otherworldly holiday destinations right here on Earth. Here are some of the gems that we found:

Atacama, the driest desert on Earth. This is where NASA tests some of its rovers before they fly to Mars - (cc) ESO/B. Tafreshi — http://www.eso.org/public/images/potw1241a/
The iron-stained Rio Tinto river bears an uncanny resemblance to Mars’ hue.

These places are scattered all over the globe, some only 3 hours away from our office here in Amsterdam.

Once the list was complete, we took the idea to our copywriters, and they eagerly jumped on board. After that, all we needed was a back-end developer; 14 hours later we’d created an email and this landing page from scratch.

My hand was shaking as I hovered with the cursor over the ‘send’ button. What we’d created was to be shared with millions of people out there. Happily, the response carried some really positive energy — overwhelming and humbling at the same time.

Today I look to the seven-year-old me, the one who had a dream to achieve something great (only knowing it would have to involve a keyboard somehow). I know he would be smiling — against all odds his dream came true. I’m grateful for this day; even more so because it’s not the first time I’ve felt like this while working at Booking.com.

All of this was possible thanks to an open environment and stellar talent. Even with such a fast-growing department, individuals are still trusted to make the best decisions for the company. We’re constantly encouraged to think outside the box, to take chances, and discover new possibilities to grow our business.

Without personal contributions we wouldn’t be standing here today. So I want to challenge you to step back for 15 minutes and let your imagination fly. If sending people to Mars wasn’t too crazy an idea for a company specializing in accommodations on Earth, then what can you come up with to positively impact your company?

If you would like to experience the same feelings of freedom and achievement, join Booking.com. We can always use more dreamers :)


Progressive Web Apps with Service Workers


In this post we will discuss Progressive Web Apps and Service Workers. How can they help modern-day mobile web users, and how are we experimenting with them at Booking.com? We will share some challenges we've encountered, as well as some of our learnings.

What is a Progressive Web App?

A Progressive Web App (PWA) is a term Google coined to describe its vision of app-like web experiences, in which web pages are able to offer many features once deemed app-only: connectivity control, push notifications, home screen icons, and the like.

Before this initiative, some of the features in question were already available to mobile web users (although to a limited extent):

  • Add to home1 screen2 (requires manual actions)
  • Fullscreen mode3
  • Application Cache for offline access4
  • Notifications API5

Web pages, however, are still not the first choice when it comes to delivering the best possible experience on a mobile device (despite being more discoverable in search engines and potentially saving the nuisance of downloading and installing megabytes, which is especially important for first-time visitors and visitors on 2G/3G connections). All too often we see websites adding banners or interstitial popups6, begging users to download their apps, even going so far as to drop their mobile version completely7 (only to be resurrected8 5 months later). The recurring justifications are that native apps run more smoothly and have better means to re-engage with customers, and that the web environment simply lacks graceful fallbacks in flaky network conditions.

A Progressive Web App addresses all of these issues except rendering performance. Building a Progressive Web App does not force you to drastically change your current front-end architecture or the way you work; it only gives you a set of tools to enhance the web experience progressively. At the end of the day, you'll be able to have:

  • A home screen icon that opens the website in fullscreen
  • Native dialogs to let users add your app to their home screens with one click
  • A fast and always-usable site even in flaky network connections
  • Push notifications just like native apps

Most of these features are made possible by service workers.

What is a Service Worker?

Service workers essentially act as proxy servers that sit between web applications, and between the browser and network (when available). They are intended to (amongst other things) enable the creation of effective offline experiences, intercepting network requests and taking appropriate action based on whether the network is available and updated assets reside on the server. They will also allow access to push notifications and background sync APIs. - MDN

In short, a service worker is an asynchronous background thread that takes control of all network requests in a page.

Quick Facts

  • Service Workers run in a different context, thus have no access to DOM elements or JavaScript variables in the main thread
  • For security reasons the client page (the main thread) must be served over HTTPS and the service worker script must live in the same origin, but all requests originating from that page can be intercepted by service workers, even if they are not HTTPS or are served from a different domain
  • A CacheStorage is provided in the worker so that you can store server responses (including headers and response body) locally, and serve them to future requests.
  • Server responses can be forged at the client side if necessary.
  • Everything is asynchronous, and most APIs return a Promise

Browser support

For now, only Chrome, Firefox and Opera have adequate support for service workers. For mobile devices, that means only Android is supported. Since features like homescreen icons and push notifications are integrated in the OS, the whole Progressive Web App initiative really depends on how enthusiastic OS vendors are about it.

Regarding service workers, Apple's attitude is:

People think they want it, some of them actually do want it. We should probably do it.9

(It seems, then, that we won't have to wait too long before service workers are available on iPhones.)

For a detailed compatibility table of all features of service workers, check out this document: Is ServiceWorker ready?

What can Service Workers do?

The ServiceWorker API provides very granular methods for developers to intercept requests and to cache and forge responses, opening the door to all kinds of interesting activities like:

  • Offline access to certain pages (an order confirmation, an e-ticket, etc)
  • Precaching assets based on predictions of the user's next actions (the predictions themselves do not rely on service workers, but cache management becomes more programmable with them; you can even introduce an expiration time or an LRU algorithm if you want)
  • Serving a cached version when it takes too long to load some resources
  • Rewriting URLs to always be requested with a canonical url10

Check the Offline Cookbook for more details about the caching strategies.

In addition, service workers are also used for arranging background communication with servers (think of it as a "service"). Features like push notifications, background sync, and the task scheduler all depend on service workers to some extent.

Service Workers in Action

Now, let's get our hands dirty and get to grips with the service worker in action.

Registration

Since service workers run in a different context, you'll need to put the code for the worker in a separate file, then register it in the client page:

if ('serviceWorker' in navigator) {
  navigator.serviceWorker.register('service-worker.js', { scope: './' }).then(function () {
    if (navigator.serviceWorker.controller) {
      console.log('The service worker is currently handling network operations.');
    } else {
      console.log('Failed to register.');
    }
  });
}

This snippet registers a service worker with the file service-worker.js. Once registered, the code in this file will be able to control all requests originating from any page within the scope parameter.

By default, the scope is the base location of the service worker script. For example, if you registered "/static/js/serviceworker.js", then the default scope would be "/static/js/". The script itself must be within the same origin as the client page, so it's not possible to serve service worker scripts with CDNs in different domains. But it is possible to override the scope to be outside of the script's base location:

navigator.serviceWorker.register('/scripts/service-worker.js', { scope: '/' })

This code enables the service worker to control all pages under the root path of the origin ({ scope: '/' }). But you'll need to add an extra response header Service-Worker-Allowed to make it work.

For instance, in an nginx configuration, it can be done like this:

server {
  listen www.example.com:443 ssl;
  ...
  location /scripts/service-worker.js {
    add_header 'Service-Worker-Allowed' '/';
  }
}

(Note that this header is added for the service worker script itself, not the page it was registered to.)

Inside the worker

Once registered, a service worker will reside in the background, intercepting all requests originating from its client pages and staying active until it is unregistered.

The script runs in a context called ServiceWorkerGlobalScope11. Several global variables and methods are available in this context:

  • clients - Information about client pages, used to claim control over them
  • registration - Represents the state of the registration
  • cache - The CacheStorage object in which you can store server responses
  • skipWaiting() - Allows the registration to progress from the waiting to the active state
  • fetch(..) - Part of the GlobalFetch API, also available in the main thread
  • importScripts(..) - Import JS scripts synchronously, ideal for loading a service worker library

The Google Chrome team has provided a nice high-level library12 to help you handle service worker tasks. It ships with a router for expressively applying common caching patterns to different resources, as well as a toolkit for precaching and namespaced cache management. It is highly recommended to use this library if you want to build something production-ready; it saves you a lot of work and is also a good start for you to get familiar with basic concepts in a ServiceWorker. Check out the recipes for example usages.

If you are really after the details, refer to MDN Service Worker API document and pay extra attention to CacheStorage13, and FetchEvent14.

Service Workers at Booking.com

At Booking.com, we are always open to new technologies, and encourage any innovation that improves customer satisfaction. We are currently working closely with the PWA advocate team from Google on applying some of the core features of Progressive Web Apps to our mobile website to see where it helps our customers.

Booking.com Home Page

Having service workers installed for users is relatively easy—you simply need them to be using a supported browser (currently this means using Chrome in Android). The real challenge, however, lies in how to introduce meaningful features while carefully measuring the impact. At Booking.com, we do every customer-facing project in A/B test experiments, and try to achieve things in the "smallest viable steps." The purpose is to ship the right things as fast as we can. Even for something as holistic as a Progressive Web App, we work in small steps in order to tackle issues one by one, and learn things quickly.

We have gathered some important learnings on this topic; what follows are the ones we think might be interesting to the general public.

Caching Strategy Examples

Booking.com confirmation page with Chrome DevTools open

The Offline Cookbook15 summarized a few caching strategies for different use cases.

  • cacheFirst - Serve cache if it exists, requests will still fire, and new responses will update the cache
  • cacheOnly - Respond with the cache only, never fire the actual request
  • networkFirst - Always try fetching from the network first and save the latest successful response into the cache, which will be served when the network fails
  • networkOnly - Never uses local cache

Let's see some examples of how to apply each of them in real life.

For static files that never change, we can safely serve them with "cacheFirst":

toolbox.router.get(/static\/(css|js|images|img)\//, toolbox.cacheFirst, {
  cache: { name: 'static-files' }
});

They seldom change and even if they do, we would've updated the URLs. One might ask, what's the use of this technique if we already set the expiration date in the headers? A service worker gives you more granular control over how much cache you want to store and when to expire them. For instance, sw-toolbox provides very easy configurations for maxEntries16 and maxAgeSeconds17.

For ordinary HTML documents, we can use "networkFirst":

toolbox.router.get(/\/(confirmation|mybooking|myreservations)/i, toolbox.networkFirst, {
  networkTimeoutSeconds: 10,
  cache: { name: 'booking-confirm' }
});

We configured the networkTimeoutSeconds parameter here. If it is acceptable to show this page to offline visitors, then it must also be acceptable to serve the cached version to users with very slow network connections and save them some waiting time. Of course, the right timeout length depends on your type of business and the typical connectivity of your users.

For requests used for user behavior data collection, you might want to use "networkOnly":

toolbox.router.any(/www.google-analytics.com/, toolbox.networkOnly);

There's no point in returning a cached response for a tracking request, right? If the request fails, it fails. If you want, you can even monitor the status of a tracking request and resend it when it fails. That wouldn't be possible if (somehow) the service worker cache kicked in.

Local Shortcuts

Wouldn't it be nice if users could save a permanent bookmark that always redirects them to the last booking confirmation they saw?

Let's add a custom handler for the confirmation page:

toolbox.router.get("/confirmations/(.*)",function(request,values,options){varurl=request.url;varpromise=toolbox.networkFirst(request,values,options);varconfirmationId=values[0];if(confirmationId){// when the request finishespromise.then(function(response){if(!response||response.status!==200)return;self.caches.open('last-confirmation').then(function(cache){// save a 302 Redirect response to "/confirmation"varredirectResponse=newResponse('Redirecting',{status:302,statusText:'Found',headers:{Location:url}});cache.put('/confirmation',redirectResponse);});});}returnpromise;},{networkTimeoutSeconds:10,cache:{name:'confirmations',}});toolbox.router.get('/confirmation',toolbox.cacheOnly,{cache:{name:'last-confirmation'}});

Each time users visit a confirmation page, we return the response as normal, with the networkFirst strategy. In addition, we forge a 302 redirect response locally, pointing to the current URL, and save this fake response in a cache storage named last-confirmation under the URL key /confirmation.

We've also added a rule in the router for this path and this cache storage, so that the next time users visit the URL "/confirmation", they will always be redirected to the last confirmation page they visited.

The forged response is put into a separate cache storage namespace and served with the cacheOnly strategy, because the URL is only valid locally and we certainly don't want to mix it with normal requests.

The Secure Domain Problem

To protect users' data, all parts of our booking process and user account management pages are served via HTTPS, under a separate domain—"secure.booking.com", instead of "www.booking.com"—the one used for public content such as the Search Results and Hotel Details page.

You can't register one service worker across two different domains, however, even if they are subdomains of the same root domain. And (for now at least) there's no way for two service workers to communicate with each other.

What if you want to pre-cache assets for secure.booking.com while users are still on www.booking.com, or the other way around? A lot of people jump between the two domains, especially when they are making a reservation. Also, with important functionality spread across different domains, a service worker for one single domain simply cannot offer an uninterrupted offline experience.

Because of this, we are unifying all basic functionalities under one domain, which will give users full HTTPS access to their whole Booking.com journey. Meanwhile, experts of the Service Worker Specs group are working on a new API called "foreign fetch"15, which will give service workers the authority to intercept any requests for resources within their scopes (as defined when they were registered). These requests may originate from any page, even one under another domain.

Final Thoughts

The ServiceWorker API targets a long-standing problem for the mobile web—connectivity. It has the potential to make user experience bearable even when connectivity is bad. It empowers modern web apps with the ability to engage users in more intimate ways, and definitely increases web apps' competitiveness over native ones.

The vision of Progressive Web Apps is appealing, but for a large-scale website moving at very high speed, you can't implement everything and ship it in one go. Constantly experimenting, learning, and improving things in small steps is the key to success.

Resources

  1. Progressive Web Apps
  2. Service Worker Spec
  3. ServiceWorker API doc on MDN
  4. Service Worker Debugging
  5. Recipes 1
  6. Recipes 2
  7. Demos by W3C web mobile group

Perl IntelliJ Plugin - Herd your code


Development and navigation in a large codebase can prove quite the challenge. A common tool when tackling such a large codebase is an Integrated Development Environment (IDE). Unfortunately, the main programming language at Booking.com, Perl, has limited IDE support. This is the story of the joint effort with the open source community to address this need of IDE support for the Perl programming language.

The Search

In the past I had worked on large projects in a Java-based environment and learned the importance and the value of a good IDE supporting you in your daily tasks, while helping you navigate your way through the endless lines of code in your projects. When I first started to work at Booking.com (about 2 years ago), I was new to Perl and began searching for an IDE for Perl.

I started with Perl oriented IDEs and moved to plugins for Eclipse and IntelliJ, but these attempts came up empty.

Most of the solutions only provided syntax highlighting while others suggested basic text-based or token-based autocompletion. None of them provided full integration in which the IDE understands the relationships between variables, subroutines, and their arguments and packages. I also wanted support for patterns like inheritance.

For a while I resorted to working with Sublime Text but even Sublime Text was simply an enhanced text editor rather than a complete IDE.

Searching for a package is a breeze with the 'Navigate to Class' feature

The Seeding Grounds for Innovation

Luckily at Booking.com we have hackathons (spanning across multiple days). They allow us to experiment with ideas and projects. They enable the business to explore various initiatives which might include adding new tools to making the workflow more effective, exploring new systems or tackling any of the issues for which you normally don't have time in your daily work. I took this opportunity to start working on a proper plugin to provide us with a full IDE for Perl.

With referencing: you can click on a package and open its file, or click on a subroutine call to go to its declaration

Starting from Scratch

While I had encountered some discussions regarding the challenges of parsing Perl or expressing it with BNF (a notation technique often used to describe the syntax of programming languages), I knew that with enough resourcefulness the majority of use cases could be covered with regular expressions.

Since I was familiar with IntelliJ's capabilities and its plugin development workflow, I decided to develop a plugin that would integrate with its custom language support and make up for the limited support it offers Perl out of the box.

One of the challenges when parsing a language (especially a dynamic one like Perl) is that it requires a lot of iterations. Every time you add one thing, it breaks two other things. However, after several hackathons and continuing to work on the plugin at home, I managed to cover most use cases.

I also added many important features like:

  • Object Oriented autocompletion for packages (including inherited subroutines from parent packages)
  • "Go to" implementation of a package by clicking on an icon in its row
  • Searching by package name
  • Syntax highlighting
  • Creation of Perl project/module and SDK selection
  • Support for running .pl files
  • Subroutines autocompletion with arguments

Scalars, subroutines (and their arguments), packages and inherited packages - all get autocompleted with the Perl IntelliJ Plugin

First Impressions

I showed the plugin around the company and several developers started using the plugin and contributing to the project. This led to a substantially increased development velocity for the plugin, which enabled adding more functionality to the project.

As time passed, I started to realise that without a full BNF there will always be some functionalities in IntelliJ that the plugin could never support (creating references between the various components of the language, syntax error checks, etc.).

I also knew that as this project grew, it would need more time and work than a single person could provide. Knowing that Booking.com supports employees sharing with and contributing to the open source community, I uploaded the project to GitHub and began to roam the IntelliJ plugin forums.

MySQL queries autocomplete databases, tables, and columns

Working with the Open Source Community

Shortly after releasing the plugin, I was contacted by Alexandr Evstigneev, a talented developer from the open source community, who had already created a similar plugin.

Since Alexandr managed to express nearly the entire Perl language in BNF in his plugin and the Booking.com plugin already contained so many features, we knew we could benefit from merging our projects. I also saw this as a great opportunity to give back to the open source community and started migrating our code to his repository, feature by feature.

After a couple of months of integration, along with more people contributing from the open source community, we got IntelliJ to work as a fully fledged Perl IDE that understands Perl as a language. Referencing, Find by Usage, going to a subroutine declaration, database integration and autocompletion, package hierarchy - these (and more) are all part of the support this plugin provides from within IntelliJ.

While Alexandr continued to work heavily on the BNF structure and adding features with other contributors that joined the GitHub project, I presented the plugin at an AmsterdamX.pm Perl Mongers group meetup for English-speaking Perl developers in Amsterdam and The Netherlands. The presentation generated great feedback and excitement that motivated us to continue our work on the plugin. You can install the plugin directly through IntelliJ's plugin menu (look for Perl5 Support).

the GitHub project of the plugin

Conclusion

Today, the project belongs to the community and is managed by Alexandr, the main contributor. The work on this project during our hackathons with the open source community has been fruitful and beneficial both for us at Booking.com and for the community as a whole.

Looking ahead, there are still more features to be added, like Mason support and client-side and server-side debugging. So if you're interested in the plugin, you're welcome to join the development effort on GitHub!

A Few Fun Tools We Built At Booking.com


Everyone who starts working at Booking.com gets the technology talk. “We don’t want to hold you back,” they say. “We don’t want to restrict your creativity. You’re free to use whatever software you need to get the job done — we’re so relaxed about your technology choices we’ll even let you choose between a MacBook and a Dell!”

So there you are, on your first day, sitting at your new desk and clutching a brand new MacBook (what? I prefer shiny things), trying to decide what to install first. Because it’s true — we don’t care what code editor you prefer, whether you’re an Outlook person or a Thunderbird person; Photoshop or Sketch, Powerpoint or Keynote, GitHub or the command line… We really don’t care. Productivity is the priority.

And for the first few weeks (maybe even the first few months), that’s great. In fact, it’s awesome. But when you’re working in a department of over 1,000 people with a technology stack that encompasses at least half-a-dozen different languages – not to mention interacting with a broader company over 10,000 strong – eventually you’re going to run up against a problem that can’t be solved with existing tools.

Where other large organizations might decide it’s time to start looking for outside vendors to plug a hole in the process, at Booking.com we live by a simple rule: See a problem, own a problem. When we find an issue, it’s time to get hacking.

A Few Small Things We Built In Our Spare Time

Like many other companies built around technology, Booking.com provides a space for developers and designers to tackle those annoying issues that don’t get attention in regular “Hackathon” sessions. For us, that means forming small teams of sympatico colleagues for a couple of days each month to scratch a communal itch – either to improve one of our existing systems or developing something that’s missing but would make everyone’s lives easier.

Some of the things we have come up with (either during hackathons or individually, because we encountered the same damn problem too many times) are relatively simple command line tools, shortcuts that let us reduce the amount of time we spend writing code to free up time so we can focus on what matters. A few examples:

Bash scripts

  • Boo, created by client-side developer Ricardo Tomasi, is a Grunt-like set of tools designed to allow us to work on code locally while testing on remote servers. It handles automatic uploads of changed files, auto-runs concatenation scripts or server restarts, and can even tell us whether a particular commit has been deployed to the live site.
  • Database engineer Simon Mudd created the queryprofiler and ps-top tools to make it easier for our DBAs to measure query performance, and to see in real time exactly what was going on in the depths of MySQL. Both have been released publicly.
  • Ever looked at a git history and wished it would just tell you who the right people are to talk to about a file? Eric Herman's owner.pl script does just that, identifying the core contributors of any file.
  • mysqly is a wrapper around the MySQL command line prompt that automatically handles user authentication and connects you to the correct cluster of database machines.

Browser Extensions

But bash scripts, no matter how complex, just scrape the surface of what’s available to a motivated engineer determined to do no more work than they absolutely have to. Browser extensions are another great option when looking to shave precious seconds from repetitive or common tasks (that foosball isn’t going to play itself, after all).

Here are some of the browser extensions and other related bits and pieces we have built for ourselves:

  • Running thousands of A/B tests across a multi-lingual website can often involve some delicate juggling of cookie variables and query parameters. Deepak Gulati's Silver Hammer was the first in a series of Chrome extensions aimed at reducing the pain of development and testing by allowing us to easily toggle tests and languages on or off, as well as to jump quickly between our development, staging and live environments. Eventually it was superseded by B.quick by Marcelo Oliveira and Renato Costa, which in addition to the original features also hooked into our server-side debugging mechanism. This made the development and testing of new features much easier.
  • For our Customer Service department, we built the Blue Phone Extension, a handy widget that not only allows our call centre staff to interact with it as if it is their physical phone, but also integrates with our backend job queue management system. (Look out for a longer write-up on this by client side developer Mat Swainson, coming to this blog in the near future.)
  • Although not strictly an extension, the B.Home replacement ‘New Tab’ page for Google Chrome, by Wesley Souza contains a summary of information relevant to the user, plus useful links to common tools and monitors.
  • Developed by Aziz Yuldoshev, Blueprint is an extension for Chrome devtools which adds the ability to toggle experiments on or off, as well as search through all available server-side variables and arrays.
  • Anyone who has ever worked on a complex enterprise-level site will not be surprised to learn that we have many hundreds, possibly thousands, of different templates. It can often be tricky to figure out where a particular piece of the page was actually generated, so Developer Eli Abramovitch whipped up a lengthy but incredibly useful JavaScript bookmarklet that allows us to click on any part of the page to find the responsible template source.

Terminal goodies

Terminal one-liners and browser extensions can achieve some extremely cool results, but sometimes they’re just not enough. That’s when we break out the mini web apps and standalone services. Some of the useful things that have resulted include:

  • Booker, an iOS app that integrates with our intranet and SAP database to let you find colleagues in the main Amsterdam offices. Blueprint maps of the building’s floors and desk layout let you pinpoint where to find the person you need, and you can also book meeting rooms from within the app.
  • Lingo is a tool for managing copy translations. As the world’s largest translator of content, we needed a flexible tool that enabled our copywriters to effectively communicate with specialists in over forty different languages… so we built one!
  • A room in a hotel might seem like a simple thing to model, but in reality there are many different moving parts — hotel policies, room policies, group sizes, child ages, facilities, prices and so on — so finding and testing every single edge case is a tough job. Give.me is our internal tool for quickly locating properties that fit a specific profile.
  • Blinchik, by Artem Tyurin and Aziz Yuldoshev, is a plugin for the Sublime text editor that groks our heavily modified template language, flagging up syntax and stylistic errors before they become a problem; developers Valerii Iatsko and Esteban Beltran also created a similar plugin for the Atom editor. (And speaking of Atom, developer Angelo Di Donato added HQL support to Atom too.)
  • Recently we experimented with Slack as an internal communication tool within the Technology department, in addition to our existing Jabber platform. During the transition, we needed a way to communicate across platforms, so senior developer Manuel Ceron built Slabber, a Jabber-to-Slack posting tool, so that no critical communications were missed due to people using the wrong chatroom to share important gif-based communication.
  • It’s not just web developers who need help. Android developer Aliaksandr Jeihala developed a plugin for Android Studio to enable our mobile developers to easily monitor the state of experimentation on that platform, integrating with our testing system to flag whether tests are currently running or need to be removed.
  • Although we have a plethora of OSX and web-based helpers, Linux and Windows users aren’t completely left out in the cold. Thomas Shone created an Ubuntu tray app named Blerty to provide handy access to tools for those users, while Dan-el Khen built fsync for Windows.
  • Since moving from CSS sprites to font icons, we discovered the need for a quick reference guide for our icon set. Designer Michel Ferreira whipped up a fancy searchable list, complete with light/dark colour scheme options and automatic copying of icon names to the clipboard.
  • The exceedingly well-named “Where Is My Table?” tool, created by Jonathan Zhang, is a search engine for databases that allows us to search for tables based on their name, columns or schema.
  • For new starters, configuring a new laptop is often an unnecessary time-suck. Ricardo Tomasi and Mauricio Wolff created Bookstrap, a Ruby-based mini-script that handles the download and installation of the most commonly used software and tools a new hire might need to get to work.

.dotfiles

Finally, a special mention has to be made of our “dotfiles” git repositories. These per-user repos contain our personal bash preferences, aliases, and any other bits and pieces we frequently use while working. The clever part? Whenever we login to any of our servers – even ones that we’ve never used before – our personal set of files is copied to the server. This provides us with an instant set of familiar scripts and shortcuts wherever we need to work. (Want to know more about this? Check out our previous blog post on the subject.)

Of course, this is not even close to a complete list. New tools and programs have been developed since I started writing this article, and I’m sure most coders at Booking.com have some sneaky secret shortcuts they’re keeping to themselves. So if your boss or company hasn’t yet seen the light and embraced the creative chaos of a regular hackathon, why not propose that you start?

(You might have noticed that we’re quite fond of the letter B when it comes to christening our creations, and that’s not even counting the numerous B.branded laptop stickers adorning every available surface, such as b.proud, b.awesome, b.epic, and so on. Coming up with a new word that adequately describes your creation and that also starts with a B can sometimes be the hardest part of the process. Hmm. There’s probably an app idea in there somewhere…)

Taming the events: How we regained petabyte-scale Hive query power


When events sizes started to matter

In the beginning, four long years ago, simply having every data set in Hadoop was good enough. We were taking baby steps into the big data world and had a limited user base migrating from other systems into the big data clusters, enabling them to execute heavy queries on large data sets in a timely manner. As the number of users and the size of the data sets grew, we were challenged by new performance problems.

The majority of those data sets were all sorts of server-generated events stored in JSON — a format which has since become the norm for its ease of use and development. We had gone the typical route followed by all NoSQL new starters: allowing our developers maximum flexibility and getting rid of nearly all schemas (which had never really existed for these events).

To make this bunch of JSON objects easier to query and process, we used Hive1, making a big partitioned table with only a few columns for UUID2, datacenter IDs, timestamps and a few other things, plus a very fat column containing the whole JSON. Using Hive was an obvious choice for us as we already had many MySQL tables imported there for the analysts to use. Putting the events in the same pool allowed for some powerful scenarios, where you could easily join anything to anything. We could also use Perl and all the business logic we wrote in it directly in Hive thanks to the TRANSFORM construct 3.

This was already billions of events per day, coming from a wide range of application types and servers, with the bulk of it produced by the web and mobile front-ends. Traffic kept growing steadily and more event types were added. Individual records became fatter, too. Why limit yourself to a few metrics when you can store everything and keep it forever? Analysts and developers could, and would, scan ranges of several weeks, or even months, in a single Hive query — and everyone was pretty happy.

It worked quite well for a long time: Hive would hide the absurdly massive resources needed for querying petabytes behind a friendly face, and nobody (except the handful of people in charge of maintaining and expanding the whole thing) would know about it. But, due to this inflation combined with the rapidly growing number of users, queries soon started getting much slower.

Cool and simple ideas

We scratched our heads for a long time. How could we make it as efficient as it used to be, without imposing a fixed schema? How could we have the tons of scripts and queries that rely on these ugly blobs still work, but regain the efficiency they once had? All the papers that we read and the experience of the big players in the industry told us one thing: you definitely need a schema for readability by analysts (at the discovery phase) and for efficient querying.

Imposing this is also at odds with how things are done at Booking.com. We always go out of our way to make everyone's job easier, and not apply restrictions, even when it implies adopting creative solutions. Some of these may occasionally make your skin crawl, but as long as they do the job, that's what matters. The approach has served us well so far.

So we had this simple idea: instead of splitting the JSONs according to some schema kept in a registry, which would have been the natural thing to do, why not... do nothing? Keep the events as they always were, and live with it. They'd still be usable by most queries, just not the heaviest ones. And for these fat queries, we would make something brand new: a faster ORC4 table, with the JSONs split according to the most common use patterns that we see in actual queries.

This means a table whose structure would be driven by the actual needs of our users. It would mutate by itself over time to fit the evolution of these needs, avoiding breaking legacy scripts (by keeping the old table) while allowing new scripts (or those that would need adjustments for performance reasons) to use the new one. A self-mutating table, as silly as it sounds.

What our users were looking for

Getting a full list of all the queries that are run on a cluster wasn't too difficult. We have loads of monitoring scripts, some of which collect data about individual jobs and put them in databases with columns having information about:

  • user ID
  • user name
  • job name
  • number of maps and reducers
  • resources used in CPU and MB-seconds
  • the query (in the case of Hive jobs)
  • (...)

YARN archives most of this on job completion in one XML file per job, so that's easy enough5, and joining this with the extra information we needed wasn't difficult.

Once we had this covered, and true to our usual ways, we went for the simplest thing that could possibly work: figure out what was usually fetched inside the JSONs by extracting all the get_json* calls (including calls to a faster custom JSON UDF that we've developed) contained in the query strings. For this we used some Perl text processing modules (like Text::Balanced) that are part of the usual junk-processing arsenal. Plugging something into Hive's query parser would have been marginally neater, but the Perl-based solution got the job done in no time.

The extracted data sets gave us a long list of JSON keys in dot notation, like "foo.bar.baz", telling us what people were looking for in these events. We had some constraints in mind for the exercise: force some keys that we knew we'd need later to be counted as "seen" even though they were not present in the queries (i.e. a whitelist), and prevent too much fragmentation (foo.bar.baz1 and foo.bar.baz2 should be seen as foo.bar and kept together as a "leaf" JSON fragment). We made a script that does just that: it checks what people needed and when they needed it, and keeps the resulting list of keys, with timestamps telling when the corresponding query ran, in a database table.
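
To give a flavour of the key extraction, here is a rough sketch in Python. The real implementation is Perl built on Text::Balanced and also handles our custom UDF and nested expressions, so treat the regex and helper below as hypothetical simplifications:

import re
from collections import defaultdict

# Hypothetical, simplified pattern: grab the JSON path passed to get_json_object()
# and friends, e.g. get_json_object(json, '$.foo.bar.baz1') -> 'foo.bar.baz1'
GET_JSON_RE = re.compile(r"get_json\w*\s*\(\s*[^,]+,\s*'\$?\.?([\w.]+)'", re.IGNORECASE)

def extract_keys(query, max_depth=2):
    keys = defaultdict(int)
    for path in GET_JSON_RE.findall(query):
        # Prevent fragmentation: foo.bar.baz1 and foo.bar.baz2 both count as foo.bar
        truncated = '.'.join(path.split('.')[:max_depth])
        keys[truncated] += 1
    return dict(keys)

print(extract_keys("SELECT get_json_object(json, '$.foo.bar.baz1') FROM raw_events"))
# {'foo.bar': 1}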

We'd give them everything they wanted

We needed to run something continuously on the stream of real events to:

  1. Figure out which of those keys would return actual results, because everyone makes mistakes in their queries from time to time and it's easy to mistype "bar" as "baz". If we only trusted the queries without checking what they returned, we'd soon find ourselves populating many columns with NULLs and making the schema bloated and unusable.
  2. Figure out what data type the returned data could fit in once ingested in Hive, in order to specify the most efficient storage format for the destination columns at the splitting phase.
  3. Create destination partitions with the proper columns and types, cleanup the schema when needed, and insert the split JSONs into the destination.

A handful of simple tricks

The script that we wrote does all of this in one pass.

  1. Obtain the keys from step #1, append them to the key log table in MySQL and merge the set with the keys found during the previous runs.
  2. Run a Hive query that splits the JSON blobs for the processed hour into tiny fragments, according to this list of keys. Then, for every fragment of actual data, figure out which data type it would best fit into (about 8 types, from TINYINT to TEXT). This is done using a special UDF that does both the JSON splitting and the data type check in one go, outputting two columns for each fragment: the data, and its guessed type. As should be obvious, the guessed data type for a specific key depends on which JSON object is being analysed, so it's really important that this is done for all JSONs and all keys.
  3. Out of the temporary table we just created in the previous step, extract the "fattest" data type that was found for every destination data column / JSON key. This will be the data type for our destination column. This also allows us to detect which keys are just mistakes, as they returned no data at all and their detected data type is NULL. To figure this all out we run a simple auto-generated aggregate query on the datatype columns (a small sketch of this type-widening idea follows the list).
  4. Use this information to modify the destination table's schema, stored in ORC format, on the fly. One of the nice things about ORC is that partitions of a table don't have to all use the exact same schema: the column order has to stay the same, but the data types can vary. When creating the partition and inserting, the data type specified in the table definition is used, so what is basically text will be properly converted to numeric or other types and stored accordingly. But when reading, the partition definition (not the table) becomes authoritative. This means that in a single query hitting several partitions, the same column can provide different data types, but they will be properly cast at runtime. Which means you can have for instance INTs of various widths in the same column spread over several partitions, starting with TINYINT when you began gathering data, all the way up to BIGINT as your IDs increased. And it will just work.
  5. Finally, populate the destination partition with the contents of the temporary table, minus the data type columns.
  6. Update the table definition with per-column comments in JSON format, recording meta information like the first and last time some actual data was seen for the column. This allows us to sort-of nullify columns that are not used anymore. Since columns cannot be dropped, we do it this way so we can reuse them for other keys once they have been empty for longer than the predefined time window. This is efficient in terms of storage space and keeps the schema slim.
  7. Drop partitions older than the time window. Done.
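
Step 3 above boils down to picking the widest ("fattest") type observed for each key. Here is a minimal illustration of the idea in Python; the widening order and helper below are hypothetical simplifications for this post, not the actual Hive UDF or aggregate query:

# Simplified widening order, narrowest to widest (an assumption for illustration)
WIDTH = {None: 0, 'TINYINT': 1, 'SMALLINT': 2, 'INT': 3, 'BIGINT': 4,
         'FLOAT': 5, 'DOUBLE': 6, 'TEXT': 7}

def fattest_type(guessed_types):
    # A result of None means the key never matched any data: a typo in someone's query
    return max(guessed_types, key=lambda t: WIDTH[t])

print(fattest_type(['TINYINT', 'TINYINT', 'INT']))  # INT
print(fattest_type([None, None]))                   # None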

Was it worth it?

This is the stage where you start questioning the sanity of the whole enterprise. After all, we found ourselves building a table that mutates by itself on demand (since it only uses information that users provide through their queries), and there ought to be a million ways this could fail horribly. You usually carefully craft ALTER TABLE DDL statements, rather than leaving them to a script without direct supervision, right?

As it turns out, it seems to work quite well. Besides, this is "only" derived data, and we could always go back to the source if something went wrong.

The script that orchestrates all this is only a handful of lines long, including loads of comments related to some of ORC's youth issues (i.e. bugs) that prevented us from making it even niftier, and the logic is pretty straightforward.

Let's have a look at a query example; this is the type of scan on a partition that gets run routinely. Before the new table appeared, the query would look like this. It uses event_parser, a UDF we wrote to make processing of our JSONs easier and faster. The syntax is even uglier than that of a query using stock get_json_object(), but it runs quicker:

CREATE TEMPORARY FUNCTION event_parser AS 'com.booking.hive.udtf.EventParser';

CREATE TABLE mydb.sample_raw AS
SELECT event.epoch,
       response.body_parameters,
       response.experiment,
       response.language
FROM default.raw_events AS event
LATERAL VIEW event_parser(event.json,
                          'http.body_parameters',
                          'action.experiment',
                          'action.language',
                          'action.name') response
             AS body_parameters, experiment, language, action_name
WHERE yyyy_mm_dd = '2016-04-18'
  AND hh = 21
  AND is_frontend = 1
  AND action_name = 'explorer';

Here's the syntax for running the same query on the new table, undoubtedly a whole lot better:

CREATE TABLE mydb.sample_flat AS
SELECT epoch,
       `http/body_parameters`,
       `action/experiment`,
       `action/language`
FROM default.events_flat_web
WHERE yyyy_mm_dd = '2016-04-18'
  AND hh = 21
  AND is_frontend = 1
  AND `action/name` = 'explorer';

In terms of storage, removing the lesser used data sets and switching to ORC4+ZLIB instead of RCFile6+Snappy brought the volume down by 60%. For the hourly partition we're querying here, it means going from nearly 500GB to less than 200GB. But this is big data, and size doesn't matter that much, as usual (unless you're the one buying the hard drives). What about performance, then?

Here is the CPU time for the old version (the result is a table of nearly 100 million rows):

1 days 1 hours 25 minutes 10 seconds 20 msec

And for the new version:

0 days 2 hours 41 minutes 12 seconds 420 msec

Which is 9,672 seconds vs 91,510, close to a 90% drop. What are we going to do with all these free CPU cycles on our hands?

We may want to keep it

Experimenting was fun, but the only thing that really matters is the benefits for our users and infrastructure. In short:

  • Query runtimes were divided by about four
  • CPU time was divided by up to ten
  • Queries are easier to write and read, and more expressive

Which makes it a nice hack, but first and foremost a successful experiment.

The funny thing is, this was initially meant to solve a short-term problem for a few queries, hence the limited time window to save storage space. But it turns out our users liked the new table so much they didn't want us to drop any old partitions, and started using the new table as a full replacement for the old one.

As has happened many times before, what started its life as a quick hack (in search of a better solution) is going to be sticking around quite a bit longer.


  1. Hive is a SQL-like interface for writing MapReduce Jobs on Hadoop clusters. 

  2. See the definition of Universally unique identifier 

  3. TRANSFORM is possibly one of the neatest things in Hive, allowing you to plug arbitrary code written in any language into an HQL query. This effectively makes Hive a Hadoop map-reduce platform accessible to any language, in a handier way than plain Hadoop Streaming. More information on Hive transform can be found in the Hive Wiki

  4. Hive ORC Manual explains the implementation and usage details of the file format. 

  5. The jobs statistics and info are not retrieved from the YARN History Server's REST API as you'd normally expect, as it turned out this front-end was unable to cope with multi-thousand-container jobs. Querying the History Server for such jobs made it so unresponsive it couldn't do anything else, including archiving completed jobs data, which turned it into a single point of failure for the whole cluster. Yet another case of Hadoop fun. 

  6. See the definition of Record Columnar File 

k-Nearest Neighbours: From slow to fast thanks to maths


Abstract: Building the best travel experience for our customers in Booking.com often involves solving very challenging problems. One that appears very frequently is the $k$-Nearest Neighbours problem ($k$-NN). In simple words it can be stated as follows: Given a thing, find the $k$ most similar things. Depending on how thing and similar are defined, the problem becomes more or less complex. In this post we’ll treat the case where the things are represented by vectors, and the similarity of two things is defined by their angle. We’ll discuss solutions and present a practical trick to make it fast, scalable, and simple. All of it, thanks to maths.

1. Things and Similarity

Suppose that we are given a database with pictures of handwritten digits, and the task of finding similar digit handwriting. Every picture contains exactly one digit, but we don't know which one. For every given picture we want to find other pictures that contain the same digit written with a similar style.

First, we need to find a representation of the pictures that we can operate on. In this case it will simply be a vector: its length d is the number of pixels in the picture, and its components are the RGB values of the pixels. This representation has the advantage of working both at the computational level and at the mathematical level.

Second, we need to define similarity. Since the pictures are represented by vectors, we can compute the angle between any two of them; if the angle is small, the vectors point in the same direction and the pictures are similar. On the other hand, if the angle is big, the vectors diverge from each other and the pictures are not similar. More formally, the similarity between two pictures represented by vectors X and Y is given by:

\begin{equation} \label{eq:sim} sim(X, Y) = cos(X, Y) = \dfrac{\langle X, Y \rangle}{||X|| ||Y||} = \dfrac{\sum_i^d{X_i Y_i}}{\sqrt{\sum_i^d{X_i^2}}\sqrt{\sum_i^d{Y_i^2}}} \text{ (1)} \end{equation}

This quantity has a very nice property: It will never be more than 1, or less than -1. Given two vectors, if their similarity is close to 1 then they are very similar; if it is close to -1 they are completely different. 0 is in the middle – not very similar, but not completely different either.

Let’s see this in action:

Fig. 1: Two very similar 4s, their similarity according to equation 1 is 0.85

Fig. 2: Two very different 4s, their similarity according to equation 1 is 0.11

Figure 1 shows a graphical representation of the vectors computed from two handwritten digit pictures. These two digits are very similar, and when their similarity is computed by their angle using equation 1 it gives 0.85, which is, accordingly, quite high. On the other hand, Figure 2 shows two quite different numbers; this time their similarity is 0.11, which is quite low but still positive – even though the writing style is very different, both pictures are still a 4.

These pictures were handpicked to illustrate the vector representation and the cosine similarity. We now move on to an algorithm that finds similarly handwritten digits.

2. A simple solution

Now that we know how to represent things and compute their similarity, let’s solve the k-NN problem:

  1. For every picture in our database, compute its associated vector.
  2. When the $k$-Nearest Neighbours for a picture are requested, compute its similarity to every other picture in the database.
  3. Sort the pictures by ascending similarity.
  4. Return the last $k$ elements.

This is a very good solution (especially because it works). Figure 3 shows this algorithm in action. Every row shows the top 9 most similar pictures to the first picture in the row. The first row captures very rounded 3s, the second inclined 3s, the fourth row shows 2s with a loop at the bottom, and the fifth row shows Z-like 2s. Notice that this algorithm has no information about which digit is in the picture (nor, for that matter, about what kind of things the picture contains), yet it succeeds in grouping by digit, and even by typographic style.
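
For concreteness, here is a minimal sketch of this brute-force search in Python, assuming the pictures have already been flattened into NumPy vectors:

import numpy as np

def cosine_similarity(x, y):
    # Equation (1): inner product divided by the product of the norms
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def knn_bruteforce(query, database, k):
    # Return the indices of the k pictures most similar to the query
    similarities = [cosine_similarity(query, candidate) for candidate in database]
    # Sort by ascending similarity and keep the last k elements
    return np.argsort(similarities)[-k:]

# Usage: 10,000 fake "pictures" of 256 pixels each
database = np.random.rand(10000, 256)
neighbours = knn_bruteforce(database[0], database, k=9)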

But let’s take a closer look by analysing its computational complexity. Consider n pictures with d pixels each:

  1. Computing a feature vector is $O(d)$. As this is done for every picture, the first step is $O(nd)$.
  2. The similarity function is $O(d)$, and again this happens $n$ times, then the second step is also $O(nd)$.
  3. The third step – sort $n$ elements – is $O(n\log n)$.
  4. Finally returning the last $k$ elements could be constant, but let’s consider it $O(k)$.

In total we get:

\begin{equation} O(nd) + O(nd) + O(n\log n) + O(k) \text{ (2)} \end{equation}

$k$ is very small compared to $n$ and $d$, so we can neglect the last term. We can also collapse the first two terms into a single $O(nd)$. Now, note that $\log n$ is much smaller than $d$: for example, if we have 10 million pictures with 256 ($d$) pixels each, $\log n$ is about 7, much smaller than $d$. That means we can replace the $O(n\log n)$ term with another $O(nd)$. Therefore the computational complexity of this algorithm is $O(nd)$, that is, linear in both the total number of images in our database and the number of pixels per picture.

Fig 3: k nearest neighbours for some pictures. Every row depicts the top 9 most similar pictures to the first picture in the row

3. Can we do better?

Whenever we do a computational complexity analysis, it is natural to ask whether we can do better. So let's try.

One idea to consider is to use a heap to keep track of the $k$ most similar items. The heap would never be larger than $k$, so every insertion involves $O(\log k)$ similarity computations ($O(d)$ each), making an insertion $O(d\log k)$. Since there will be $n$ insertions, in total we get $O(nd\log k)$, which is not an improvement.

We could also try to exploit the fact that we do not need to sort the $n$ elements, just to get the top $k$. The algorithm would be exactly the same, but step 3 would apply quick-select instead of sorting. This changes the $O(n\log n)$ term to $O(n)$, which gives $O(nd) + O(n)$, which is $O(nd)$; again, not an improvement.

The last idea we will consider is to use a Space Partitioning Tree (SPT). An SPT is a data structure that allows us to find the closest object to another object in logarithmic time. A priori this seems to be the right solution but there is a problem: SPTs can only operate under certain distance functions, specifically metric distances.

SPTs work with distances, not with similarities. But there is a very close relationship between similarity and distance. In the context of $k$-NN, for every similarity function there exists a distance function such that searching for the $k$ most similar items is equivalent to searching for the $k$ closest items using that distance function. Just multiplying the similarity by −1 gives such a distance. So now we have a cosine distance that we could use in an SPT, but unfortunately this cosine distance is not a metric distance.

A metric distance is a distance that satisfies the following conditions:

  1. $distance(x, y) \geq 0$
  2. $distance(x, y)=0 \iff x=y$
  3. $distance(x, y)=distance(y, x)$
  4. $distance(x, z)\leq distance(x, y)+ distance(y, z)$

Cosine distance clearly violates the first condition, but this is easy to fix by just adding 1. The second and third conditions are met. Finally, the fourth condition (the triangle inequality) is violated, and this time we cannot fix it. Here is an example of 3 vectors that violate the fourth condition:

$x=(1,0), y=(1,1), z=(0,1)$. Then:

$distance(x,y)=1-\dfrac{\langle x, y \rangle}{||x||||y||} = 1-\dfrac{1}{\sqrt{2}}$

$distance(y, z)=1-\dfrac{\langle y, z \rangle}{||y||||z||} = 1-\dfrac{1}{\sqrt{2}}$

$distance(x, z)=1-\dfrac{\langle x, z \rangle}{||x||||z||} = 1$

$distance(x,y)+distance(y, z)= 1-\dfrac{1}{\sqrt{2}} + 1-\dfrac{1}{\sqrt{2}} = 2-\dfrac{2}{\sqrt{2}} \approx 0.59$

And then:

$distance(x, z) \not \leq distance(x,y) + distance(y,z)$

which proves that condition 4 is not met.
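A quick numeric check of this counterexample (illustrative code, not from the original post):

import numpy as np

def cosine_distance(a, b):
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

x, y, z = np.array([1.0, 0.0]), np.array([1.0, 1.0]), np.array([0.0, 1.0])
print(cosine_distance(x, y) + cosine_distance(y, z))  # ~0.59
print(cosine_distance(x, z))                          # 1.0, which is larger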

In the following sections we are going to show a trick to overcome this limitation.

4. Maths

Let’s introduce some properties of vectors that we’ll exploit later.

Cosine distance is invariant under Normalization

First, let’s make a few definitions:

  1. Norm of a vector $X$: Denoted by $\lVert X \rVert$ and computed as $\sqrt{\sum_i^d X_i^2}$. This is exactly the geometric length of the vector.
  2. Normalization of vector: Denoted by $\hat X$ and computed as $\dfrac{X}{\lVert X \rVert}$ which is just dividing every component of the vector $X$ by the norm of $X$.

A consequence of these two definitions is the following:

\begin{equation} \lVert \hat X \rVert =1 \text{ (3)} \end{equation}

This says that the norm of a normalized vector is always 1. The property is quite obvious, but here is a proof:

$\lVert\hat X \rVert^2=\sum_i^d \hat{X_i}^2 ={\sum_i^d {\dfrac{X_i^2}{\lVert X \rVert^2}}} = {\dfrac{1}{{\lVert X \rVert}^2}\sum_i^d X_i^2 } =\dfrac{1}{\lVert X \rVert^2}\lVert X \rVert^2 = 1 $

And since $\lVert\hat X \rVert$ must be positive, we conclude that $\lVert \hat X \rVert =1$.

Another consequence is the following:

\begin{equation} \label{eq:invariant} cos(\hat X, \hat Y) = cos(X, Y) \text{ (4)} \end{equation}

In words, this means that the angle between two vectors doesn't change when the vectors are normalized. Normalization only changes the length (the norm) of the vector, not its direction, and therefore the angle is preserved. Again, here is the proof:

$cos(\hat X, \hat Y) = \dfrac{\langle \hat X, \hat Y \rangle}{\lVert \hat X \rVert \lVert \hat Y \rVert} = \langle \hat X, \hat Y \rangle = \langle \dfrac{X}{\lVert X \rVert}, \dfrac{Y}{\lVert Y \rVert} \rangle = \sum_i^d \dfrac{X_i}{\lVert X \rVert}\dfrac{Y_i}{\lVert Y \rVert} = \dfrac{\sum_i^d X_i Y_i}{\lVert X \rVert \lVert Y \rVert} = \dfrac{\langle X, Y \rangle}{\lVert X \rVert \lVert Y \rVert} = cos(X, Y)$

Now since $cos(\hat X, \hat Y) = cos(X, Y)$ then also $1-cos(\hat X, \hat Y) = 1-cos(X, Y)$, which means that the cosine distance between $X$ and $Y$ is exactly the same as the cosine distance between the normalized version of $X$ and $Y$.

From Euclidean to Cosine

The second property we need for this trick is the following:

If $X$ and $Y$ are vectors with norm 1 (unit vectors) then: \begin{equation} \label{eq:euclideaniscosine} \lVert X-Y \rVert = \sqrt{2-2cos(X,Y)} \text{ (5)} \end{equation}

This states that if $X$ and $Y$ are unit vectors then there is an exact relationship between the euclidean distance from $X$ to $Y$ and the angle between them.

The proof:

\begin{equation} {\lVert X-Y \rVert}^2 = X^2 - 2\langle X, Y \rangle + Y^2 \text{ (6)} \end{equation}

Since $X$ is a unit vector then $X^2 = \langle X,X \rangle = \sum_i^d X_i X_i = {\lVert X \rVert}^2 = 1$, likewise, $Y^2=1$ then we have: \begin{equation} {\lVert X-Y \rVert}^2 = X^2 - 2\langle X, Y \rangle + Y^2 = 2-2\langle X, Y \rangle \text{ (7)} \end{equation}

And since $X$ and $Y$ are unit vectors, dividing by their norm is dividing by one:

\begin{equation} {\lVert X-Y \rVert}^2 = X^2 - 2\langle X, Y \rangle + Y^2 = 2-2\langle X, Y \rangle = 2-2\dfrac{\langle X, Y \rangle}{\lVert X \rVert \lVert Y \rVert} = 2-2cos(X,Y) \text{ (8)} \end{equation}
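Equation 5 is easy to verify numerically. The sketch below uses random vectors (the dimension and seed are arbitrary choices, not from the post):

import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=256), rng.normal(size=256)
x_hat, y_hat = x / np.linalg.norm(x), y / np.linalg.norm(y)

cos_xy = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(np.linalg.norm(x_hat - y_hat))  # Euclidean distance between the unit vectors
print(np.sqrt(2 - 2 * cos_xy))        # the same value, up to floating point error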

Cosine ranking is equivalent to Euclidean ranking

By looking at equation 5 we can already see that if $1-cos(X,Y)$ is bigger, then $\lVert X-Y \rVert$ must be bigger. That means that if $Y$ gets away from $X$ in the euclidean space, it also does in the cosine space, provided both $X$ and $Y$ are unit vectors. This allows us to establish the following:

Consider three arbitrary $d$-dimensional vectors $X$, $A$ and $B$ (they don't need to be unit vectors). Then the following holds: \begin{equation} \textit{if } 1-cos(X,A)<1-cos(X,B) \textit{ then } \lVert \hat X- \hat A \rVert < \lVert \hat X- \hat B \rVert \text{ (9)} \end{equation}

This equation says that if the cosine distance between $X$ and $A$ is less than the cosine distance between $X$ and $B$ then the euclidean distance between $X$ and $A$ is also less than the euclidean distance between $X$ and $B$. In other words, if $A$ is closer to $X$ than $B$ in the cosine space, it is also closer in the euclidean space.

The proof: We start from the left hand side expression and apply operations to get to the right hand side expression.

\begin{equation} 1-cos(X,A)<1-cos(X,B) \implies 1-cos(\hat X,\hat A)<1-cos(\hat X,\hat B) \text{ (10)} \end{equation} cosine is invariant under normalization (see equation 4)

\begin{equation} 1-cos(\hat X,\hat A)<1-cos(\hat X,\hat B) \implies \sqrt{2-2cos(\hat X,\hat A)} < \sqrt{2-2cos(\hat X,\hat B)} \text{ (11)} \end{equation} doubling both sides and taking the square root preserves the inequality

\begin{equation} \sqrt{2-2cos(\hat X,\hat A)} < \sqrt{2-2cos(\hat X,\hat B)} \implies \lVert \hat X- \hat A \rVert < \lVert \hat X- \hat B \rVert \text{ (12)} \end{equation} normalized vectors are unit vectors and equation 5

This is all the maths we need to apply the trick. Let's see what it is about.

5. The $k$-NN Trick

The goal of this trick is to find a way to use cosine similarity with a Space Partitioning Tree, which would give us $O(\log n)$ time complexity, a huge improvement.

The idea is actually very simple. Since cosine similarity is invariant under normalization, we can just normalize all our feature vectors and the $k$ nearest neighbours to X will be exactly the same. But now our vectors are all unit vectors, which means that sorting them by cosine distance to X is exactly the same as sorting them by Euclidean distance to X. And since Euclidean distance is a proper metric, we can use a Space Partitioning Tree and enjoy the logarithm of n. Here's the recipe (a code sketch follows it):

  1. Normalize all the feature vectors in the database
  2. Build a Space Partitioning Tree using the normalized vectors
  3. When the $k$ nearest neighbours to an input vector $X$ are requested:
    1. Normalize $X$
    2. Look up the $k$-NN in the Space Partitioning Tree
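A minimal sketch of this recipe, assuming scikit-learn's BallTree as the Space Partitioning Tree (the post does not prescribe a particular library, so the API below is an assumption):

import numpy as np
from sklearn.neighbors import BallTree

def build_index(database):
    # Steps 1 and 2: normalize every feature vector, then build the tree
    # (BallTree uses the Euclidean metric by default).
    normalized = database / np.linalg.norm(database, axis=1, keepdims=True)
    return BallTree(normalized)

def knn_trick(tree, x, k):
    # Step 3: normalize the query and look up the k nearest neighbours.
    x_hat = x / np.linalg.norm(x)
    _, indices = tree.query(x_hat.reshape(1, -1), k=k)
    return indices[0]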

6. Experiments

Experimentation is at the core of Product Development at Booking.com. Every idea is welcomed, turned into a hypothesis and validated through experimentation. And Data Science doesn’t escape that process.

In this case, the idea has been thoroughly described and supported with practical examples and even maths. But let's see if reality agrees with our understanding. Our hypothesis is the following: we can improve the response time of the algorithm described in section 2 by applying the trick described in section 5, while guaranteeing exactly the same results.

To test this hypothesis we designed an experiment that compares the time needed to solve the $k$-NN problem using the full scan solution with the time needed by the $k$-NN trick solution. The $k$-NN trick is implemented using two different Space Partitioning Trees: Ball Tree, and KD-Tree.

The database consists of handwritten digit pictures from MNIST. For each n ranging from 5000 to 40000 we randomly sampled n pictures from the original database, then applied the different solutions to the same sample, computing the 10 most similar pictures for 20 input pictures.
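The original benchmark code is not published in the post; a rough sketch of how such a comparison could be set up with scikit-learn is shown below. The MNIST loader, sample sizes, and timing approach are assumptions, not the experiment's actual implementation.

import time
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.neighbors import NearestNeighbors

X = fetch_openml("mnist_784", version=1, as_frame=False).data.astype(np.float64)
X_hat = X / np.linalg.norm(X, axis=1, keepdims=True)  # normalize once
rng = np.random.default_rng(0)

for n in (5000, 10000, 20000, 40000):
    sample = X_hat[rng.choice(len(X_hat), n, replace=False)]
    queries = sample[:20]
    for algorithm in ("brute", "ball_tree", "kd_tree"):
        # On normalized vectors, Euclidean ranking equals cosine ranking (section 4).
        nn = NearestNeighbors(n_neighbors=10, algorithm=algorithm).fit(sample)
        start = time.time()
        nn.kneighbors(queries)
        print(n, algorithm, round(time.time() - start, 3))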

7. Results

The results of our experiment are summarized by Figure 4:

Fig 4: Comparison of the full scan solution (brute force) and the $k$-NN trick (norm euclidean ball tree, and kd tree) for different database sizes n

From the chart we can draw several conclusions. First, the time complexity of the full scan solution is indeed linear in n, as suggested by the blue dots; this confirms the theoretical analysis in section 2. Second, although it is hard to say whether the $k$-NN trick based solution is logarithmic, it is clearly much better than the full scan, as suggested by the green and red dots. Third, the Ball Tree based solution is better than the KD-Tree based solution, though the reason for this is not clear and requires further analysis and experimentation. Overall, the experiment strongly supports the hypothesis.

8. The Trap

Every trick sets up a trap, and every gain in one aspect hides a loss in another. Being aware of these traps is key to successfully applying these tricks. Let's see what trap the $k$-NN trick sets, or, in more technical words, what kind of trade-off we are dealing with.

In the simple solution, before being able to answer a $k$-NN query all we need to do is compute the feature vectors of each object in the database. When using the trick, on the other hand, before we are able to answer a query we need not only to compute the feature vectors but also to build the Space Partitioning Tree. In the experiment we ran, we also recorded the time it takes to become able to answer queries. The results are displayed in Figure 5 and show that the trick-based solutions scale much worse than the simple solution. This means that when using the trick we are trading query response time for start-up time.

This trade-off must be considered carefully, and for big databases it can have very negative consequences. Consider an e-commerce website that goes down for whatever reason, and imagine that this site uses $k$-NN to serve some recommendations (a very important yet not critical part of the system). As soon as we fix the problem, we want the system to reboot as quickly as possible, but if the booting process depends on the $k$-NN system we fall into the trap: users won't be able to purchase anything until our Space Partitioning Tree is built. Not good.

This can be easily solved by breaking the dependence using parallel or asynchronous processes to boot different parts of the system, but the simple solution is clearly more robust in this instance, up to a point where we don’t even need to care. The $k$-NN trick forces us to consider this very carefully and act properly. For many applications, this isn’t a bad price to pay for the speed and scalability we get at query time.

9. Conclusion

In this post we described a trick to speed up the standard solution to the $k$-NN problem with cosine similarity. The mathematical rationale for the trick was presented, as well as experiments that prove its validity. We consider this a good example of a scalability problem overcome by applying elementary maths. This is also a good example of Reductionism: the trick is a reduction from the cosine similarity $k$-NN problem to the Euclidean distance $k$-NN problem, which is a much better studied and solved problem. Maths and Reductionism are two concepts sitting at the core of applied Data Science at Booking.com, always at the service of the best travelling experience.

Fig 5: Ready-time comparison of the full scan solution and the $k$-NN trick for different database sizes n

Evaluating MySQL Parallel Replication Part 4, Annex: Under the Hood


This is the annex to Evaluating MySQL Parallel Replication Part 4: More Benchmarks in Production.

There is no introduction or conclusion to this post, only landing sections: reading this post without its context will probably be very hard. You should start with the main post and come back here for more details.

Implementation Details of MariaDB Optimistic Parallel Replication

Rollbacks and Retries

When transactions T1 to T20 are run concurrently by optimistic parallel replication, if T19 blocks T2, T19 will be killed (rolled back) to unblock T2 (T2 must commit before T19). In the current implementation, T19 will be retried once T18 has completed. It looks like this could be optimized.

I thought that retrying T19 as soon as T2 completes could improve optimistic replication speed. Kristian Nielsen, the implementer of parallel replication in MariaDB, was kind enough to implement a patch with more aggressive retries. However, with quicker retries, I got slower results than with delayed retries. So it looks like once a conflict is detected (T19 blocks T2), the probability of another conflict is high, and the gain from retrying T19 earlier is outweighed by the cost of further rollbacks of T19.

DML vs DDL and Non-Transactional Storage Engines

The assumption for optimistic parallel replication to work is that a transaction that causes a conflict can be killed and retried. This is the case for InnoDB DML (Data Manipulation Language: INSERT, UPDATE, DELETE, ...) but it is not the case with MyISAM.

As a transaction involving a MyISAM table (or another non-transactional storage engine) cannot be rolled back, it is not safe to run such transactions optimistically. When such a transaction enters the optimistic parallel replication pipeline, the replication applier will wait for all previous transactions to complete before starting the transaction that cannot be rolled back. The following transactions can still be run optimistically if they exclusively use a transactional storage engine (if they can be rolled back). This means that DML that cannot be rolled back acts as a pre-barrier in the parallel replication pipeline.

In MariaDB, DDL (Data Definition Language: [CREATE | ALTER | TRUNCATE | DROP | ...] TABLE, ...) is also (still) impossible to roll back, so it will also act as a pre-barrier in the parallel replication pipeline. Moreover, DDL also prevents all following transactions from being applied optimistically, because a DML is not safe to run at the same time as a DDL on the same table. So not only does DDL act as a pre-barrier, it also acts as a post-barrier.

Different Optimistic Parallel Replication Modes

MariaDB 10.1 optimistic parallel replication can be run in two slave_parallel_mode settings: optimistic and aggressive. In the optimistic mode, some heuristics are used to avoid needless conflicts. In the aggressive mode, those heuristics are disabled.

One of the heuristics of the optimistic mode is the following: if a transaction executed a row-lock wait on the master, it will not be run in parallel on the slave. The behavior is unclear when intermediate masters are used:

  • An intermediate master with slave_parallel_mode=none (single threaded) will not have any row-lock waits. So it looks like for a slave of such an intermediate master, the optimistic mode would behave the same way as the aggressive mode.
  • An intermediate master with slave_parallel_mode=minimal (slave group committing) will have a row-lock wait for each group commit. So it looks like for a slave of such an intermediate master, the optimistic mode would behave the same as the conservative mode.
  • An intermediate master with slave_parallel_mode=conservative should generate very few row-lock waits (only for conflicts that will generate retries). So it looks like for a slave of such an intermediate master, the optimistic mode will behave mostly the same as the aggressive mode.
  • The number of row-lock waits is hard to predict on an intermediate master in optimistic or aggressive mode. So the behavior of a slave of such an intermediate master is hard to predict.

As we are doing tests on a slave of an intermediate master, the optimistic mode is not very interesting to test. It would generate results similar to the aggressive mode if the intermediate master was running in single-threaded or conservative mode, or similar to the conservative mode if the intermediate master was running in minimal mode. Without a true master running MariaDB 10.1, the only tests that we think make sense are with slave_parallel_mode=aggressive.

This is a good opportunity to remind you that intermediate masters are bad for parallel replication. As shown in Part 1, intermediate masters do a poor job of transmitting parallelism information from their master to their slaves. The solution presented in the previous post still applies: use Binlog Servers.

Environments

As in the previous posts (Part 1, Part 2 and Part 3), we are using the same four environments. Each environment is composed of five servers. For slave_parallel_mode=none and slave_parallel_mode=conservative, only four of the five servers are needed and are organized as below:

+---+     +---+     +---+     +---+
| A | --> | B | --> | C | --> | D |
+---+     +---+     +---+     +---+

The A to C servers are strictly the same as before. The D server has the same hardware specification as before but it is now running MariaDB 10.1.8 [1]. This means that the conservative results will use the same parallelism information (group commit) as for the tests from Part 3 (we are re-using the same binary logs as the previous tests).

For optimistic parallel replication to work, a MariaDB 10.1 slave must be connected to a MariaDB 10.1 master [2], hence the introduction of a fifth (E) server. For slave_parallel_mode=aggressive, D is replicating from E as shown below:

+---+     +---+     +---+     +---+     +---+
| A | --> | B | --> | C | --> | E | --> | D |
+---+     +---+     +---+     +---+     +---+

The hardware specifications of E are not important because it is only serving binary logs. It was built as a clone of D that was upgraded to MariaDB 10.1. Replication was then started from C with slave_parallel_mode=none. This way, we produced 10.1 binary logs so slave_parallel_mode=aggressive will work on D.

The full test methodology is the same as for the previous tests and can be found in Part 3. The server and database configurations are mostly the same as in the previous tests with the following modifications:

Property                | E1             | E2             | E3              | E4
InnoDB Buffer Pool Size | 71 GB          | 162 GB         | 71 (from 76) GB | 57 (from 76) GB
InnoDB Log Size         | 64 (from 1) GB | 32 (from 1) GB | 16 (from 4) GB  | 16 (from 1) GB

The motivations for the above changes are the following:

  • The InnoDB Buffer Pool Size was reduced for E3 and E4 because we were missing RAM to increase slave_parallel_threads to the number we wanted to test (more threads need more available RAM).
  • The InnoDB Log Size was increased because checkpointing was a bottleneck during our tests [3].

Results

In the main post, speedup graphs are presented for each of the four environments. Here, the underlying data for those graphs is presented.

The SB, HD and ND notations are explained in the main post.

The first line of the table below shows the time taken for the single-threaded execution with slave_parallel_mode (SPM) set to none. Then, for slave_parallel_threads (SPT) values of 5, 10, 20 and 40, we have results with both non-optimistic (slave_parallel_mode=conservative) and optimistic (slave_parallel_mode=aggressive) executions. Then, for slave_parallel_threads values of 80, 160, 320, 640, 1280, 2560 and 5120, we have results only for optimistic executions. Note that we cannot have meaningful results for non-optimistic runs with slave_parallel_threads greater than 40 because the maximum group size on C was 35 (see Part 3 for more details).

The times presented below are in the format hours:minutes.seconds and represent the delay needed to process 24 hours of transactions. The number in parentheses is the speedup achieved relative to the single-threaded run.

Execution Times and Speedups

SPT  | SPM          | E1 SB-HD       | E1 SB-ND       | E2 SB-HD       | E2 SB-ND       | E3 SB-HD       | E4 SB-HD
-    | none         | 7:36.09        | 4:01.20        | 3:09.34        | 1:24.09        | 10:56.20       | 7:59.34
5    | conservative | 5:08.52 (1.48) | 3:52.54 (1.04) | 1:42.23 (1.85) | 1:17.09 (1.09) | 9:14.35 (1.18) | 5:41.16 (1.41)
5    | aggressive   | 4:56.02 (1.54) | 3:40.33 (1.09) | 1:39.51 (1.90) | 1:14.41 (1.13) | 9:16.49 (1.18) | 6:32.45 (1.22)
10   | conservative | 4:29.24 (1.69) | 3:37.36 (1.11) | 1:27.04 (2.18) | 1:12.57 (1.15) | 8:49.29 (1.24) | 5:25.06 (1.48)
10   | aggressive   | 4:12.49 (1.80) | 3:14.59 (1.24) | 1:23.23 (2.27) | 1:07.14 (1.25) | 8:37.28 (1.27) | 5:58.17 (1.34)
20   | conservative | 4:06.02 (1.85) | 3:24.45 (1.18) | 1:20.11 (2.36) | 1:12.32 (1.16) | 8:32.32 (1.28) | 5:14.20 (1.53)
20   | aggressive   | 3:33.46 (2.13) | 2:51.09 (1.41) | 1:10.53 (2.67) | 0:58.00 (1.45) | 8:06.49 (1.35) | 5:19.40 (1.50)
40   | conservative | 4:01.18 (1.89) | 3:21.11 (1.20) | 1:18.11 (2.42) | 1:11.19 (1.18) | 8:26.01 (1.30) | 5:09.45 (1.55)
40   | aggressive   | 3:11.19 (2.38) | 2:27.48 (1.63) | 1:02.15 (3.05) | 0:50.00 (1.68) | 7:34.28 (1.44) | 4:18.04 (1.86)
80   | aggressive   | 2:55.15 (2.60) | 2:11.27 (1.84) | 0:57.23 (3.30) | 0:43.48 (1.92) | 7:11.48 (1.52) | 3:20.43 (2.39)
160  | aggressive   | 2:42.22 (2.81) | 2:02.56 (1.96) | 0:56.24 (3.36) | 0:41.14 (2.04) | 6:42.28 (1.63) | 2:44.52 (2.91)
320  | aggressive   | 2:41.08 (2.83) | 1:57.40 (2.05) | 0:59.44 (3.17) | 0:43.33 (1.93) | 6:14.19 (1.75) | 2:22.48 (3.36)
640  | aggressive   | 2:42.52 (2.80) | 1:57.48 (2.05) | 1:09.31 (2.73) | 0:54.56 (1.53) | 5:32.46 (1.97) | 2:06.50 (3.78)
1280 | aggressive   | 2:43.00 (2.80) | 2:01.12 (1.99) | 1:33.47 (2.02) | 1:23.37 (1.01) | 5:05.29 (2.15) | 2:10.01 (3.69)
2560 | aggressive   | 2:46.21 (2.74) | 2:04.44 (1.93) | 2:28.25 (1.28) | 2:21.53 (0.59) | 4:46.43 (2.29) | 2:16.07 (3.52)
5120 | aggressive   | 2:45.39 (2.75) | 2:07.18 (1.90) | 4:54.18 (0.64) | 4:50.55 (0.29) | 4:49.34 (2.27) | 2:26.09 (3.28)

Graphs during Tests

If you spot something we might have missed in the graphs below, please post a comment. Those graphs include the number of commits per second, CPU stats, read IOPS and the percentage of retried transactions for all tests.

Graphs # 1a: E1 Stats - Slave with Binary Logs - High Durability (Commits, CPU, Read IOPS, Retried Transactions)

Graphs # 1b: E1 Stats - Slave with Binary Logs - Relaxed Durability (Commits, CPU, Read IOPS, Retried Transactions)

Graphs # 2a: E2 Stats - Slave with Binary Logs - High Durability (Commits, CPU, Read IOPS, Retried Transactions)

Graphs # 2b: E2 Stats - Slave with Binary Logs - Relaxed Durability (Commits, CPU, Read IOPS, Retried Transactions)

Graphs # 3a: E3 Stats - Slave with Binary Logs - High Durability (Commits, CPU, Read IOPS, Retried Transactions)

Graphs # 4a: E4 Stats - Slave with Binary Logs - High Durability (Commits, CPU, Read IOPS, Retried Transactions)

[1] At the time of the publication of this post, the latest release of MariaDB 10.1 is 10.1.17. Our tests were done with MariaDB 10.1.8 because they were run a long time ago (I am a little embarrassed to be that late in my blog post editing).

[2] In the implementation of optimistic parallel replication in MariaDB 10.1, the master is responsible for flagging DDL and non-transactional DML and for passing this information to slaves via the binary logs. This is why a MariaDB 10.1 master is needed to enable optimistic parallel replication on a slave. This also means that for optimistic parallel replication to work, master and slaves must have compatible storage engines for DML: if a DML is transactional on the master, it must be transactional on the slave. So a master using InnoDB and a slave using MyISAM will not work.

[3] Because the InnoDB Log Size was too small in our previous tests, those tests were run in non-optimal conditions. The results presented in this post should be considered more accurate.


Evaluating MySQL Parallel Replication Part 4: More Benchmarks in Production


Parallel replication is a much-anticipated feature of MySQL, available in MariaDB 10.0 and in MySQL 5.7. We already presented benchmark results with MariaDB 10.0 in the previous post of the series; in this post, we will look at a new type of replication introduced in MariaDB 10.1: optimistic parallel replication.

This post has an annex: Under the Hood. Benchmarking is a complex art and reporting results accurately is even harder. If all the details were put in a single article, it would make a very long post. The links to the annex should satisfy readers eager for more details.

Optimistic parallel replication is built on top of the previous implementation of parallel replication. To fully understand optimistic parallel replication in MariaDB 10.1, we must dive back into the implementation details of parallel replication in MariaDB 10.0.

Key Design Decision: In-Order Commit

In Part 1 of the series, we explained how parallel replication works in MariaDB 10.0 and in early versions of MySQL 5.7. In short, both MySQL 5.7 and MariaDB 10.0 identify parallelism on the master and send this information to the slaves via the binary logs [1]. However, MariaDB and MySQL differ in the way transactions are committed on slaves.

In MariaDB, transactions run in parallel on slaves are committed in the same order as they appear in the binary logs of the master [2]. So if a slave runs T1 and T2 in parallel and those transactions appear in this order in the binary logs of the master, the thread running T2 will wait for T1 to complete before committing T2 (even if T2 is ready to commit before T1).

By default, a MySQL slave running transactions in parallel simply commits transactions as they complete, without enforcing any ordering. In the above example, T2 will commit before T1. This could (and most of the time will) generate a different transaction ordering in the binary logs of the slaves [3][4].

When committing in-order, a transaction that needs to commit first (T1) can be blocked by another transaction that will commit later (T2). It is surprising that such dependencies are generated by transactions committing at the same time on the master, but this can still happen. This is a deadlock situation: T1 is blocked by T2 and T2 is waiting for T1.

The storage engine will not detect this deadlock because both dependencies are not under its control. The first dependency is known to the storage engine: T2 is holding a resource needed by T1. However, the second dependency is in the server, more precisely in the parallel replication code: T2 cannot commit before T1. To resolve this situation, the in-order parallel replication applier must detect that T2 is blocking T1 and it must kill T2. Once killed, T2 will release its resources, allowing T1 to proceed. T2 will then be retried.

This transaction killing and retrying does not happen very often in MariaDB 10.0, but its implementation is essential to avoid blocking replication. The occurrence of those retries can be monitored with the slave_retried_transactions global status. Below is a graph from such monitoring where we can see that three retries were needed in a four-minute interval. This is especially small considering that ~2.5K transactions were run per second on this slave (three retries for ~600,000 transactions).

Graph # 0: Retried Transactions on a MariaDB 10.0 Slave

Once this deadlock detection and resolution is implemented, the following becomes very straightforward:

Run all transactions in parallel, with in-order commit making sure that the data is consistent, and with deadlock detection avoiding replication to block.

This is called optimistic parallel replication and is implemented in MariaDB 10.1. This type of parallel replication does not rely on the master to identify parallelism. The slave tries to run as many transactions in parallel as possible to the limit of the slave_parallel_threads parameter while enforcing in-order commit. If no conflict happens, maximum speed is achieved. If a conflict (deadlock) is detected, a transaction that was optimistically run will be rolled-back to unblock in-order commit (this transaction will be retried later).

As long as there are not too many conflicts, or as long as the cost of rollbacks does not outweigh the benefit of running more transactions in parallel, optimistic parallel replication should give good results. What is yet unknown is how those results will compare to the conservative parallel replication (conservative is the MariaDB 10.1 name for the parallel replication in MariaDB 10.0). To answer that question, tests need to be done. The results are presented below.

Of course, the full implementation details are more complex (more details can be found in the annex about rollback and retries, DML vs DDL and non-transactional storage engines and the different optimistic parallel replication modes). However, this introduction is enough to understand the rest of this post, so let us go directly to benchmarking.

The Test

The test is the same as in the previous post: catching up with 24 hours of transactions. The four test environments are also very similar, their description can be found in the annex. In the graphs below, we compare speedups for non-optimistic executions (slave_parallel_mode=conservative) with speedups for optimistic execution (slave_parallel_mode=aggressive). We had to choose aggressive over optimistic because the latter will not give good results replicating via an intermediate master (more details can be found in the annex).

In the graphs, the Y axis shows the speedups with a value of one being the reference time without using parallel replication (slave_parallel_mode=none). The X axis shows the number of threads used: notice the logarithmic scale. The curve stopping at 40 on the X axis is showing the non-optimistic results. The data used to plot those graphs can be found in the annex.

Some reminders from the previous post:

  • Parallelism identification (slave group commit) was done with a maximum group size of 35, so increasing the number of threads past 40 for non-optimistic tests is not interesting.
  • The binary logging configuration had a noticeable impact on performance. The catch-up times are generally longer when enabling log-slave-updates, and disabling binary logging is not such a big win. The results below are obtained with binary logging enabled but with log-slave-updates disabled, referred to as SB in Part 3.
  • The workload of the four environments are different: E2 is a CPU-bound workload, E1 is also mostly CPU-bound but with some cache misses, E3 is a mixed CPU and IO workload, and E4 is an IO-bound workload.
  • As E4 is an IO-bound workload (mostly cache misses in the InnoDB buffer pool), reducing durability - referred to as ND in Part 3 - does not get a noticeable improvement (similar behavior has been observed for E3). For that reason, the results presented below only include high durability for E3 and E4 (referred to as HD).

Graph # 1a: E1 SB-HD - Conservative vs Aggressive Speedups
(CPU-bound workload with some cache misses)

Graph # 1b: E1 SB-ND - Conservative vs Aggressive Speedups
(CPU-bound workload with some cache misses)

Graph # 2a: E2 SB-HD - Conservative vs Aggressive Speedups
(CPU-bound workload)

Graph # 2b: E2 SB-ND - Conservative vs Aggressive Speedups
(CPU-bound workload)

Graph # 3a: E3 SB-HD - Conservative vs Aggressive Speedups
(Mixed CPU and IO workload)

Graph # 4a: E4 SB-HD - Conservative vs Aggressive Speedups
(IO-bound workload)

Discussion

The first surprise comes from observing speedups with very high number of threads. Even though the servers used for the tests only have 12 hyper-threaded cores for a total of 24 threads, speedups are still increasing past 80 threads for all configurations and up to 2560 threads for E3. Obviously, we cannot use more processing power when raising slave_parallel_threads from 40 to 80, so those growing speedups cannot be explained simply by using more CPU.

We think that those speedups are caused by replication prefetching. By increasing the number of threads, transactions that need data that is not in cache will trigger a read from disk earlier (prefetching). Triggering that read earlier is a big win and does not consume much CPU because the thread will go in IOWait state. Even if that transaction causes a conflict and is rolled-back, the data will be in cache for the retry. Also in this case, the cost of the rollback is insignificant compared to the gain of having the data in cache for the retry, so extra conflicts are not a problem. This concept is not new; it is known as replication prefetching and has already been discussed before by Baron Schwartz, Domas Mituzas and Yoshinori Matsunobu.

For all environments, aggressive parallel replication can produce much better speedups than conservative parallel replication. It is also very nice to see more significant speedups with low durability on E1 and E2 (conservative parallel replication is not giving good speedups there). It is absolutely obvious that aggressive parallel replication is a great improvement over conservative parallel replication.

I suspect (and I hope) that there are still bottlenecks to be removed. Optimistic parallel replication is a great improvement over conservative parallel replication, but we are still far from the speedup of 6 to 10 that we are looking for (maybe those expectations are unrealistic...).

Something that might slow down our tests is DDL (Data Definition Language: [CREATE | ALTER | TRUNCATE | DROP | ...] TABLE, ...), because DDL statements block the replication pipeline. As explained in detail in the annex, before being able to run a DDL, all previously started transactions need to commit. Moreover, before starting any transaction after a DDL, the DDL must complete. Below are some extracts from SHOW GLOBAL STATUS during the test:

  • E1: 197,499 Com_create_table
  • E1: 1148 Com_truncate
  • E1: 489 Com_drop_table
  • E1: 484 Com_rename_table

So in the quickest run of E1 (7060 seconds), we were doing on average 27 CREATE TABLE statements per second [5]. That obviously cannot be good for parallel replication. To make such problems easier to identify, I opened MDEV-10664 - Add statuses about optimistic parallel replication stalls.

Conclusion

As stated in Part 3 (but is still worth repeating): it is possible to test MariaDB parallel replication even if the master is an old version of MySQL/MariaDB. In our case, our masters were running MySQL 5.6, but the same could be applied to other versions.

Overall, the optimistic parallel replication shows very promising results: almost all speedups are better than conservative parallel replication.

The biggest surprise was to see speedup increasing past 80 threads. We could have thought that more threads than processing units would slow things down, but it is not the case. This is probably caused by threads being most of the time in a waiting state: either waiting for a previous transaction to commit, or waiting for an IO. For a pure CPU workload, we can expect contention and this is probably what explains the thrashing for E2.

Another surprise is that the best speedup (3.78) is achieved for an IO bound workload (E4). In this case, the biggest win of parallel replication seems to be getting more read IOPS from the disks subsystem by scheduling many IOs in parallel. In this environment (and probably in others), optimistic parallel replication with a high number of threads is acting as a replication prefetcher.

Up to now, all our tests were done using magnetic disks. It is unclear how both conservative and optimistic parallel replication would behave with solid state disks (SSDs). More tests are needed to understand how parallel replication will behave with SSDs.

If you are interested in this topic and would like to learn more, I will be giving a talk about MySQL and MariaDB Parallel Replication at Percona Live Amsterdam in October. All talks by Booking.com are:

We will also be hosting the Percona Live Community Dinner on October 4th. You can also meet us there if you want to know more about the cool things we do at Booking.com.

[1] MySQL 5.7 and MariaDB 10.0 have slightly different implementation of parallelism identification on the master. MariaDB 10.0 uses the binary log group commit optimization as accurately described in Part 1 (the group commit id is shown in MariaDB mysqlbinlog output as cid). From version 5.7.6, MySQL is tagging each transaction with two logical timestamps (last_committed and sequence_number in MySQL mysqlbinlog output).

[2] MariaDB also has an out-of-order parallel replication functionality based on its GTID implementation. This type of parallel replication might not commit transactions in the same order as they appear in the binary logs of the master. To take advantage of out-of-order parallel replication, hints must be given by the application to advertise what can be run in parallel. This is not the type of parallel replication we are discussing in this post (we are focusing on the in-order type).

[3] By allowing out-of-order transaction commit on slaves, MySQL can alleviate a problem previously discussed in Part 3 where too small a value for slave_parallel_threads could give suboptimal speedups.

[4] The slave-preserve-commit-order option allows enabling in-order commit in MySQL but this option needs log-slave-updates. I opened Bug#75396 to have this restriction removed as needing log-slave-updates to enable slave-preserve-commit-order looks like a step backward:

  • The goal of parallel replication is to get faster replication
  • According to most of our tests, log-slave-updates slows down replication
Moreover, keeping a copy of the master transactions on slaves (log-slave-updates) is not needed when Binlog Servers are already keeping an exact copy of the binary logs of the master. So needing log-slave-updates to enable slave-preserve-commit-order is a restriction we could do without.

[5] This number of CREATE TABLE statements is very high. It looks like the pattern CREATE TABLE IF NOT EXISTS is very common for E1. Checking for table existence and avoiding the CREATE TABLE when the table already exists might be a big win for parallel replication in E1. And for the people who do not believe 27 CREATE TABLE per second is possible (I also had to check it myself):

$ for i in $(seq 4978 5003); do
  mysqlbinlog $(printf "binlog.%06d" $i) |
    grep -i "CREATE TABLE IF NOT EXISTS"; done | wc -l
211753

5 ways to listen to your customers


Eye tracking in our usability lab in Amsterdam headquarters. Booking.com desktop usability test

One of our most valued principles at Booking.com is putting the customer at the center of everything we do. It's ingrained in our day-to-day lives, and we always want to design experiences that our customers love. This drive towards user-centered design is made much easier by working within a fast-paced experimentation environment. At Booking.com, we have many ways of getting our customers involved directly: daily user feedback reports, usability tests, user research highlights, and even visiting our customers in their homes to see how our products fit into their daily routine. These kinds of sources are our main idea repository when we brainstorm new features or improve existing ones. As designers, we use this kind of input to correct design decisions, work towards solving important UX issues, and improve our products.

Why do we find it so important to listen?

Real people interact with our products. Real people are on the other side of those screens. Understanding what real customers need is the key to improving our product; the customer is king and keeping your customer happy can have a significant impact on your business.

92% of consumers believe recommendations from friends and family over all forms of advertising ~ source: Nielsen.com


64% of marketing executives indicated that they believe word of mouth is the most effective form of marketing. Only 6% say they master it ~ source: forbes.com

Think about the last time you had a disappointing experience with a product. How many of your friends knew about it within just a day?

We want customers to always be able to talk to us. We aim to reduce the friction our customers face in their efforts to find a place to stay, and feedback from our users is a crucial source for finding these points of friction. Avoiding friction means fewer bounces, fewer drops in conversion, less frustration, and fewer lost word-of-mouth recommendations.

5 ways we listen to our customers

At Booking.com, we use a range of methods on a daily basis, all with the aim of improving user experience:

  1. Online Survey Tools
  2. Street level user testing
  3. Usability tests
  4. User research
  5. Diary studies

1. Online Survey Tools — concept feedback

Every year, more companies than ever offer online survey tools. They’ve become smarter, better at targeting the right audience and it’s now easier to reach more people with lower implementation costs. It’s an effective way to get feedback at scale from the first day you launch. You can focus on specific parts of the customer journey and run it without leaving the office.

Feedback prompt using SurveyGizmo for Booking.com landers

At Booking.com we use tools like Usabilla and SurveyGizmo to collect feedback on our app, email marketing campaigns, and web products. These tools can provide high quality feedback (especially for small website improvements), and surface customer journey friction points and interactions that could be improved. Here's an example of some high quality feedback:

The customer is frustrated, providing excellent feedback about a design element that can be improved

You can’t always expect the feedback to be constructive and usable to improve your products.

Not the most constructive feedback, but an important opinion nonetheless

Some of the most popular online tools can be found here.

Site exit surveys (Net promoter score)

Site exit tools let you collect feedback from potential customers who left your product. Customers might need more time than we'd expect to make up their minds, and even if they don't intend to come back to our site, their feedback can provide great insights. Some tools available to try out are Asknicely, Capterra, GetFeedback, and 4Screens.

2. Street level user testing

Street user testing is the simplest and least expensive way to get user feedback. You can leave your office anytime with a prototype in your hands and reach out to your potential customers. Inarguably, it can be pretty intimidating, as you need to approach people, interrupt their daily routine, introduce yourself, and then ask for their feedback. It's up to you whether to offer some kind of perk when you get valuable insights. From personal experience, in the beginning it was hard to engage participants to provide feedback. A common mistake was not setting the right expectations, for example whether they would need to book a property and whether they would need to use their own devices. With trial and error we adjusted our approach and settled on the following greeting message:

Hello! We’re from Booking.com and we’d love 10–15' of your time to get your feedback about our website. We’ll provide you with a phone and you don’t have to really book an actual hotel — we just want to see how you use our site and how we can help improve it.

People may give quick, less meaningful feedback because they want to be nice to you, or because they have more important things to do.

Street user testing is a method we love at Booking.com: no preparation is needed, we have a great park (Rembrandtplein) next to our offices that tourists visit every day, and the ideas for improvement have been great so far.

3. Usability tests

Usability tests can provide fast and high quality feedback on your product. The customers follow their natural behavior while the session is being recorded and guided by a moderator.

Booking.com usability lab in Amsterdam headquarters

The success of usability tests depends on many factors, chief of which are the experience of the moderator and the quality of the participants. The most common approach is to outsource the selection of participants to a local recruiting company. Participants are invited into a friendly environment because making them feel as comfortable as possible will promote natural behavior.

If a company doesn’t have an in-house usability lab then this can be a significant additional cost to organize a usability test.

The usability lab is a very powerful tool that we all have access to at Booking.com. During every usability test we work with our researchers to help us extract the best quality feedback possible. Moderation training is essential for everyone who wants to moderate their own sessions. The training is necessary to avoid common mistakes like asking leading or closed questions, although extensive experience is needed to master the art of moderation.

User testing through remote moderation tools

UserTesting.com allows you to connect with your customers remotely, assign them tasks, and observe the results through a recorded video. It's a great tool for conducting a usability test without having to invest in creating or renting a usability lab. The cost of each video is around $45 at UserTesting.com.

We also use this tool at Booking.com, especially when we want to approach customers in isolated parts of the world where it’s hard to have access to a usability lab.

Neuro-usability testing

At Booking.com, we’re always interested in trying out new, innovative methods of user-testing, and discovering how we can better understand our customers is a never-ending process. Neuro-usability is a way of measuring the Anxiety, Frustration and Excitement of the customer when they interact with our products, and, since this technology is available, we gave it a try:

Customer interacting with Booking.com Desktop room selection. Increased level of frustration

Customer interacting with Booking.com Desktop room selection. Graphs reflecting Excitement, Frustration and Attention

We have plans to experiment with other forms of neuro-usability tests that monitor sweating levels and skin contractions, which can be interpreted as meaningful emotional responses. Scientific and rock-solid methodologies for understanding user behavior will arise as technology evolves, and we want to be on top of it.

4. User research

User research produces excellent quality feedback and in-depth insights. It gives you the opportunity to observe customers using the product for extended periods.

At Booking.com we do more than 120 research studies every year across the planet. Each research study focuses on specific personas depending on our business needs and growth strategies. For example, you can't perform the same research for family travellers and business travellers. As the needs of different types of users are so different, your conclusions often can't be easily consolidated. It's incredible to see how differently customers in China and India interact with the same product. Cultural differences change the perception, understanding, and expectations of the product.

Mumbai India user research with primary focus the slow (2G) internet connection

If you are an observer in a user research session, it is very important to focus on the bigger picture of the product. It is easy to draw false conclusions when observing a user just because what you see aligns with your initial intentions for refining the product. By observing a behaviour you might believe that your idea has been validated by the user.

Every user research session comes with a detailed report accessible to everyone in the company. The report that we mostly use as designers at Booking.com is the Research Highlights.

Mumbai user research highlights. Highlights are separated into meaningful categories

5. Diary Studies

Diary studies at Booking.com. Results are exposed for inspiration.

Our research team at Booking.com performs diary studies as well. Diary studies are a form of long-term research with the same participants. The participants write entries about their activities, their thoughts, and the frustrations they face along their journeys.

The results of the diary studies are exposed in a prominent place in our headquarters to provide inspirational ideas for the relevant frontend teams.

Exposing the results like this is a great example of staying close to and listening to our customers. As the issues become personified, they become more engaging.

Wrap up

All the suggested tools and methods are effective and work. It’s up to you to start with the most affordable and least time consuming method if you need some proof that it can help you create a better product.

A free trial is usually available to test drive the tools in the industry. Find the most effective one for your product or service and give it a go. A mix of several tools would be optimal, but at the very least some kind of feedback tool should live in your website or app.

Ideas that come from real customers when using your product are already validated and are more likely to give you successful results. Start listening to them.

In the end, it comes down to this simple question: How can you optimize the user experience of your products if you don’t know what’s wrong in the first place?

A matter of design principle(s)


Some lessons learned from jumping into the deep end.

One of the core principles around which our way of working is based is the concept of the small, multi-disciplinary team. At Booking.com there is no design team, there is no dev team, there is no copy team. We'd rather refer to groups of people in the same field as communities; they scale so much better this way. While some of these communities may number in the hundreds, a team is usually around six people working together on a specific problem. Six people with a good mix of skills: designers, developers, usually one product owner, sometimes a copywriter or a data scientist, whatever the scope of the team requires.

In my four years here I’ve been part of 7 teams; choosing a new team is always one of my favourite and yet most challenging moments. It’s not easy leaving behind a bunch of awesome people, even if sometimes you’re just moving a couple desks down. It’s not easy to stop working on problems you’ve been trying to solve for a while and passing them on to someone else. It’s not easy leaving what is already a comfort zone to work on something you may know little about. But, because it’s a great opportunity to learn something new, I almost never pass the chance.

There’s always a learning curve when changing teams. And, to me, that’s essentially the point of doing so, to put fresh eyes on a fresh challenge, bring new perspectives to the team and learn something in the process. If it’s easy — if it’s not a little terrifying — you’re probably doing it wrong.


It’s a quick, informal meeting. There are no chairs in the room, just a tall, round, yellow table standing on three legs. From the other side of the table comes a challenge: Do I want to join the email marketing team? My first thought is that I know absolutely nothing about email marketing and that in fifteen years as a designer I haven’t designed a single email. Sure, it’s a little terrifying but I see that more as a reason to say yes than to say no.

In about twenty minutes the weight of the challenge I’ve just accepted starts to become more and more visible. It’s fine. I’ve done this before, I know how to keep calm in this situation. Take a deep breath and repeat after me: It can’t be that hard. It’s a process. Steps. I can figure out the steps.


What I usually do about now is take a good look at the product. New eyes can spot a lot of things that might be hard to see if you know the context and story behind them too well. I do this first, even before talking to people on the team. In this case I’ve gathered all the email campaigns we send out. I make notes, usually in the form of questions, trying to look at them from the perspective of the user, trying to challenge everything that’s there and identify what might be missing or could be even better.

Questions about design: says who?, why do we recommend this?, why is this here?, based on what?
Kind of like this.

This usually works out very well and gives me a lot of ideas to test and validate (to deliver the best possible customer experience we test everything we do at Booking.com). Sometimes it’s things the team already tried, sometimes my ideas are stupid or impossible. Most of the times I’m wrong. But I can promise there’s a lot of value in a fresh, critical look.

This time though I also tried something else. Remember I knew nothing about email marketing? To fill in a little of that gap I’ve decided to also make note of everything I liked — everything that worked, everything that, as a user, made me happy and told me what I needed to know. When your goal is to improve something it’s easy to focus on the negative — after all you don’t need to improve what’s already great. But when your goal is to learn, looking at what works makes a lot more sense.

I ended up with almost two hundred positive notes, a lot more than I would have thought looking at the long list of things that could be improved. Looking through the tags I made, some patterns emerged pretty quickly. Before I knew it I had six design principles for designing email. I'd like to say I came up with these, but they were pretty much already there; they just needed to be put into words. Looking at them now, they have value beyond just email marketing (but I'd need 2000 more words to get into that). Anyway, here they are:

1. Relevant

Relevant is hard. Being relevant at all is hard enough; being relevant NOW is even harder. But context is important, since people started taking email with them everywhere. And I mean everywhere.

But “is this relevant” is not a yes or no question. Relevance is a spectrum that goes all the way from: “I literally couldn’t care less” to “spot-on amazing.” That feeling you get when Google or Facebook know you a little too well.

This is an example of a relevant email. Relevant because summer means I need sunblock and I need sunglasses.

Summer is here! Here's some stuff you might need -- list of sunblock and sunglasses
A relevant email

In the next example, because of some extra insight, the email is even more relevant. I have a sunny vacation coming up, so I really need these things. I can't imagine how they would possibly know this, unless the same company sells flight tickets, sunscreen, and sunglasses. So maybe it's not a realistic scenario, but it should make my point.

Your trip to Curacao is in 1 week. We can deliver these before you finish packing -- list of sunblock and sunglasses
A really relevant email

2. Personal

Speaks to me and fits my needs. I don't think I need to sell you on this: personal emails work so much better than mass emails. You can look for opportunities to trigger emails based on user behavior, or add personalized elements or recommendations to mass campaigns. Feel free to throw a first name in there every once in a while; it's a nice touch, but don't assume that alone is enough to make the email personal.

Bike making funny sounds? That is because it has been a year already -- discount on bike maintenance
A personal email.

This is an example of an email being personal. This company knows my bike has not been serviced in a year, they know exactly what bike I have, and they remind me to get it checked. Because they know all this they can give me the exact price; I don't have to figure out myself what price matches my bike type. Throw my name in there and sign it with a person's name, preferably the guy who will service the bike for me, and it becomes almost impossible to tell that this is an automated marketing email rather than a personal one.

3. Engaging

Engaging is about making me want to take action but also allowing me to take action. If it triggers my interest but there’s no action to take, I will have an unsatisfying experience. Sometimes I get recommendations for amazing products and then find out they don’t ship them to Europe.

Don’t do this to me! I know you know where I live.

You are so close! Daily reminder for your BBQ master class.
An engaging email

This is an example of engaging elements. I’m reminded of a goal I set, I can see all the progress I made and how close I am to the next milestone. It will be hard to resist taking action and going for that level 3.

4. Clarity

This one I think is clear (see what I did there?). Some examples: I'm subscribed to a lot of marketing emails, and the infrequent ones should include a way of reminding me who they are and why I subscribed. Or, if I get a recommendation, let's say for a book, then include a reason why I will like that book.

First ad reads: CF283A toner. Second ad reads: Toner for your HP Laserjet
A clear email.

In this example here I’ve replaced the actual name of the product with what it is for. To me, it seems clear that in one of these scenarios I am more likely to understand why that product is in my email. In the example on the left I’m likely to ignore this recommendation because the product name is so obscure. Stating that clearly, the way a human would speak to me if I walked in a store to buy this, makes the version on the right clear.

5. Control

Control is a tough one. One example from Booking.com is language. We send email in 43 languages. I see a lot of companies offering content in multiple languages and selecting a language for you based on your country. A person’s location hasn’t been an accurate way to determine what language they speak in…ever. We give subscribers control over this, even if it is not explicit. You change the language on the website, the language in your email will also change. No extra steps, it just works.

Would you like less email? Check boxes with turning various email subscriptions on and off
Giving user control — content preferences.

In this example you can let users control how much email they get, or what content they are interested in. This works great when you have a lot of different content. Your unsubscribe page is a good place to offer this as an alternative to completely opting out. This gets double points for making your emails a lot more relevant to this subscriber and for keeping them around.

6. Inspiration

One way to look at this is by asking: what value does this email have for someone who cannot take action on it? What value does it have for someone who cannot buy what you are selling or cannot directly benefit from what you are offering? If you take away the chance of a transaction, is there any value or meaning left? It's a good way to keep people opening and reading your emails.

This value can come from beautiful storytelling. It can come from giving some useful information without asking for something in return. It can come from offering a brief moment of delight. Charity Water sends out some great stories. Lonely Planet prioritizes great content over selling travel guides. Kickstarter puts lovely, if random, quotes in the beginning of all their emails.

And if all else fails a kitten GIF will do the trick.

Article title and email with a cute_kitten.gif placeholder
An inspiring email.

In this example, the story, the inspiring content takes the front seat and the selling can happen on the landing page. If they feel you’re not always trying to sell something, subscribers will look forward to your next email.


To sum up — I want to confidently say this about any email we send out: This email is relevant, personal and engaging because it offers clarity, control and inspiration.

This email is relevant, personal, and engaging because it offers clarity, control, and inspiration
And now I’ve got a framework of guiding principles. Yay me!

It took just a couple of days to come up with these principles and just a couple of weeks to refine them. That was a little too easy, so I never fully trusted them. I spent the next year putting them to test after test, waiting for the data, the numbers, to tell me I was wrong. To my surprise, I can tell you that didn't happen.

What did happen is that I found a new challenge, and a new team, and that I'm back working on web products. But if you want to design email, in 43 languages, for tens of millions of subscribers in over 200 countries, at one of the largest digital ecommerce companies in the world, you can take on the challenge! Don't worry if email is not your thing; we're also looking for UX designers and mobile app designers.

So, what did I learn from creating my own design principles?

  • If you’re going to have design principles, define them early to make sure everyone is on the same page.
  • It’s important to get buy-in from everyone involved. Principles will not help much if you are the only one using them.
  • Validate your principles through research and experimentation. That will make the above point easier.
  • Be flexible, but don’t keep changing or breaking the rules.
  • Be positive. If you try to define what your design should not be you will end up with a very long list.

Without defining design principles first, it might seem like there are a million directions in which you can take a product, and a million different things you can try. But be honest with yourself: however much you'd like to, you can't try every one of those things. You need some direction. It may seem counterintuitive, but having these principles written down is important because of the restrictions they set. These restrictions bring clarity and focus, and they help you make better design decisions.

Hardening Perl's Hash Function


In 2003 the Perl development community was made aware of an algorithmic complexity attack on Perl's hash table implementation[1]. This attack was similar to reports over the last few years of attacks on other languages and packages, such as the Java, Ruby and Python hash implementations.

The basic idea of this attack is to precompute a set of keys which would hash to the same value, and thus the same storage bucket. These keys would then be fed (as a batch) to a target which would then have to compare each key against each previously stored key before inserting the new key, effectively turning the hash into a linked list, and changing the performance profile for inserting each item from O(1) (amortized) to O(N). This means that the practice of loading arguments such as GET/POST parameters into hashes provided a vector for denial of service attacks on many HTTP-based applications.
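
To make the cost concrete, here is a toy separate-chaining hash table in Perl. It is a deliberately simplified model, not Perl's real C implementation, and the hash function is whatever you pass in. With a hash function whose output the attacker can predict, every key lands in the same bucket and each insert degenerates into a scan of everything stored so far:

    use strict;
    use warnings;

    # Toy separate-chaining hash table; real hashes store entries in C.
    my @buckets = map { [] } 1 .. 8;

    sub toy_insert {
        my ($key, $value, $hash_func) = @_;
        my $bucket = $buckets[ $hash_func->($key) % @buckets ];
        for my $entry (@$bucket) {       # compare against every stored key
            if ($entry->[0] eq $key) {
                $entry->[1] = $value;
                return;
            }
        }
        push @$bucket, [ $key, $value ];
    }

    # Worst case: the attacker knows exactly where keys will land.
    my $degenerate_hash = sub { 0 };
    toy_insert("key$_", $_, $degenerate_hash) for 1 .. 1000;  # ~500,000 comparisons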

As a response to this, Perl implemented a mechanism by which it would detect long chains of entries within a bucket and trigger a "hash split". This meant it would double the number of buckets and then redistribute the previously stored keys as required into the newly added buckets. If, after this hash split, the chain was still unacceptably long, Perl would cause the hash to go into a special mode (REHASH mode) where it uses a per-process random hash seed for its hash function. Switching a normal hash to this special mode would cause Perl to allocate a new bucket array, recalculate all of the previously stored keys using the random seed and redistribute the keys from the old bucket array into the new one. This mitigated the attack by remapping the previously colliding keys into a well distributed set of randomly chosen new buckets.

At this point the Perl community thought we had put the subject of hash collision attacks behind us, and for nearly 10 years we heard little more on the subject.

Memory Exhaustion Attack On REHASH Mechanism

Over the years occasionally the subject of changing our hash function would come up. For instance Jarkko made a number of comments that there were faster hash functions and in response I did a bit of research into the subject, but little came of this work.

In 2012 this changed. I was working on several projects that made heavy use of Perl's hash function, and I decided to invest some effort into seeing whether other hash functions would provide performance improvements. At the same time other people in the Perl community were becoming interested, partly due to my work and partly due to the publicity from the multi-collision attacks on Python's and Ruby's hash functions (MurmurHash and CityHash). This was publicity I actually did not notice until after I had pushed patches to switch Perl to use MurmurHash as its default hash, something that got reverted real fast.

In restructuring the Perl hash implementation so it was easier to test different hash functions, I became well acquainted with the finer details of the implementation of the REHASH mechanism. Frankly it got in the way and I wanted to remove it outright. While arguing about whether it could be replaced with a conceptually simpler mechanism I discovered that the defenses put in place in 2003 were not as strong as had been previously believed. In fact they provided a whole new and, arguably, more dangerous attack vector than the original attack they were meant to mitigate. This resulted in the perl5 security team announcing CVE-2013-1667, and the release of security patches for all major Perl versions since 5.8.x.

The problem was that the REHASH mechanism allowed an attacker to create a set of keys which would cause Perl to repeatedly double the size of the hash table, but never trigger the use of the randomized hash seed. With relatively few keys the attacker could make Perl allocate a bucket array with up to 2^32 hash buckets, or as many as memory would allow. Even if the attack did not consume all the memory on the box there would be serious performance consequences as Perl remapped the keys into ever increasing bucket arrays. Even on fast 2013 hardware, counting from 0 to 2^32 takes a while!

This issue affected all versions of Perl from 5.8.2 to 5.16.2. It does not affect Perl 5.18. For those interested the security patches for these versions are as follows:

    maint-5.8:  2674b61957c26a4924831d5110afa454ae7ae5a6
    maint-5.10: f14269908e5f8b4cab4b55643d7dd9de577e7918
    maint-5.12: 9d83adcdf9ab3c1ac7d54d76f3944e57278f0e70
    maint-5.14: d59e31fc729d8a39a774f03bc6bc457029a7aef2
    maint-5.16: 6e79fe5714a72b1ef86dc890ff60746cdd19f854

At this time most Perl installations should be security-patched. Additionally, official Perl maintenance releases 5.16.3 and 5.14.4 were published. But if you would like to know whether you are vulnerable, you can try the following program:

    perl -le'@h{qw(a h k r ad ao as ax ay bs ck cm cz ej fz hm ia ih is
      iz jk kx lg lv lw nj oj pr ql rk sk td tz vy yc yw zj zu aad acp
      acq adm ajy alu apb apx asa asm atf axi ayl bbq bcs bdp bhs bml)}
      =(); print %h=~/128/ && "not "," ok # perl $]"'

The following are statistics generated by the time program for the full attack (not the one-liner above) against a Perl 5.16 with and without the fix applied (identical/zero lines omitted) on a laptop with 8GB:

Without the fix patch (0ff9bbd11bcf0c048e5b3e4b893c52206692eed2):

         User time (seconds): 62.02
         System time (seconds): 1.57
         Percent of CPU this job got: 99%
         Elapsed (wall clock) time (h:mm:ss or m:ss): 1:04.01
         Maximum resident set size (kbytes): 8404752
         Minor (reclaiming a frame) page faults: 1049666
         Involuntary context switches: 8946

With the fix patch (f1220d61455253b170e81427c9d0357831ca0fac) applied:

         User time (seconds): 0.05
         System time (seconds): 0.00
         Percent of CPU this job got: 56%
         Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.09
         Maximum resident set size (kbytes): 16912
         Minor (reclaiming a frame) page faults: 1110
         Involuntary context switches: 3209

But this doesn't explain all of the changes in Perl 5.18

The observant reader will have realized that if we could patch older Perls to be robust against CVE-2013-1667, then we could also have patched 5.18 and avoided the problems that were caused by changing Perl's hash function. The reason we went even further than those maintenance patches is that we found out that Perl had further, if less readily exploitable, vulnerabilities, and we wanted to do our best to fix them all.

This part of the story starts with Ruslan Zakirov posting a report to the perl5-security mailing list. The report outlined the basis of a key discovery attack on Perl's hash function. At first the Perl security team was not entirely convinced, but he then followed up with more code that demonstrated that his attack was sound. This development meant that the choice of a random seed would not make Perl's hash function robust to attack. An attacker could relatively efficiently determine the seed, and then use that knowledge to construct a set of attack keys that could be used to attack the hash function.

Nicholas Clark then ramped things up a notch further and did some in-depth analysis on the attack and the issues involved. At the same time so did Ruslan and myself. The conclusion of this analysis was that the attack exploited multiple vulnerabilities in how Perl's hash data structure worked and that the response would similarly require a multi-pronged approach.

Changes to the One-At-A-Time function

The first vulnerability was that Bob Jenkins' One-At-A-Time hash, which Perl used, does not "mix" the seed together with the hashed data well enough for short keys. This allows an attacker to mount a key discovery attack by using small sets of short keys and the order they were stored in to probe the "seed" and eventually expose enough bits of the seed that a final collision attack could be mounted.

We addressed this issue by making Perl append a four digit, randomly chosen suffix to every string it hashed. This means that we always "mix" the seed at least 4 times, and we mix it with something that the attacker cannot know. This effectively doubles the number of bits used for "secret" state, and ensures that short keys do not "leak" information about the original seed. The reason we use a suffix is that adding a prefix is the same as starting with a different initial seed state, so does not add any extra security. A suffix modifies the final state after the user input is provided and increases the search space an attacker must consider.

Related to this change was that the original One-At-A-Time function was potentially vulnerable to multi-collision attacks. An attacker could precalculate one or more suffixes such that

H(x) == H( concat(x, suffix) )

which would then allow an attacker to trivially construct an infinite set of keys which would always collide into the same bucket. We hardened the hash by mixing in the length of the key into the seed. We believe that this more or less eliminates the possibility of a multi-collision attack as it means that the seed used to calculate H( concat(x, suffix) ) would not be the same seed as H( concat(x, suffix, suffix) ). Cryptographers are invited to prove us wrong.
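
As a rough sketch of both hardening ideas, here is a seeded One-At-A-Time hash in Perl with the key length folded into the initial state and a secret random suffix hashed after the caller's bytes. This is illustrative only: Perl's real hash lives in C, and the exact constants, suffix handling, and mixing differ (the sketch also assumes a 64-bit perl so the intermediate shifts do not overflow).

    use strict;
    use warnings;

    my $MASK = 0xFFFFFFFF;   # keep the state to 32 bits

    sub hardened_oaat {
        my ($key, $seed, $suffix) = @_;
        # Fold the key length into the starting state.
        my $hash = ($seed ^ length $key) & $MASK;
        # Hash the caller's bytes, then the secret suffix bytes.
        for my $byte (unpack 'C*', $key . $suffix) {
            $hash = ($hash + $byte)         & $MASK;
            $hash = ($hash + ($hash << 10)) & $MASK;
            $hash =  $hash ^ ($hash >> 6);
        }
        $hash = ($hash + ($hash << 3))  & $MASK;
        $hash =  $hash ^ ($hash >> 11);
        $hash = ($hash + ($hash << 15)) & $MASK;
        return $hash;
    }

    # Per-process secrets, chosen once at startup.
    my $seed   = 0xDEADBEEF;
    my $suffix = pack 'C4', map { int rand 256 } 1 .. 4;
    printf "%08x\n", hardened_oaat('some key', $seed, $suffix);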

Reduce Information Leakage

The second vulnerability was that it is all too easy to leak information about the hash function to an attacker. For instance a web page might accept a set of parameters and respond with information for each of those parameters in the natural key order for the hash. This might provide enough information to mount a key discovery attack.

In order to prevent this information leakage we randomize the element order returned by the keys() and each() functions on the hash. We do this by adding a mask to each hash, and when an insert into the hash occurs we modify the mask in a pseudo-random way. During traversal we iterate from 0 to the k-th bucket and then XOR the iteration value with the mask. The result is that every time a new key is added to the hash the order of keys will change more or less completely. This means that the "natural" key order of the hash exposes almost no useful data to an attacker. Seeing one key in front of another does not tell you anything about which bucket the key was stored in. A nice side effect of this is that we can use this mask to detect people inserting into a hash during each() traversal, which generally indicates a problem in their code and can produce quite surprising results and be very difficult to debug. In Perl 5.18 when this happens we warn the developer about it.
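
A toy model of that traversal masking, again in Perl rather than the real C implementation: the bucket index used during iteration is XORed with a per-hash mask, and the mask is perturbed on every insert, so the externally visible order shifts as keys are added.

    use strict;
    use warnings;

    my @buckets = ('a' .. 'h');   # stand-in for an 8-slot bucket array
    my $mask    = 0;              # per-hash iteration mask

    sub note_insert {
        # Perturb the mask pseudo-randomly whenever a key is inserted.
        $mask = ($mask ^ int rand 8) & 7;
    }

    sub traversal_order {
        # Visit bucket (i XOR mask) for i = 0 .. last index.
        return map { $buckets[ $_ ^ $mask ] } 0 .. $#buckets;
    }

    print join(' ', traversal_order()), "\n";
    note_insert();
    print join(' ', traversal_order()), "\n";   # most likely a different order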

A third vulnerability is related to the case where two keys are to be stored in the same bucket. In this case the order of the keys was predictable: the most recently added key would be "first" out during a keys() or each() traversal. This in itself allows a small amount of data to leak to a potential adversary. By identifying such a case one could find two (or more) strings which had the same least significant bits. By stuffing more keys into the hash and triggering a hash split an attacker could determine that the newly added bit of the hash value was different, or the same, for the two keys. Without the key-order randomization logic mentioned previously the attacker could also determine which of the two had a 1 or 0 in the most significant bit of the used part of the hash value.

While we were not yet able to construct an actual attack based on this information we decided to harden against it anyway. This is done by randomly choosing whether we should insert the colliding key at the top of a bucket chain or if we should insert at the second from top in the chain. Similarly during a hash split we also make such a decision when keys collide while being remapped into the new buckets. The end result is that the order of two keys colliding in a bucket is more or less random, although the order of more than two keys is not.
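
Sketched in Perl, the colliding-key hardening is just a coin flip on where the new entry goes in the bucket chain (the real code, of course, operates on perl's C-level linked lists):

    use strict;
    use warnings;

    sub insert_into_bucket {
        my ($chain, $entry) = @_;            # $chain: array ref modelling the bucket
        if (@$chain && int rand 2) {
            splice @$chain, 1, 0, $entry;    # second from the top
        }
        else {
            unshift @$chain, $entry;         # top of the chain
        }
    }

    my @bucket;
    insert_into_bucket(\@bucket, $_) for qw(one two);
    print "@bucket\n";   # the relative order of the two colliding keys is random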

People complain. Randomization is good anyway!

We introduced randomization into Perl's hash function in order to harden it against attack. But we have discovered that this had other positive consequences that we did not foresee.

The first of these initially appeared to be a downside. Perl's hash function behaved consistently for more than a decade. Over that time Perl developers inadvertently created dependencies on this key order. Most of the examples of this were found in test files of CPAN modules: many of us got lazy and "froze" a key order into the test code, for example by embedding the output of a Data::Dumper call into one's tests. Some of these, however, were real bugs in reputable and well-tested modules.
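
A typical example of such a frozen key order, and the usual fix, looks something like this (the commented-out comparison against a stored string is the fragile pattern; $frozen_output and the test function are hypothetical):

    use strict;
    use warnings;
    use Data::Dumper;

    my %config = (host => 'example.com', port => 80);

    # Fragile: relies on %config always dumping its keys in the same order.
    # is(Dumper(\%config), $frozen_output);

    # Robust: force a deterministic key order before comparing or freezing.
    $Data::Dumper::Sortkeys = 1;
    print Dumper(\%config);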

By making Perl's key order random these dependencies on key order became instantly visible and a number of bugs that probably manifested themselves as "heisenbugs" became regular occurrences, and much easier to track down and identify. I estimate that for every two "non-bugs" (in things like test code) that we found, there was one "real bug" that was identified as well. Considering one of these was in the Perl core, and the other was in DBI, I personally consider this to be a good result.

Many people object that randomization like this makes debugging harder. The premise is that it becomes difficult to recreate a bug and thus debug it. I believe that in practice it is the opposite. Randomization like this means that a formerly rare bug becomes common. Which in turn means it becomes much more obvious that it is related to subtle dependencies on key order. Effectively making it much easier to find such problems.
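
For the cases where you really do want a repeatable order while chasing such a bug, Perl 5.18 lets you pin the behaviour from the environment; if I am reading the 5.18 perlrun documentation correctly, something along these lines gives a stable key order between runs (do not do this in production, as it throws away the hardening):

    $ PERL_HASH_SEED=0 PERL_PERTURB_KEYS=0 perl -le 'my %h = (a=>1, b=>2, c=>3); print for keys %h'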

A last benefit of randomizing the hash function is that we can now, at any time, change or replace the hash function that Perl is built with. In fact 5.18 is bundled with multiple hash functions including the presumed cryptographically strong Siphash. Since hash order from now on will be random, we don't have to worry if we change the function. External code should already be robust to the hash order being unpredictable.

How tangible are attacks on Perl's hash function?

There has been a lot of discussion on this subject. Obviously erring on the side of caution on security matters is the right course of action. Nevertheless there is a lot of debate on how practical an attack like this is in real production environments where Perl is commonly used, such as web servers. Here are some of the points to keep in mind:

  1. Perl's hash algorithm uses an array of buckets whose size is always between the number of keys stored in it and a factor of two larger. This means that a hash with 20 keys in it will generally have 32 buckets, a hash with 32 keys will be split into 64 buckets, and so on. The more keys are inserted into a hash, the less likely further keys are to be put into the same bucket. So attacking a hash cannot make non-attack keys slower[2]. An attacker basically only slows their own fetches down; except as a by-product of resource consumption, they will not affect other requests.
  2. For an attack to reach DOS proportions the number of items inserted into the hash would have to be very, very large. On modern CPUs a linked list of thousands to hundreds of thousands of keys would be necessary before there was serious degradation of service. At this point even if the attack was unsuccessful in terms of degrading Perl's hash algorithm, it would still function effectively as a data-flooding denial of service attack. Therefore, focusing on the hash complexity aspect of the attack seems unwarranted.
  3. Rudimentary out-of-band measures are sufficient to mitigate an attack. Hard restrictions on the number of keys that may be accepted by publicly facing processes are sufficient to prevent an attack from causing any damage. For instance, Apache defaults to restricting the number of parameters it accepts to 512, which effectively hardens it against this type of attack. (This is one reason the attack on the rehash mechanism is so important: a "successful" attack requires relatively few keys.) Similarly, a well-designed application would validate the parameters it receives and not put them in the hash unless they were recognized.
  4. So long as the chosen hash function is not vulnerable to multi-collision attacks, simple per-process hash seed randomization makes the job of finding an attack key set prohibitively difficult. One must first perform a hash-seed discovery attack, then generate a large set of keys. If restrictions on the number of keys the process will accept are in place, then the keys must be very large before collisions would have a noticeable effect. This also makes the job of finding colliding keys all the more expensive.
  5. In many circumstances, such as a web-service provider, the hosts will be behind load balancers. This either means that every web host uses a different hash seed, making hash-seed discovery attacks very difficult, or it requires an attacker to open a very long-running, persistent session with the server they wish to attack. The latter should be easily preventable via normal monitoring procedures.

For all these reasons it appears that hash-complexity attacks in the context of Perl and web hosting environments are of limited interest so long as:

  1. The hash function does not allow multi-collision attacks.
  2. The hash function uses at least per-process hash seed randomization.
  3. The interface to untrusted potential attackers uses simple, hard limits on the number of keys it will accept.

These properties are relatively easy to accomplish without resorting to cryptographically strong hash functions (which are generally slow), or other complicated measures to prevent attacks. As the case of Perl's rehashing flaw has shown, the cure may be worse than the disease. The code that this CVE exploits for its attack was added to Perl as part of our attempt to defend against the hypothetical attack vector of excessive bucket collisions. We are at this time unaware of any real attack of this form.

Hash Functions For Dynamic Languages

It seems like the bulk of research on hash functions has focused on finding fast, cryptographically secure hash functions for long strings containing binary data. However, dynamic languages, like Perl, make heavy use of hash functions on small strings, which are often restricted to simple alphanumeric characters. Examples are email addresses, identifiers like variables and method names, single character keys, and lists of numeric ids and similar use cases. Not only are they relatively short strings, but they use restricted sets of bits in their bytes.

So far it appears that Bob Jenkins' One-At-A-Time-Hash with minor modifications provides an acceptable solution. It seems to have good distribution properties, is reasonably fast for short strings, and - with the hardening measures added in Perl 5.18 - it appears to be robust against attack. Analysis done by the Perl5 Security Team suggests that One-At-A-Time-Hash is intrinsically more secure than MurmurHash. However to my knowledge, there is no peer-reviewed cryptanalysis to prove it.

There seems to be very little research into fast, robust, hash algorithms which are suitable for dynamic languages. Siphash is a notable exception and a step forward, but within the Perl community there seems to be consensus that it is currently too slow, at least in the recommended Siphash-2-4 incarnation. It is also problematic that its current implementation only supports 64 bit architectures. (No doubt this will improve over time, or perhaps even already has.)

Universal hashing also seems promising in theory, but unproven in practice for cases like Perl, where the size of the hash table can be very small, and where the input strings are of variable size, and often selected from restricted character sets. I attempted to implement a number of Universal Hashing-style hash functions for Perl, and was disappointed by extremely poor distributions in the bucket hash for simple tasks like hashing a randomly selected set of integers in text form. This may have been a flaw in my implementation, but it appeared that Universal Hashing does not perform particularly well when the input bytes are not evenly distributed. At the very least further work is required to prove the utility of Universal Hashing in a dynamic language context.

The dynamic/scripting language community needs the academic computing community to provide a better tool box of peer reviewed string hash functions which offer speed, good distribution over restricted character sets and on short strings, and that are sufficiently hardened that in practical deployments they are robust to direct attack. Security is important, but theoretical attacks which require large volumes of key/response exchanges cannot trump requirements such as good distribution properties and acceptable performance characteristics. Perl now makes it relatively easy to add and test new hash functions (see hv_func.h), and would make a nice test bed for those interested in this area of research.

Afterwards

I would like to thank Nicholas Clark, Ruslan Zakirov, and Jarkko Hietaniemi for their contributions which led to the changes in Perl 5.18. Nicholas did a lot of deep analysis and provided the motivation to create my attack proof, Ruslan provided analysis and working key-discovery attack scripts against Perl's old hash function, and Jarkko motivated me to look into the general subject.

[1] See also: the bug filed against Perl and the original research.

[2] If anything, such an attack might make access to keys that aren't part of the attack faster: The attack keys cluster in one bucket. That means the other keys are much more likely to spread out among the remaining buckets that now have fewer keys on average than without the attack.

Native Extensions for Perl without Smoke and Mirrors


The standard mechanism by which one can hook native extensions into Perl, dubbed XS for "eXternal Subroutine", is often criticized for its arcane syntax and for obscuring how the interaction with the Perl VM works while still requiring some familiarity from the programmer. This article aims to provide a bottom-up approach to understanding how native extensions interact with Perl while eschewing XS altogether, hopefully providing complementary insights.

This is not to discourage anybody from learning XS to extend Perl. Quite the opposite, XS can be a fantastically useful tool that automatically generates most of the code we will analyze below. It also provides an important portability and compatibility layer: The code in this article is not guaranteed to work on a future version of Perl and likely won't work on all platforms that Perl supports. In short, please do use XS for your real-world problems.

Off we go. Note that all code snippets in this article can be found as a complete CPAN distribution on GitHub. What we'll do first is come up with a C function that works the same as an XS-generated XSUB[1]. The function signature is:

    void my_sum_xs_is_evil(pTHX_ CV *cv)

The strange pTHX_ token is a Perl API macro and stands for "parameter for threading context". It will either be compiled out altogether if your Perl does not support ithreads[2], or it will be expanded to PerlInterpreter *my_perl,. Note the underscore. Including it results in the trailing comma. The cv parameter is basically the Perl subroutine reference that represents our XSUB from the point of view of Perl.

Before we dive into the actual function implementation, it pays off to briefly consider how Perl passes arguments to functions. Basically, it uses about 666 stacks flying in close formation. But we can get away with only considering two of them: The argument stack (also called the Perl stack) and the mark stack. The argument stack holds pointers to the actual function arguments, first to last. Below our data, it may also hold pointers to the arguments of our caller function, so we need the mark stack to tell us where our function's arguments begin: The mark stack holds integer offsets into the argument stack. When our function is invoked, Perl will remember the current position in the argument stack by pushing it as an integer onto the mark stack. Then it pushes the function arguments onto the argument stack, first to last. This means to get our data, we have to take the top element off the mark stack. From this offset into the argument stack all the way to the top of the argument stack, we can then access our parameters:

    void my_sum_xs_is_evil(pTHX_ CV *cv)
    {
      /* Get the top "mark" offset from the stack of marks. */
      I32 ax = *PL_markstack_ptr--;

      /* PL_stack_base is the pointer to the bottom of the
       * argument stack. */
      SV **mark = PL_stack_base + ax;

      /* Local copy of the global Perl argument stack pointer.
       * This is the top of the stack, not the base! */
      SV **sp = PL_stack_sp;

      /* And finally, the number of parameters for this function. */
      I32 items = (I32)(sp - mark);

      int i;
      double sum = 0.;

Followed that? We now have a pointer to the first argument (in the mark variable) as well as the number of parameters. The additional declarations will be used for our actual function below. Now on to actually using the function parameters and calculating their sum. (Remember? That was the point of the entire exercise.)

      /* Move stack pointer back by number of arguments.
       * Basically, this means argument access by increasing index
       * in "first to last" order instead of access in
       * "last to first" order by using negative offsets. */
      sp -= items;

      /* Go through arguments (as SVs) and add their *N*umeric *V*alue to
       * the output sum. */
      for (i = 0; i < items; ++i)
        sum += SvNV(*(sp + i + 1)); /* sp+i+1 is the i-th arg on the stack */

If we move the stack pointer back by the number of arguments, we can conveniently access the i-th argument by simple pointer arithmetic. Now, all we have to do is return one floating point number as a Perl SV. For this, it's helpful to know that Perl expects us to put return values in the same stack elements that we received the arguments in. If necessary, we can put additional return values in the stack slots after (or above) the ones that held the arguments.

      const IV num_return_values = 1;

      /* Make sure we have space on the stack (in case the function was
       * called without arguments) */
      if (PL_stack_max - sp < (ssize_t)num_return_values) {
        /* Oops, not enough space, extend. Needs to reset the
         * sp variable since it might have caused a proper realloc. */
        sp = Perl_stack_grow(aTHX_ sp, sp, (ssize_t)num_return_values);
      }

      /* Push return value on the Perl stack, convert number to Perl SV. */
      /* Also makes the value mortal, that is avoiding a memory leak. */
      *++sp = sv_2mortal(newSVnv(sum));

      /* Commit the changes we've done to the stack by setting the global
       * top-of-stack pointer to our modified copy. */
      PL_stack_sp = sp;
      return;
    }

First we make sure that the argument stack has enough space for our return value. Then we push a new, mortalized Perl SV (holding our sum) onto the argument stack. Umm, mortalized? Well, I lied earlier. The example requires three of Perl's stacks. Perl uses reference counting to manage its memory[3]. That doesn't come for free. As a very significant optimization (and an equally significant source of bugs) the Perl argument stack does not own a reference count of the values that are referenced from the stack. This poses a problem when returning a new value: The only reference to it will be on the argument stack, which refuses to own it! To work around this problem, Perl has the mortal stack (or tmps stack). The mortal stack owns a reference to each item that's pushed onto the stack. sv_2mortal above takes the newly created SV (which is created with a reference count of 1) and pushes a pointer to the SV onto the mortal stack. At the end of each executed statement, the mortal stack is cleaned up and the reference count of each item on the stack is decremented, possibly causing the item to be freed[4]. Thus the name: items on the stack are mortal, soon to pass away.

This is really all it takes to write an XSUB from scratch, without using the actual XS glue language. You lose the benefits of having automatic type-mapping for more complex types than just numbers, as well as many other useful bits, of course. The rest of the article will be concerned with actually hooking our XSUB into Perl. Again, no smoke and mirrors, without resorting to XS arcana where possible.

Our freshly devised C function needs to be compiled and linked into a shared library. For this purpose, we use the normal Perl module toolchain that otherwise builds XS modules. After writing the function to a .c file, we can get away with a minimal Makefile.PL configure-script such as this:

    use 5.008005;
    use ExtUtils::MakeMaker;
    WriteMakefile(
      NAME   => 'XS::WithoutXS',
      OBJECT => '$(O_FILES)', # link all the object files
    );

Note that I've chosen to call the CPAN distribution XS-WithoutXS and the accompanying module/namespace will be XS::WithoutXS. The Perl toolchain takes care of compiling the .c files in the distribution directory and linking them (normally together with the output of the XS compilation) into a shared library blib/arch/auto/XS-WithoutXS/WithoutXS.so. Great. All that's left to do is write some Perl code to actually invoke the function. Normally, the standard XSLoader module will do all of this for us, but let's do it by hand. First, the usual Perl module boilerplate:

    package XS::WithoutXS;
    use strict;
    use warnings;
    use DynaLoader ();

    my $SharedLibrary;

The most noteworthy bit is that we're loading the DynaLoader module. It is a low-level module provided by the Perl core that provides a portable way of locating symbols in shared libraries. Now, we need to locate the shared library that holds our manually-crafted XSUB.

    sub setup_so_access {
      my $pkg = shift;
      return if defined $SharedLibrary;

      my @pkg_components = split /::/, $pkg;
      my $pkg_path       = join "/", @pkg_components;

      my @dirs = (map "-L$_/auto/$pkg_path", @INC);
      my (@mod_files) = DynaLoader::dl_findfile(@dirs, $pkg_components[-1]);
      die "Failed to locate shared library for '$pkg'" if not @mod_files;

      $SharedLibrary = DynaLoader::dl_load_file($mod_files[0]);
    }

Much of this is really just to construct the search paths for the shared library that the toolchain generated for us. We then use dl_findfile to let DynaLoader locate it, and dl_load_file to link the first match into our process. What's left is the logic to locate a symbol in the shared library and expose it to Perl as an XSUB.

    sub newXS {
      my ($perl_func_name, $full_symbol_name, $filename) = @_;

      my $sym = DynaLoader::dl_find_symbol($SharedLibrary, $full_symbol_name);
      die "Failed to locate $full_symbol_name" if not defined $sym;

      DynaLoader::dl_install_xsub($perl_func_name, $sym, $filename);
    }

All the magic is happening in DynaLoader again: We locate the symbol by name in the shared library and then call dl_install_xsub, which will create a new CV (remember: a CV is a code/sub reference), and install it as the Perl function $perl_func_name. The filename is used for error reporting only. Thus we can wrap things up with one call to each of the two functions:

    setup_so_access(__PACKAGE__); # assuming package == dist name, see also below
    newXS("my_sum", "my_sum_xs_is_evil", "this_is_not_xs.c");

And thus we can call it from Perl:

    $ perl Makefile.PL
    $ make
    $ perl -Mblib -le 'use XS::WithoutXS; print XS::WithoutXS::my_sum(5,3,2,1);'
    11

Voila. Writing extensions in C isn't that hard, is it? But tedious, you say? Indeed it is. That is why we have XS, the glue language, and its many tools. To make your life easier. Now that you understand how the basic interaction between your Perl and C code works, you can safely reach for the more convenient tools without worrying too much about the amount of magic under the hood.


[1] Apologies for hand-waving here. With "works", I mean that it has the same C function signature and when it's invoked, it will play nice with Perl and do all the things that Perl expects of a C function invoked via the extension mechanism. In XS jargon, such a function is called an "XSUB". I'll use that convention in the rest of the text.

[2] Technically, it's compiled out if your perl was not compiled with `MULTIPLICITY` support (a superset of ithread support), that is, if it does not support having multiple Perl interpreters running in the same process. Also, for completeness, let me point out that `pTHX` and `pTHX_` are accompanied by `aTHX` and `aTHX_`, the symmetrical incantations you need on the callsite of the function that has a `pTHX` in its signature.

[3] A previous article on this blog discussed debugging memory access problems and also gives a brief introduction into how Perl does reference counting.

[4] Technically, I'm oversimplifying here. But to be entirely honest, I'd have to introduce the "save" and "scope" stacks as well. But that doesn't actually have much practical relevance for what we're trying to do. So please excuse the minor inaccuracy.

Custom Browser Extensions for Front End Developers


Booking.com is displayed in dozens of languages all over the world. We also support over fifty currencies. You can browse prices on our site in either the accommodation's preferred currency or a currency of your choice. While we try our best to guess your preferred language and currency based on factors like the browser's language and IP address, it's impossible to always get it right. So we allow users to pick any language/currency combination. It's quite possible that a user could be browsing the site in Japanese while using Indonesian Rupiah. The number of such users will probably be a small percentage of our traffic, but at our scale, small percentages are often big numbers.

For a designer or developer trying to get a feature right that's an enormous test matrix to consider.

Enter SilverHammer

Instead of testing every single language/currency combination, we identified a handful of languages and currencies that have often proven to be a challenge for our features. If your feature doesn't break under this group, you can be reasonably sure that it'll handle other language/currency combinations well. To allow designers and developers to quickly check their features and A/B experiments with these languages and currencies, we built a Chrome extension internally known as SilverHammer.

SilverHammer - our internal Chrome extension

Under the hood

At one of our hackathons, it occurred to us that the idea of an extension for internal testing might be useful outside Booking.com. So we started putting together a boilerplate extension that could be extended by others. The Switcher Chrome extension is the result. It allows you to add multiple development, staging, and production URLs and switch between them using keyboard shortcuts or a small popup menu next to the URL in the Chrome address bar. You can have any number of development/staging/production groups, and you are also not restricted to just three URLs.

Switcher configured

Switcher's page action

The extension has four main components:

Options page:

This is where you configure the extension. Since this is just a plain HTML page, you can also use third-party JavaScript libraries. For example, for Switcher, we use AngularJS, which supports templating and two-way data binding. This means you don't have to write a lot of tedious code to wire data between the DOM and your internal data structures.

When you load the options page for the first time, you get a simple UI:

Switcher's option page

The UI is bound to the following data:

    groups = [{
      'params':   false,
      'websites': [
        { 'shortcut': '', 'url': '' },
        { 'shortcut': '', 'url': '' },
        { 'shortcut': '', 'url': '' }
      ]
    }];

When you click the Add Group button, a new element is added to the groups[] array. Similarly, when you click the Add Item button, a new element gets added to the websites array under the respective groups element. Clicking the Save button stringifies groups[] to a JSON string and puts it into the Local Storage. It also generates a couple of supporting data structures to make the retrieval of a group corresponding to the user's current URL easy.

The params flag controls if the querystring parameters should be passed between the URLs in a group. For a typical set up that includes a development, staging, and production environment, this should be true.

Background page:

This page subscribes to the new tab notification and checks to see if the current URL is in one of the groups of URLs configured by the user. If it is, then it enables the Page Action icon in the address bar that allows users to switch between sites in a group. This page also loads the user settings from Local Storage and allows other components of the extension to query the settings rather than each component loading its own copy.

Page action:

This is the main UI that the users interact with. It queries the background page for the settings and builds a simple UI with a list of links to the URLs in a group. One caveat: links don't work inside a page action. You need to handle the clicks and explicitly open them.

    window.onclick = function(event) {
      if (event.target.nodeName == 'A') {
        if (event.target.href) {
          if (tabId) {
            chrome.tabs.update(tabId, { "url": event.target.href });
            window.close();
          }
        }
      }
    };

Content script:

Content scripts allow you to inject code into the current page, manipulate its DOM and so on. Think of it as Greasemonkey specific to your extension. For example, SilverHammer injects links to our internal accommodation editing tools when you are visiting a particular accommodation page (you can configure whether it should do so from the options page). The content script in Switcher (switcher.js) registers a handler for the keypress event. It checks if a keystroke corresponds to a URL in a group and switches to it.

Limitations

Right now the extension allows you to either carry the URL parameters between websites or carry nothing.

The extension also doesn't support adding a port or username@password to the URL.

The future

Eventually, we'd like to explore the ability to switch the user agent when switching sites, or cycle through user agents for the same site. This is a front-end development workflow that is becoming increasingly common. And while there are extensions that do a great job of switching just the user agent, we think there is some value in unifying it with URL switching on top of the development/staging/production switching workflow.

We hope you'll find this extension a useful starting point for your projects.

Get the code

You can find the code for the Switcher extension discussed in this article on GitHub.

[edit 2014-01-31]: Some small edits to better clarify the option page and the problems we were solving

A Faster Perl Runtime in Tiny Steps


Booking.com uses the Perl programming language heavily across its entire technical infrastructure. At the size of our infrastructure, even small performance improvements can translate into hefty savings, not to mention the exciting new features that we will implement with those spare CPU cycles. For this reason, we are very happy to announce that we are now funding Dave Mitchell, prominently known for his many years of high-quality contributions to the Perl language implementation, to improve the Perl runtime's performance one small step at a time.

The initial work started in February. Dave is working at the pace of his choosing and sharing some details on his progress on a monthly basis. The summary for February has already hit the development mailing list. Dave's summaries are necessarily concise as they're intended for experts. I'll expand on some of the changes here.

Dereferencing Overloaded Objects

Perl's support for overloading many common operations has ramifications throughout much of the runtime. Part of the strategy for keeping the overhead of the proliferating overload checks down is to encode whether there is any overloading into a flag on each value, which is inexpensive to check. In recent versions of Perl, bug fixes have exacerbated the performance impact in certain situations. In a nutshell, when a reference carries overloading for one operation (such as stringification) but not for another (such as dereferencing), the latter operation should not be slowed down significantly. Dave's work has improved the dereferencing of objects (like $obj->{foo}) which have overloading. The other operations no longer have such a dramatic performance penalty.

This means that we can have the more faithful overload support of recent Perls and have it be efficient, too.
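
For illustration, here is the kind of code affected; the class is hypothetical and overloads only stringification, and the hash dereference on the last line is the operation whose overhead the change reduces:

    package Tag;
    use strict;
    use warnings;
    use overload '""' => sub { 'Tag(' . $_[0]->{name} . ')' };

    sub new { my ($class, %args) = @_; return bless {%args}, $class }

    package main;
    my $obj = Tag->new(name => 'example', foo => 42);
    print "$obj\n";           # overloaded stringification
    print $obj->{foo}, "\n";  # plain hash dereference of an overloaded object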

Faster Array Access (Sometimes)

Perl comes with an optimization that makes array accesses faster if done with constant indexes. That means this

$foo[3]

is faster than this.

$foo[$bar]

This likely comes as a surprise to few. Alas, the optimization strikes a run-time/space trade-off in that it only has 8 bits to store constant indexes in. If you think outside the box just a little bit, you'll realize that this is not a huge limitation. Code that does $array[5913] is rare. Code like $array[0], that is, small constant indexes, is overwhelmingly the most common case of constant-index access.

Dave's change is as simple as it is effective. Previously, indexes in the range between 0 and 255 were optimized. Dave changed this to be between -128 and 127. This is a net win because again, $array[-1] is common to get the last element, whereas $array[200] is very rarely seen with a literal constant.
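
In code, the practical effect is roughly the following (assuming the optimization behaves as described above; the exact ranges are an implementation detail of the build):

    my @foo = (1 .. 10);
    my $first = $foo[0];    # small constant index: optimized before and after the change
    my $last  = $foo[-1];   # small negative constant index: now also on the fast path
    my $big   = $foo[200];  # large literal index: presumably outside the optimized range now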

Avoid Doing Meaningless Work in the Runtime

The output of the Perl compiler is an in-memory structure called the OP-tree. It mostly resembles the syntax of the Perl code. In fact it does so much that we can have tools like B::Deparse that recreate something very close to the original source. During a late compilation phase, Perl builds a separate linked list of these OP structures that represents the order in which they will be executed. The optimizer can determine that some of the OPs in this so-called execution chain actually serve no purpose. Depending on their type and location, it may choose to either patch them out of the chain completely (great!) or to replace them with no-ops (OPs of type OP_NULL). The latter happens when the optimizer is not smart enough to figure out all entry points that could lead to executing the soon-to-be-ex-OP.
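
If you want to see this for yourself, the core B::Concise module can dump both the OP tree and the execution chain; nulled OPs typically show up with an ex- prefix in its output (the exact output varies between Perl versions):

    $ perl -MO=Concise -e 'my $x = $y + 1'
    $ perl -MO=Concise,-exec -e 'my $x = $y + 1'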

Dave's work in this area means there are no more OPs of type OP_NULL that are ever executed. That is, any optimization now always manages to apply the more thorough elimination of needless OPs. This change still resides on a branch for testing.

Less Overhead on Common Hash Operations

In a nutshell, this work is going to make common initial operations on hashes more efficient. These are of the "enumerate all entries" variety: keys %hash, values %hash and @array = %hash. My colleague Yves Orton contributed a nice explanation of the work that Dave has put into this. Dave and Yves have been cooperating on this quite extensively:

Dave's keys() optimization is much deeper than meets the eye. Understanding what it does requires a little insight into how Perl's hash data structures are organized internally and a bit of their history.

Historically a hash was a fairly complex structure, which included preallocated storage for iterator state. This meant they were large. At some point it was realized that many hashes are never iterated. This meant that we could lazily allocate the iterator structure. The way we chose to do this was to embed it at the end of the bucket array, forcing us to realloc() the bucket array on "first iteration". This has the advantage that hashes which are not traversed are smaller in memory. The trade off is that the realloc() can be slow.

Dave's patch takes advantage of the observation that there is no need for per-hash iterator state when running keys() and similar operations. We traverse the whole hash in one go, so the iterator need not be stored against the hash, and instead a preallocated structure can be used by keys() for its purposes.

Until you consider hash order randomization, things are really that simple. But it so happens that keys() and each() must produce the same output order. The per-hash randomization data is stored in the iterator structure, so there is no way for the optimized keys() to guarantee the same order as a subsequent each() would (vice versa is not a problem as each() must create the iterator struct). The solution is to figure out how to store the randomization data in the base structure.
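
The constraint is easy to state at the Perl level: for any given hash, keys() and a subsequent each() loop have to walk the entries in the same order, so whatever shortcut keys() takes must agree with the iterator state that each() will create.

    use strict;
    use warnings;

    my %h = (a => 1, b => 2, c => 3);

    my @from_keys = keys %h;

    my @from_each;
    while (my ($k, $v) = each %h) {
        push @from_each, $k;
    }

    print "@from_keys\n";
    print "@from_each\n";   # must be the same order as @from_keys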

Unfortunately, this isn't so easy. All members of the existing structure appear fully used, so we need to figure out how to squeeze more bits of data in without expanding space usage needlessly. Multiple options are still on the table. The structure currently looks like this:

    struct xpvhv {
        HV*         xmg_stash;  /* class package */
        union _xmgu xmg_u;
        STRLEN      xhv_keys;   /* total keys, including placeholders */
        STRLEN      xhv_max;    /* subscript of last element of xhv_array */
    };

At the most abstract what we want to do is add another STRLEN (U32) member called xhv_rand to the structure, but we want to do it without making the structure larger.

Now it turns out that xhv_max contains the size of the bucket array, and that is always a power of two. That in turn means it uses 32 bits to store 5 bits worth of information (for performance). So at the cost of some arithmetic we can squeeze out 27 valuable bits of space. xhv_keys contains the total number of keys, which in the current implementation will always be smaller than xhv_max, so hypothetically we could somehow store both in the same field. There is also the observation that the random data need only be as long (in terms of bits) as xhv_max is. This means that if we could ensure that any hash with more than K buckets must have a preallocated iterator structure, then for smaller hashes we can store the max (or keys) and random data in the same struct member.

An entirely different solution would be to use a hash of the address of the data structure as the initializer to xhv_rand(); this, however, has the potential to reduce security.

What was worked on was the logic to guarantee that hashes of a certain size have a preallocated iterator structure, plus the logic to pack the keys and randomization data into the xhv_keys field when the number of buckets is smaller than 2**16. This also included an implementation of hashing the data structure address as the initializer.

The changes discussed here will not land in the main development branch (blead in perl parlance) before the Perl 5.20 release.

More Compact Code for Leaving Scopes

When leaving a lexical scope, perl has to do a fair amount of work such as reference count related book keeping on things that have gone out of scope. This is a very hot code path (the leave_scope() function in scope.c). Dave tweaked part of this function and it now has smaller object code (97 bytes less). The new version skips a lot of tests for "simple" lexical variables going out of scope.

Micro-Optimizations for Perl Function Calls

Dave worked on micro-optimizing the implementation of the entersub OP, that is the OP that executes the majority of logic related to Perl function calls. The common code paths of this hot function have been tweaked, and branch prediction hints added where appropriate. The changes reduce the object size of the function by 11%.

Summary

Most of the above changes are small tweaks, but they are an excellent start to the project. I'm incredibly happy that Dave is willing to work on Perl for us and to everyone's benefit.


MySQL 5.6 GTIDs: Evaluation and Online Migration


At Booking.com, we use MySQL to operate our web site, where tens of millions of pages are viewed and over 625,000 room nights are reserved every day. This is a lot to ask of our database infrastructure. To be able to sustain such a load, we split, shard and replicate our data extensively. Unfortunately, a conventional master-slave setup has a problem at our scale, and that problem is specifically around failing over master databases and repointing slaves. Luckily, there are some advancements in the MySQL space that can help us achieve the flexibility we need. And like any new technology, there is always a drawback.

MySQL 5.6 and MariaDB 10.0 implement a feature called Global Transaction IDs (GTIDs) to provide a facility for easily repointing a slave to a new master. In previous versions, doing this was fairly error-prone and not always possible without lots of manual work. This feature could save us quite a bit of headache, except that MySQL 5.6 GTIDs lack a reasonable deployment strategy (bug#69059). MariaDB 10.0 does not have this limitation, but since a large part of our infrastructure is MySQL we need to be able to deploy these changes with a minimal amount of trouble. Oracle is working on allowing GTID_MODE=ON to be set online (worklog#7083), but we could use this feature a lot sooner than MySQL 5.7.

Before going further, let's briefly describe replication with GTIDs in MySQL 5.6. When GTIDs are disabled (the default configuration), all transactions are ANONYMOUS, meaning there is no global identifier; instead, each transaction is identified by file and position only. When GTIDs are enabled, each transaction is assigned a globally unique identifier stored in the binary logs of the master. When the slave downloads the binary logs and executes the transactions, it remembers their GTIDs. If we then need to repoint the slave, the GTIDs stored on the slave are sent to the new master, which uses them to determine what transactions need to be sent downstream.
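
As a quick illustration of that repointing step with GTIDs enabled (host names and credentials below are placeholders, not from our setup), the slave only has to ask for auto-positioning and the new master works out which transactions are missing:

-- On the slave being repointed (MySQL 5.6 with GTID_MODE = ON):
STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST          = 'new-master.example.com',
  MASTER_USER          = 'repl',
  MASTER_PASSWORD      = 'pass',
  MASTER_AUTO_POSITION = 1;   -- send our GTID set and let the master fill in the gap
START SLAVE;
-- Retrieved_Gtid_Set and Executed_Gtid_Set in SHOW SLAVE STATUS show the progress.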

When a new replication tree is built using MySQL 5.6, it can be initialized with GTIDs enabled. However, with an existing replication tree (with GTIDs disabled), enabling GTIDs requires a step where all databases are stopped: a GTID enabled database was not designed to replicate from/to a GTID disabled database. To enforce that, a check is done at slave start up validating that the master GTID mode is compatible with the slave mode. This is a serious hurdle.

We have lots of very large replication trees. We cannot stop all databases in a tree simultaneously without a severe negative impact to our customers. Plus, before we can enable GTIDs everywhere, we need to evaluate the feature and train ourselves on its usage. And it requires adjusting our automation and monitoring systems to be GTID aware.

That isn't to say we aren't interested in the feature, but there had to be a better way to try it out and also deploy it without killing our business. We believe we have found that solution.

While testing GTIDs in our lab environment, we were able to replicate ANONYMOUS transactions to a GTID enabled slave. This can lead to unexpected results because each slave would generate different GTIDs for the same transaction. We reported this as a bug (#71527). However, after opening this bug, we realized that this behavior can be used to our advantage to evaluate and migrate to GTIDs. Basically, we want to introduce an intermediate master to generate GTIDs in a replication tree. We submitted a feature request specifically for this mode of operation (new GTID mode: ANONYMOUS_IN-GTID_OUT (#71543)). After some deeper investigation, we realized that this mode can be implemented with a very small patch to the MySQL source. All we needed to do was disable the GTID mode compatibility check on the intermediate master.

From the root of the MySQL source tree, in the file sql/rpl_slave.cc, there is a function called get_master_version_and_clock(). In this function, the following code is responsible for checking the slave and master having compatible GTID modes:

if (mi->master_gtid_mode > gtid_mode + 1 || gtid_mode > mi->master_gtid_mode + 1)
{
  mi->report(ERROR_LEVEL, ER_SLAVE_FATAL_ERROR,
             "The slave IO thread stops because the master has "
             "@@GLOBAL.GTID_MODE %s and this server has "
             "@@GLOBAL.GTID_MODE %s",
             gtid_mode_names[mi->master_gtid_mode],
             gtid_mode_names[gtid_mode]);
  DBUG_RETURN(1);
}

Commenting out this block of code results in a mysqld that behaves in the new ANONYMOUS_IN-GTID_OUT mode when its GTID mode is set to ON.
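
For illustration, the shape of such a change could be as simple as compiling the check out; this sketch is not the exact patch we deployed:

/* sql/rpl_slave.cc, get_master_version_and_clock(): skip the GTID mode
 * compatibility check so this server can accept ANONYMOUS transactions
 * from its master while writing GTIDs to its own binary logs. */
#if 0
  if (mi->master_gtid_mode > gtid_mode + 1 || gtid_mode > mi->master_gtid_mode + 1)
  {
    /* ... original error reporting and DBUG_RETURN(1) ... */
  }
#endif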

Let's see how this new mode can be used. We will work on the replication tree below where A is the master and B to F are slaves. For simplicity, we suppose that binary logs and enforce_gtid_consistency are already enabled on all databases.

-----
| A |
-----
  |
  +------+------+------+------+
  |      |      |      |      |
  V      V      V      V      V
-----  -----  -----  -----  -----
| B |  | C |  | D |  | E |  | F |
-----  -----  -----  -----  -----

The first step to evaluate/migrate to GTIDs is to restart B in the new GTID mode ANONYMOUS_IN-GTID_OUT. For us, that means deploying our patched mysqld and enabling GTIDs on B. Then, C can be migrated under B using the following commands:

#!/bin/bash

master=b.example.com
slave=c.example.com

# Stop the slave replication and note its position relative to its master.
s_sss=$(ssh $slave mysql <<< "STOP SLAVE; SHOW SLAVE STATUS\G")
s_RMLF=$(awk '$1 == "Relay_Master_Log_File:"{print $2}' <<< "$s_sss")
s_EMLP=$(awk '$1 == "Exec_Master_Log_Pos:"{print $2}' <<< "$s_sss")

# Wait for the new master to be at the same level as the slave (or ahead),
#   stop its replication, note its position, and start back the replication.
m_sss=$(ssh $master mysql <<< "SELECT MASTER_POS_WAIT('$s_RMLF', $s_EMLP);
                               STOP SLAVE; SHOW SLAVE STATUS\G START SLAVE;")
m_RMLF=$(awk '$1 == "Relay_Master_Log_File:"{print $2}' <<< "$m_sss")
m_EMLP=$(awk '$1 == "Exec_Master_Log_Pos:"{print $2}' <<< "$m_sss")
m_EGS=$(awk '$1 == "Executed_Gtid_Set:"{print $2}' <<< "$m_sss")

# Restart the slave replication until the stopped position of the new
#   master and wait that this position is reached.
ssh $slave mysql <<< "START SLAVE UNTIL MASTER_LOG_FILE = '$m_RMLF',
                                        MASTER_LOG_POS  = $m_EMLP;
                      SELECT MASTER_POS_WAIT('$m_RMLF', $m_EMLP);"

# Reset the slave replication, restart the database with GTID_MODE=ON,
#   initialize GTID_PURGED, change its master and start back replication.
ssh $slave mysql <<< "STOP SLAVE; RESET SLAVE ALL;"
ssh -n $slave sed -i -e "'/^GTID_MODE/s/OFF/ON/'" /etc/my.cnf
ssh -n $slave /etc/init.d/mysql restart
ssh $slave mysql <<< "SET GLOBAL GTID_PURGED='$m_EGS';
                      CHANGE MASTER TO MASTER_HOST          = '$master',
                                       MASTER_USER          = 'repl',
                                       MASTER_PASSWORD      = 'pass',
                                       MASTER_AUTO_POSITION = 1;
                      START SLAVE;"

After repeating the commands above with D, we have the new replication tree below where GTIDs are partially enabled:

  • WRITES on A are propagated to B, E and F,
  • B generates GTIDs for those transactions,
  • the generated GTIDs are propagated to C and D.
    -----
    | A |
    -----
      |
      +--------------------+------+
      |                    |      |
      V                    V      V
    -----                -----  -----
    | B |                | E |  | F |
    -----                -----  -----
      |
      +------+
      |      |
      V      V
    -----  -----
    | C |  | D |
    -----  -----

After evaluating GTIDs on B, C and D (and learning about how to operate them in production, including handling of unexpected problems), enabling GTIDs everywhere is straightforward:

  • Move E and F under B (see the commands above),
  • Point the WRITES to B,
  • Move A under B.

The resulting replication tree is the following:

-----
| B |
-----
  |
  +------+------+------+------+
  |      |      |      |      |
  V      V      V      V      V
-----  -----  -----  -----  -----
| A |  | C |  | D |  | E |  | F |
-----  -----  -----  -----  -----

B can stay in the mode ANONYMOUS_IN-GTID_OUT as it is not a slave anymore. However, if we want to have A as the master and/or all nodes in pure GTID mode, we can:

  • Move C through F under A,
  • Point the WRITES to A,
  • Restore original binaries on B,
  • Move B under A.

And that is how we've been evaluating MySQL 5.6 GTIDs. In some of our replication trees we've deployed a patched mysqld that has ANONYMOUS_IN-GTID_OUT mode enabled. And under those intermediate masters, we've deployed unpatched mysqld with GTIDs enabled.

However, we are not quite ready to completely switch our mission critical databases and replication trees over to GTIDs. We need to continue evaluating to be sure we have not missed something important. Nonetheless, we are very happy to have unlocked a path to online migration.

If you want to know more about the cool things we do with MySQL at Booking.com and if you are at Percona Live at the beginning of April, we are giving a talk on Data materialisation for high-performance Web front-ends. You can also come to meet us at booth #318. We are also looking for smart people to work with us on MySQL and help us solve deeper problems similar to implementing GTIDs, see our careers page for more details.

Automating CSS sprites for large organisations


CSS sprites have come a long way in the ten years since Dave Shea first wrote about them for A List Apart, way back in 2004.

An example of an early Amazon sprite. Amazon was an early proponent of CSS sprites.

A CSS sprite is the technique of combining multiple images into a single image, and selectively displaying only parts of that image using the CSS background-position property. It was initially used mostly to make :hover states load quicker than the then commonly-used JavaScript onmouseover() equivalent. But it quickly became common practice for sites to bundle all of their icons and decorative images into a single optimised “sprite”. As well as faster rollovers, it also improved site performance, since downloading a single sprite is almost always faster than downloading each individual image separately.

The drawback with using CSS sprites was the extra work involved in laboriously measuring and transcribing the coordinates for each different element of your sprite. For each icon you needed to know its position within the sprite, and often its width and height as well. This also meant that removing individual icons was a painful process, as any change to the layout of the sprite meant recalculating all of those numbers.

While this approach might have been sustainable when developing a relatively small site on your own, it doesn’t really scale up when working in large teams. At Booking.com we have dozens of designers working across many different parts of the site, so we need a solution that is as resistant to mistakes as possible. We need automation.

Automation

In the last couple of years there have been incredible leaps forward in automating many of the common tasks that web developers have been used to doing by hand. We can now choose from ready-rolled templates like HTML5 Boilerplate and Rock Hammer. Or we can use entire UI libraries such as Twitter Bootstrap or Zurb Foundation. And we even have CSS pre-processors like LESS and SASS. This move towards automation culminated in task runners such as Grunt and newcomer Gulp, which allow developers to write and run very simple tasks to automate away much of the boring, repetitive parts of their job.

For managing CSS sprites, there are several Grunt tasks out there that we could choose from. Some of the most popular include:

Most of the sprite tools share many of the same configuration options, allowing you to specify source and destination folders, CSS class names, the space you want to leave between images, and the packing algorithm to use. Some of the more advanced ones offer the ability to output both SVG and PNG sprites, and @2x retina-ready sprites. The CSS output can often be specified as plain CSS, or in LESS or SASS format.

When considering a solution that would be useful for a large team, we had a few specific requirements in mind:

  • It must work if we only have PNG files as input since not all web designers are comfortable working with SVG files, or have a license for a vector-graphic application
  • It must enable automating pseudo-classes like :hover and :active
  • It must be possible to integrate into our existing build system

While Grunt tasks are fun to play with, none of the ones we looked at satisfied all of our requirements.

Glue

Glue is a command-line only tool that is highly configurable and offers all of the features we were looking for. It accepts a folder or multiple sub-folders of PNG files as input. Generating pseudo-classes is handled through file-naming. For example, if you have two files named “foo.png” and “foo__hover.png”, the generated CSS will contain the :hover rule for your .foo class.

The default settings for Glue take a source directory full of images and output a sprite and a set of CSS rules based on the file names of the icons:

.sprite-source-foo,
.sprite-source-bar,
.sprite-source-baz {
    background-image: url('source.png');
    background-repeat: no-repeat;
}
.sprite-source-foo {
    background-position: 0 0;
    width: 25px;
    height: 25px;
}
.sprite-source-bar {
    background-position: 0 -25px;
    width: 35px;
    height: 15px;
}
.sprite-source-baz:hover {
    background-position: -37px -12px;
    width: 12px;
    height: 12px;
}
.sprite-source-baz {
    background-position: -25px -12px;
    width: 12px;
    height: 12px;
}

As you can imagine, this output can get quite big when working with large numbers of images. There are better ways to write those CSS declarations, especially that first line. Luckily, one of the configuration options Glue offers is the ability to specify a Jinja template to use when generating the style sheet. Jinja is a simple Python templating engine. This allowed us to reduce the size of the resulting rules dramatically, and also add comments to warn other users that the file was auto-generated:

/* This file is generated by an automatic script.
   Do not attempt to make changes to it manually! */
.sprite {
    background-image: url('/path/to/sprite.png');
    background-repeat: no-repeat;
}
.foo {
    background-position: 0 0;
    width: 25px;
    height: 25px;
}
.bar {
    background-position: 0 -25px;
    width: 35px;
    height: 15px;
}
.baz:hover {
    background-position: -37px -12px;
    width: 12px;
    height: 12px;
}
.baz {
    background-position: -25px -12px;
    width: 12px;
    height: 12px;
}

With this new sprite process in place, we can now create new sprites in just a few simple steps:

  1. Drop a new image into the /source folder.
  2. Run the Glue command to re-sprite the images together and re-generate the CSS.
  3. Add the appropriate markup to the page: <i class="sprite foo"></i>

You can, of course, debate the semantic appropriateness of abusing the <i> element in this way. The benefit of using this type of markup for sprites is that it will be familiar to anyone that has used Bootstrap icons.

Compression

While Glue originally came bundled with the OptiPNG library, it was removed in version 0.9, so it is highly recommended to run the resulting sprite through an optimisation tool before putting it live. There are many to choose from, both online and command line based, including:

Challenges

While many icons and decorative images are fairly simple to drop into a design, there are some challenges when using sprites.

Hovers on parent elements

While Glue provides a simple way to specify the :hover image for an individual icon, it can’t know when you want an icon to change in response to a parent element being hovered, e.g. changing an icon’s colour when the entire <div> is hovered. This common pattern can be addressed through clever manipulation of the Jinja template:

.sprite {
    background-image: url('/path/to/sprite.png');
    background-repeat: no-repeat;
}
{% for image in images %}
{% if image.pseudo %}.sprite.sprite-container{{ image.pseudo }} .{{ image.label }},{% endif %}
.{{ image.label }}{{ image.pseudo }} {
    background-position: {{ image.x ~ ('px' if image.x) }} {{ image.y ~ ('px' if image.y) }};
    width: {{ image.width }}px;
    height: {{ image.height }}px;
}
{% endfor %}

Here we are checking for a pseudo state, and if one is present we add an extra rule that triggers the image change if a parent element with the specific class of .sprite-container is hovered as well. Now we can create markup like this:

<div class="calendar sprite-container">
    <i class="sprite calendar-icon"></i>
    <h2>Calendar</h2>
</div>

When the <div> is hovered, the hover state of the icon will be triggered. A similar trick can be used to implement a ‘selected’ state as well.

Identical images

A harder problem to solve is what to do about duplicate images. If you use the same icon to stand for multiple different things, you either have to use the same class name for all of those things, which is not very flexible when you’re working with data coming from the back end, or put several differently-named identical images in your /source folder and sprite, which is not brilliant for file size. For now we’re using duplicate images, but we continue to investigate alternatives.

Summary

If you want to make the best use of CSS sprites in a large organisation, and for performance reasons you really should, then you’re going to need to make it as easy as possible for everyone that uses them to work with the same centralised source image. Automating the task of adding and removing individual images from the company sprite removes a lot of the hassle and wasted time associated with working with sprites.

More Optimizations in Perl 5.20-to-be


In a recent post on his blog, Matthew Horsfall explained a few of the optimizations to Perl that he contributed recently. On this site, we reported on the fantastic work that Dave Mitchell is doing. The two of them inspired me to try my hand at this optimization game at a recent hackathon at work by improving the Perl optimizer.

But first, let's discuss a bit about how the Perl compiler works and its work products.

The output of the Perl compiler is two-fold. The first is an in-memory data structure that resembles a traditional compiler's Abstract Syntax Tree. This is called the OP tree, since it consists of struct OPs that generally represent some OPeration to be performed. The second is a linked list that connects (some of) these OPs in the order they are to be executed[1].

So for the code my $a = $b + 3, we have this tree

sassign-+-add-+-padsv[$b]
        |     `-const[3]
        `-padsv[$a]

where sassign indicates scalar assignment, and padsv is a lexical variable access. It helps to think of perl as a stack machine, so the linked list for execution would be this sequence (with annotations on what they actually do):

padsv[$b]   (push $b on stack)
const[3]    (push 3 on stack)
add         (pop 3 and $b off of stack, push their sum on stack)
padsv[$a]   (push $a on stack)
sassign     (pop $a and the sum off of stack, set $a)

Things get just a tiny bit more complicated if you consider operations that take a variable number of items as parameters, such as taking a slice of an array. Consider @x[1,2]:

aslice-+-pushmark
       |-list-+-pushmark
       |      |-const[1]
       |      `-const[2]
       `-padav[@x]

The array slice needs two pieces of information: a list of indexes to slice from the array, and the array itself. In Perl, the way to pass an arbitrary number of items to an operation is to remember the stack position (which is what pushmark does[2]) and then simply push the items onto the stack. The recipient simply looks at the most recently remembered stack position and pops its parameters off the stack until it reaches this position. aslice is such an operation that receives a variable number of parameters, so the first thing to do is to execute a pushmark. The execution order of operations here is:

pushmark (outer)
pushmark (inner)
const[1]
const[2]
list
padav
aslice
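
If you want to check the execution order perl actually picks for a snippet like this, B::Concise can print the ops in that order (the exact output format varies between perl versions):

# Print the ops of an array slice in the order perl will execute them:
perl -MO=Concise,-exec -e '@y = @x[1,2]'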

This is where it gets a bit weird. One would expect to simply execute the two consts and the padav now. That would correspond to this tree

aslice-+-pushmark
       |-const[1]
       |-const[2]
       `-padav[@x]

and this execution order.

pushmark
const[1]
const[2]
padav
aslice

The list operation has a very simple purpose. It looks up the most recently remembered stack position. Then it checks which context it was invoked in (context in the normal Perl sense of context). If it was invoked in list context, then it will simply do nothing and leave the items on the stack. If it was invoked in scalar context, it will only leave the last, topmost item on the stack. Let that sink in for a moment.

Indeed, in list context, list simply undoes the effect of one preceding pushmark. The two operations of the inner pushmark and the list cancel each other out entirely. This is one of the places where the OP tree's dual nature as an AST and as a data structure intended for the actual execution of the program shows.

And here is where my optimization comes into play: In Perl 5.20, the optimizer is just a tiny bit smarter now because it can detect list / pushmark pairs that will be a no-op. It modifies the execution chain to not execute them while leaving the AST intact for decompilers and similar tooling. Because the internet has informed me that optimizations without benchmarks to "prove" their effectiveness are meaningless, let's indulge in some micro-benchmarking while investing some effort into making the result somewhat meaningful. What we'll benchmark is simply array slicing @y = @x[3,7].

use 5.14.2;
use warnings;
use Dumbbench;
use Dumbbench::CPUFrequencyPinner;

my $opt   = '/path/to/optimized/perl';
my $unopt = '/path/to/unoptimized/perl';

my $code = <<'HERE';
my @x = (1..8);
my @y;
@y = @x[3,7] for 1..1e6;
HERE

my $db = Dumbbench->new(
    target_rel_precision => 0.005,
    initial_runs         => 20,
);

$db->add_instances(
    Dumbbench::Instance::Cmd->new(command => [$unopt, '-e', $code], name => "before"),
    Dumbbench::Instance::Cmd->new(command => [$opt,   '-e', $code], name => "after"),
);

SCOPE: {
    # Pin CPU frequency down to reduce variability
    my $p = Dumbbench::CPUFrequencyPinner->new;
    $SIG{INT} = sub { undef $p; exit; };
    $p->set_max_frequencies($p->min_frequencies->[0]);
    $db->run;
}

$db->report;
$db->box_plot->show();

Benchmark results for array slicing

This type of graph is generally referred to as a box plot, and it shows the measurement results with additional information on the variability of the repeat measurements. "after" refers to a perl built from commit 7d3c8a6837b55fff, whereas "before" is a perl built from the commit prior to that. This particular example benchmark shows a statistically significant speed-up of 8% in slicing arrays. You can see similar speed-ups in hash slices, certain map blocks, and a few other list-type operations. For a day's work on a very mature code base, that's not too shabby at all!

[1] Technically speaking, there are many such linked lists since a logical operation will create a branch. So it's really a tree. But many stretches of code will be contained in linear sections, so please excuse me for the slight oversimplification.

[2] The name pushmark stems from the fact that the way to remember multiple stack positions in nested operations is done with another stack, the MARK stack. So pushmark does very much what's on the tin.

MySQL Slave Scaling (and more)


At Booking.com, we have very wide replication topologies. It is not uncommon to have more than fifty (and sometimes more than a hundred) slaves replicating from the same master. When reaching this number of slaves, one must be careful not to saturate the network interface of the master. A solution exists but it has its weaknesses. We came up with an alternative approach that better fits our needs: the Binlog Server. We think that the Binlog Server can also be used to simplify disaster recovery and to ease promoting a slave as a new master after failure. Read on for more details.

When having many slaves replicating from the same master, serving binary logs can saturate the network interface of the master as every change is requested by every slave. It is not unusual to have changes that generate lots of binary logs, two examples are:

  1. deleting lots of records in a single statement when using row based replication
  2. executing an online schema change on a large table

In the replication topology described in Figure # 1, producing one megabyte of binary logs per second on M will generate replication traffic of a hundred megabytes per second if we deploy a hundred slaves. This is very close to the limit of a 1 Gbit/s network interface and this is something we see happening in our replication chains.

-----
| M |
-----
  |
  +------+------+--- ... ---+
  |      |      |           |
  V      V      V           V
-----  -----  -----       -----
| S1|  | S2|  | S3|       | Sn|
-----  -----  -----       -----

Figure # 1: Replication Tree with many Slaves.

The traditional solution to this problem is to place intermediate masters between M and its slaves. In the deployment of Figure # 2, instead of having slaves directly connected to M, we have intermediate masters replicating from M with slaves replicating from each intermediate master. When having 100 slaves and ten intermediate masters, this allows producing ten times more binary logs on M before saturating the network interface (ten MB/s instead of one MB/s).

-----
| M |
-----
  |
  +----------------+---- ... ----+
  |                |             |
  V                V             V
-----            -----         -----
| M1|            | M2|         | Mm|
-----            -----         -----
  |                |             |
  +------+ ...     +---- ...     +--- ... ---+
  |      |         |             |           |
  V      V         V             V           V
-----  -----     -----         -----       -----
| S1|  | S2|     | Si|         | Sj|       | Sn|
-----  -----     -----         -----       -----

Figure # 2: Replication Tree with Intermediate Masters.

However, using intermediate masters has its problems:

  1. replication lag on an intermediate master will generate delay on all of its slaves
  2. if an intermediate master fails, all of its slaves stop replicating and they need to be reinitialized [1]

Diving deeper on the second problem in the context of Figure # 2, one could think that, in case of a failure of M1, its slave could be repointed to the other intermediate masters but this is not that simple:

  • S1 replicating from M1 depends on the binary logs of M1,
  • M1 and M2 have different binary logs (they are different databases),
  • manually extrapolating the position of S1 in the binary log of M2 is hard.

GTID can help us in repointing slaves, but it will not solve the first problem above.

It should be noticed that we do not need the databases in the intermediate layer at all: we only need to serve binary logs. And, if the binary logs served by M1 and M2 were the same, we could easily swap each of their respective slaves. From those two observations, we build the idea of the Binlog Server.

Binlog Servers replace intermediate masters in Figure # 2. Each Binlog Server:

  • downloads the binary logs from the master
  • saves them to disk using the same structure as the master (filename and content)
  • serves them to slaves as if they were served from the master

And of course, if a Binlog Server fails, we can simply repoint its slaves to the other Binlog Servers. Even better, since these hosts do not apply changes to a local dataset before serving them downstream, latency is greatly improved in comparison to using actual database servers.
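
Because every Binlog Server stores and serves the same files with the same names and content as the master, repointing a slave is just a matter of reusing the coordinates it already knows (a sketch; host names, credentials and coordinates are placeholders):

-- On a slave that was reading from the failed Binlog Server X:
STOP SLAVE;
-- Relay_Master_Log_File / Exec_Master_Log_Pos from SHOW SLAVE STATUS are still
-- valid because Y serves the same binary logs as X did.
CHANGE MASTER TO
  MASTER_HOST     = 'binlog-server-y.example.com',
  MASTER_USER     = 'repl',
  MASTER_PASSWORD = 'pass',
  MASTER_LOG_FILE = 'mysql-bin.012345',   -- Relay_Master_Log_File
  MASTER_LOG_POS  = 98765;                -- Exec_Master_Log_Pos
START SLAVE;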

We are working in collaboration with SkySQL to implement the Binlog Server as a module for the MaxScale pluggable framework. You can read this blog post by SkySQL for an introduction to MySQL replication, MaxScale and the Binlog Server.

Other use-case: avoiding deep nested replication on remote sites

The Binlog Server can also be used to reduce the problems with deep nested replication on remote sites.

If someone needs four database servers per site on two sites, the topology from Figure # 3 is a typical deployment when WAN bandwidth is a concern (E, F, G and H are on the remote site).

-----
| A |
-----
  |
  +------+------+------------+
  |      |      |            |
  V      V      V            V
-----  -----  -----        -----
| B |  | C |  | D |        | E |
-----  -----  -----        -----
                             |
                             +------+------+
                             |      |      |
                             V      V      V
                           -----  -----  -----
                           | F |  | G |  | H |
                           -----  -----  -----

Figure # 3: Remote Site Deployment with an Intermediate Master.

But this topology suffers from the problems explained above (replication delay on E slows down F, G and H, and F, G and H are lost in case of a failure of E). It would be better if we could set it up like Figure # 4, but that needs more WAN bandwidth and, in case of a disaster on the main site, the slaves on the remote site need to be reorganized into a new tree.

-----
| A |
-----
  |
  +------+------+------------+------+------+------+
  |      |      |            |      |      |      |
  V      V      V            V      V      V      V
-----  -----  -----        -----  -----  -----  -----
| B |  | C |  | D |        | E |  | F |  | G |  | H |
-----  -----  -----        -----  -----  -----  -----

Figure # 4: Remote Site Deployment without Intermediate Master.

Using a Binlog Server on the remote site, we can combine the best of both solutions (low bandwidth usage and no delay introduced by an intermediary database). The topology becomes the following:

-----
| A |
-----
  |
  +------+------+------------+
  |      |      |            |
  V      V      V            V
-----  -----  -----         / \
| B |  | C |  | D |        / X \
-----  -----  -----        -----
                             |
                             +------+------+------+
                             |      |      |      |
                             V      V      V      V
                           -----  -----  -----  -----
                           | E |  | F |  | G |  | H |
                           -----  -----  -----  -----

Figure # 5: Remote Site Deployment with a Binlog Server.

The Binlog Server (X) looks like a single point of failure in the topology from Figure # 5 but if it fails, it is trivial to restart another one. It is also possible to run two Binlog Servers on the remote site as illustrated in Figure # 6. In this deployment, if Y fails, G and H are repointed to X. If X fails, E and F are repointed to Y and Y is repointed to A.

-----
| A |
-----
  |
  +------+------+------------+
  |      |      |            |
  V      V      V            V
-----  -----  -----         / \           / \
| B |  | C |  | D |        / X \-------->/ Y \
-----  -----  -----        -----         -----
                             |             |
                             +------+      +------+
                             |      |      |      |
                             V      V      V      V
                           -----  -----  -----  -----
                           | E |  | F |  | G |  | H |
                           -----  -----  -----  -----

Figure # 6: Remote Site Deployment with two Binlog Servers.

Note that running Binlog Servers does not necessarily mean more hardware. In Figure # 6, X can be installed on the same server as E and Y on the same server as G.

Finally, those deployments (with one or two Binlog Servers) have an interesting property: if a disaster happens on the main site, the slaves on the remote site will converge to a common state (from the binary logs available on X). This makes reorganizing the slaves in a new replication tree easy:

  • any slave can be the new master
  • the binary log position of the new master is noted before sending writes to it (SHOW MASTER STATUS)
  • the other nodes are placed as slaves of this new master at the position noted in the previous step
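
A sketch of that reorganization in SQL (host names and coordinates are placeholders):

-- 1. On the slave chosen as the new master, before any writes are sent to it:
SHOW MASTER STATUS;   -- note File and Position

-- 2. On every other slave (they have all converged to the same state):
CHANGE MASTER TO
  MASTER_HOST     = 'new-master.example.com',
  MASTER_USER     = 'repl',
  MASTER_PASSWORD = 'pass',
  MASTER_LOG_FILE = 'mysql-bin.000042',   -- File noted in step 1
  MASTER_LOG_POS  = 1234;                 -- Position noted in step 1
START SLAVE;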

Other use-case: easy high availability

The Binlog Server can also be used as a high availability building brick. If we want to be able to elect a new master in case of the failure of A in Figure # 7, we can deploy GTIDs or use MHA, but both have downsides.

-----
| A |
-----
  |
  +------+------+------+------+------+
  |      |      |      |      |      |
  V      V      V      V      V      V
-----  -----  -----  -----  -----  -----
| B |  | C |  | D |  | E |  | F |  | G |
-----  -----  -----  -----  -----  -----

Figure # 7: Replication Tree with 6 Slaves.

If we deploy a Binlog Server between the master and the slaves as illustrated in Figure #8:

  • if X fails, we can repoint all the slaves to A
  • if A fails, all the slaves will converge to a common state which makes reorganizing the slaves in a new replication tree easy (explained above)
-----
| A |
-----
  |
  V
 / \
/ X \
-----
  |
  +------+------+------+------+------+
  |      |      |      |      |      |
  V      V      V      V      V      V
-----  -----  -----  -----  -----  -----
| B |  | C |  | D |  | E |  | F |  | G |
-----  -----  -----  -----  -----  -----

Figure # 8: Replication Tree with a Binlog Server.

If we want to combine extreme slave scaling and high availability, we can use the topology of Figure # 9.

-----
| M |
-----
  |
  +----------------+---- ... ----+
  |                |             |
  V                V             V
 / \              / \           / \
/ I1\            / I2\         / Im\
-----            -----         -----
  |                |             |
  +------+ ...     +---- ...     +--- ... ---+
  |      |         |             |           |
  V      V         V             V           V
-----  -----     -----         -----       -----
| S1|  | S2|     | Si|         | Sj|       | Sn|
-----  -----     -----         -----       -----

Figure # 9: Replication Tree with many Binlog Servers.

If a Binlog Server fails, its slaves are repointed to the other Binlog Servers. If A fails:

  • we find the Binlog Server that has the most binary logs (we suppose in this example that it is I2)
  • we repoint the other Binlog Servers to I2 as illustrated in Figure # 10
  • and once all the slaves have converged to a common state, we reorganize the topology
-----
| M | <--- Failed master
-----
                   /--- Most up to date Binlog Server
                   V

 / \              / \           / \
/ I1\ <--------- / I2\ ------> / Im\
-----            -----         -----
  |                |             |
  +------+ ...     +---- ...     +--- ... ---+
  |      |         |             |           |
  V      V         V             V           V
-----  -----     -----         -----       -----
| S1|  | S2|     | Si|         | Sj|       | Sn|
-----  -----     -----         -----       -----

Figure # 10: Converging Slaves after the Failure of the Master.

Conclusion

We have presented a new component that we are introducing in our replication topologies: the Binlog Server. It allows us to horizontally scale our slaves without fear of overloading the network interface of the master and without the downsides of the traditional intermediate master solution.

We think that the Binlog Server can also be used in at least two other use-cases: remote site replication and easy topology reorganization after a master failure. In a follow up post, we will describe another use-case of the Binlog Server. Stay tuned for more details.

[1] Slave reinitialization can be avoided using GTIDs or by using highly available storage for intermediate masters (DRBD or filers) but each of those two solutions brings new problems.

Reusing Bitmap objects on Android


Memory management in Android can be a bit complicated if the application has heavy media demands. The Booking.com Android application had a memory management issue that required some ingenuity to solve in order to give our users the positive experience we wanted to give them. Here is the background on how we solved one specific issue.

We are testing a new feature in our Android application – a fancy photo gallery on the Hotel details screen:

Unfortunately, this nice feature increased the memory usage by 20%. In addition to the increased consumption, swiping back and forth through the photo gallery caused a noticeable visual stutter. We debugged this quite a bit and found out that the gallery experienced a jitter when the ViewPager loaded the next image because of the garbage collection (GC) process. Our application is quite memory intensive for three main reasons: it shows a lot of photos, our view hierarchies are complex, and the amount of data to process is particularly large for popular destinations. This means every big allocation triggers a garbage collection cycle to get a free chunk of RAM.

This is especially annoying when trying to make the gallery scroll automatically without user action since Android does not bump the UI thread priority to real-time.

While debugging why our app was lagging every time we loaded photos, we took a look at what Logcat said. Basically, we ran short of memory every time we allocated a new Bitmap:

GC_FOR_ALLOC freed 3255K, 20% free 21813K/26980K, paused 62ms, total 62ms
GC_FOR_ALLOC freed 710K, 20% free 30242K/37740K, paused 72ms, total 72ms
GC_FOR_ALLOC freed <1K, 20% free 31778K/39280K, paused 74ms, total 74ms

This is telling us that whenever we allocate a new Bitmap object for a photo it blocks the whole application for around seventy milliseconds. Seventy milliseconds is the equivalent of five skipped frames of animation. If we want to give our users a smooth experience using our application, we need to bring that number down to under sixteen milliseconds.

In order to minimize the impact from repeated triggers of garbage collection, we decided to reuse an existing Bitmap object when decoding a new image. There are some hurdles to using the class, though, because an image we decode into an existing Bitmap must be the same size as the previously loaded image. And of course, the dimensions of the hotel photos we display can vary by hundreds of pixels. Luckily, since Android 4.4, the new image can be the same size or smaller than the old one, which means we can keep a Bitmap object large enough to accommodate every photo we show.
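
For illustration, here is a minimal sketch (not our production code) of decoding into an existing Bitmap via BitmapFactory.Options.inBitmap; the reusable buffer and the byte array are assumed to come from a pool and from the network layer respectively:

import android.graphics.Bitmap;
import android.graphics.BitmapFactory;

// "reusable" is assumed to be a mutable Bitmap at least as large as the image
// being decoded (Android 4.4+ allows "same size or smaller").
final class BitmapReuseExample {
    static Bitmap decodeInto(Bitmap reusable, byte[] jpegBytes) {
        BitmapFactory.Options options = new BitmapFactory.Options();
        options.inBitmap = reusable;   // decode into this buffer instead of allocating
        options.inMutable = true;      // a reused Bitmap must be mutable
        options.inSampleSize = 1;      // no subsampling; required for reuse before API 19
        // Throws IllegalArgumentException if the provided Bitmap cannot be reused.
        return BitmapFactory.decodeByteArray(jpegBytes, 0, jpegBytes.length, options);
    }
}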

To take advantage of this API we implemented a ViewPager adapter which keeps a small pool of Bitmap objects which should have enough memory allocated for decoding our hotel photos. Whenever an ImageView gets scrolled out of the screen, the adapter puts the corresponding Bitmap buffer back to the pool to be used for subsequent photos. Now, using that new adapter required us to resolve a couple of accompanying problems such as Bitmap life cycle management and integration with our networking framework of choice.

Keeping track of Bitmap objects

Managing the memory for Bitmaps is tricky because we have to hook into the GC mechanism. To track usage of our Bitmaps we decided to implement some reference counting.

Here is an interface for a reference counted Bitmap:

package com.booking.util.bitmap;

import android.graphics.Bitmap;

/**
 * A reference-counted Bitmap object. The Bitmap is not really recycled
 * until the reference counter drops to zero.
 */
public interface IManagedBitmap {
    /**
     * Get the underlying {@link Bitmap} object.
     * NEVER call Bitmap.recycle() on this object.
     */
    Bitmap getBitmap();

    /**
     * Decrease the reference counter and recycle the underlying Bitmap
     * if there are no more references.
     */
    void recycle();

    /**
     * Increase the reference counter.
     * @return self
     */
    IManagedBitmap retain();
}

The BitmapPool keeps a collection of Bitmaps available for reuse and creates a new one if there is no Bitmap to serve a request. The pool provides its own implementation of IManagedBitmap, which holds a reference to the BitmapPool that contains it. When client code releases a managed Bitmap, it gets placed back into the pool instead of being immediately disposed. This way the BitmapPool is aware of the leased IManagedBitmap objects and can track their life cycle.

One thing to keep in mind is that our implementation of BitmapPool is not thread-safe. That is okay though, because we always construct and destroy image views on the main thread. If we ever needed to allocate those Bitmaps in background threads, we would definitely need to introduce some synchronization to make it safe for concurrent usage.

Here is what our BitmapPool looks like:

package com.booking.util.bitmap;

import java.util.Stack;

import android.graphics.Bitmap;
import android.os.Handler;

/**
 * A pool of fixed-size Bitmaps. Leases a managed Bitmap object
 * which is tied to this pool. Bitmaps are put back to the pool
 * instead of actual recycling.
 *
 * WARNING: This class is NOT thread safe, intended for use
 *          from the main thread only.
 */
public class BitmapPool {
    private final int width;
    private final int height;
    private final Bitmap.Config config;

    private final Stack<Bitmap> bitmaps = new Stack<Bitmap>();
    private boolean isRecycled;

    private final Handler handler = new Handler();

    /**
     * Construct a Bitmap pool with desired Bitmap parameters
     */
    public BitmapPool(int bitmapWidth, int bitmapHeight, Bitmap.Config config) {
        this.width = bitmapWidth;
        this.height = bitmapHeight;
        this.config = config;
    }

    /**
     * Destroy the pool. Any leased IManagedBitmap items remain valid
     * until they are recycled.
     */
    public void recycle() {
        isRecycled = true;
        for (Bitmap bitmap : bitmaps) {
            bitmap.recycle();
        }
        bitmaps.clear();
    }

    /**
     * Get a Bitmap from the pool or create a new one.
     * @return a managed Bitmap tied to this pool
     */
    public IManagedBitmap getBitmap() {
        return new LeasedBitmap(bitmaps.isEmpty()
                ? Bitmap.createBitmap(width, height, config)
                : bitmaps.pop());
    }

    private class LeasedBitmap implements IManagedBitmap {
        private int referenceCounter = 1;
        private final Bitmap bitmap;

        private LeasedBitmap(Bitmap bitmap) {
            this.bitmap = bitmap;
        }

        @Override
        public Bitmap getBitmap() {
            return bitmap;
        }

        @Override
        public void recycle() {
            handler.post(new Runnable() {
                @Override
                public void run() {
                    if (--referenceCounter == 0) {
                        if (isRecycled) {
                            bitmap.recycle();
                        } else {
                            bitmaps.push(bitmap);
                        }
                    }
                }
            });
        }

        @Override
        public IManagedBitmap retain() {
            ++referenceCounter;
            return this;
        }
    }
}
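
A hypothetical usage sketch (the dimensions and config are arbitrary examples) showing the intended life cycle of a leased Bitmap:

// Created once per gallery, on the main thread.
BitmapPool pool = new BitmapPool(640, 480, Bitmap.Config.RGB_565);

IManagedBitmap photo = pool.getBitmap();   // popped from the pool, or freshly allocated
// ... decode the downloaded image into photo.getBitmap() and show it in an ImageView ...
photo.recycle();                           // back into the pool, not actually freed

pool.recycle();                            // when the gallery screen is destroyed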

Integrating with the networking layer

In the Booking.com Android application we use Volley to issue network requests. By default, image requests are handled by the ImageRequest Volley class which allocates a new Bitmap for every response parsed. And here is the catch. There is no way to override this behavior. We needed to implement a custom ImageRequest (ReusableImageRequest) which takes an IManagedBitmap as a destination for image decoding.

Another problem we ran into involves canceled requests. If we've already passed in an IManagedBitmap to our custom ReusableImageRequest and it gets canceled while in the request queue, we leak memory. We also needed to extend the Request class to add a new life cycle method called onFinished() to handle this case.

With our hacks in place the ViewPager adapter creates a BitmapPool and issues a ReusableImageRequest for every photo, passing an IManagedBitmap instance from the pool. ReusableImageRequest now safely takes ownership of a reusable Bitmap until the request is finished.

In a nutshell, here is the general flow of how we use Volley:

We do not provide the source code for ReusableImageRequest here because it is a substantial amount of code and most of it is replicating the original ImageRequest.

A little bonus

As long as we have a custom ImageRequest implementation we're free to apply another hack to reduce the garbage collection during the gallery scrolling. We can pass a buffer for all the temporary data required during an image decoding through BitmapFactory.Options.inTempStorage. With this option Android will use the provided, pre-allocated memory chunk for all of the image decoder variables.
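
A sketch of what that looks like (the buffer size is an arbitrary example, and "pooledBitmap" stands for a Bitmap leased from the pool above):

// Allocate the decoder's scratch buffer once (e.g. as a field of the adapter) and hand
// it to every decode call, so BitmapFactory's temporary allocations stop churning the GC.
byte[] decodeScratch = new byte[16 * 1024];

BitmapFactory.Options options = new BitmapFactory.Options();
options.inTempStorage = decodeScratch;
options.inBitmap = pooledBitmap.getBitmap();   // combined with the reusable Bitmap from the pool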

Conclusion

We dramatically improved the user experience for our customers by providing a much smoother photo gallery behavior with almost no GC happening. It only required some research and a little bit of ingenuity. We hope our experience will inspire Android developers to create new, exciting applications and improve existing ones.

Check out our application on Google Play!

Goldilocks and the Three Experiments


Once upon a time, during a usability test in a low and Nether-land, a seemingly meaningless pinch of the fingers by a middle-aged Dutch man named Geert sparked a series of three experiments by a girl with golden hair.

While looking through the search results on the tablet website of Booking.com, Geert decided he wanted to see the hotel photos in a little more detail, but when he zoomed in all he got was a pixelated mess of colors! This observation lit a fire in Goldilocks, and she enthusiastically thought,

“Geert and other customers definitely want high-quality hotel photos that are beautiful and sharp!”

Viewing photos on the iPad website


My mission became as clear as an SVG icon on a retina-quality screen. I would fight for Geert and his desire to zoom. And so, during the team’s next planning meeting, I included a task to improve the experience of how the photos on the search results and landing pages were displayed.

Three of the different size property photographs

I began by digging through the treasure chest of goodies we refer to as the “tag dump” to see what sizes of images were available to use.

At the time, on both the search results and the landing pages, the website displayed performance-optimized 150 by 150 pixel hotel photos at their original dimensions, with ten pixels of padding on each side.

The 150 pixel square photograph shown in search results

I discovered that images in other dimensions were also available: a maximum height and width of 500 pixels, a maximum height and width of 300 pixels, and a 200 pixel square.

First, I coded the square 200 pixel images in place of the square 150 pixel versions and reviewed the quality of these slightly larger images on a variety of tablet devices at different resolutions.

“Square200 photos are still much, much too blurry,” said Goldilocks. “That doesn’t solve the problem at all!”

So I moved on to the 300 pixel tall/wide images. They looked fine and appeared to be more nicely cropped, but they still weren't amazing.

The 300 pixel photographs shown in a search result

At this point, my designer-know-it-all sensibilities wanted something really, really perfect looking. After all, Geert and our other customers deserved the best that was available.

When I coded up the 500 pixel photos and refreshed the page, I was blown away. I was certain that these photos were the best solution to the problem.

“Wow!” said Goldilocks. “500 pixel photos are cropped so nicely and are so sharp they will cut through the glass of the iPads and jump out at Geert and our other customers!”

Of course, I realized that 500 pixel photos have much larger filesizes than square 150 pixel photos and that this fact would inevitably increase the time it took for the page to load. So I decided to experiment with both the 500 pixel and the 300 pixel versions, and to keep a very close eye on how the page load time was affected.

I coded up two different experiments. One, on the search results page, replaced the square 150 pixel images with 500 pixel ones. The other, on the landing pages, replaced the square 150 pixel images with 300 pixel ones.

After running both experiments for a couple of weeks the results were finally in. Both of the experiments were absolutely, conclusively, and beautifully neutral. Our customers seemingly didn't mind.

I went straight to the analytics to understand what went wrong. Since I initially thought that page load time would be a key factor in these experiments, that was the first thing I checked. I found that the page load time on search results using the 500 pixel photos increased by more than half a second.

I wasn’t horribly surprised. What actually surprised me was that the experiment still ended up being neutral despite the incredible performance hit. That discovery led me to wonder whether higher quality images could, at the very least, balance out the negative impact of the increased page load time.

I moved on to the landing pages to see how page load time was affected there. The increase in page load time was negligible, but the results were still neutral. This baffled me. I stepped back from it for a couple of days and let everything subconsciously marinate.

After a few days, I was working on something in the search results completely unrelated to images when it hit me. This ten pixel padding around these photos serves absolutely no purpose. I’m going to get rid of it to make the images a little larger and clearer. It was the epiphany I needed.

The search results photos with unnecessary padding indicated

Excitedly, I mentioned the idea to my design friends.

Phil Hammel, Senior Designer @ Booking.com

Phil suggested, “You could use ‘background-size: cover’ to make the images fit even better! They can get bigger or smaller depending on the size of the block.” Brilliant!
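
In CSS, that suggestion boils down to something like this (the class name and image path are illustrative, not our production markup):

.sr-photo {
    background-image: url('/images/hotel/max300/12345.jpg');
    background-repeat: no-repeat;
    background-position: center;
    background-size: cover;   /* fill the block, cropping rather than distorting */
}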

Catalin Bridinel, Senior Designer @ Booking.com

And Catalin said: “Decreasing the overall height of the content blocks is typically a good thing from what I’ve noticed.” Great insight!

Erin Weigel, Senior Designer @ Booking.com

Finally, Goldilocks said: “And if I use the 300 pixel tall/wide photos the pictures will be cropped more nicely, but page load time shouldn’t be a big issue!”

And that was it. I coded up the experiment with smaller content blocks and the higher-quality images. Then I released it to customers. After a couple of weeks the results were conclusive. It was positive!

The search results photos shown without padding

“150 pixel squares were much too blurry. 500 pixel wide photos were far too heavy. But maximum dimensions of 300 pixels were just right!” declared Goldilocks as she enabled the experiment for all customers.

Reflecting on the process that led Goldilocks and her friends to the point of solving Geert’s problem, I realized that the initial “ideal solution” of using the 500 pixel photos, chosen simply because I liked the way they looked best, would have ultimately been useless or even actively harmful. Without measuring how much our customers approve or disapprove of our application through A/B testing, and without the collaboration of my peers, I might have easily just changed the image quality based on my instincts and opinion.

Good design at Booking.com isn’t just about making something that seems to be better. It’s about measuring results, understanding the impact on the full customer experience, and actually making something better.

Collaboration is a key part of that process. Measuring our customers' happiness via A/B testing data is another key part. And finally, having incredible colleagues that contribute by providing their own insights and potential solutions to the problems provides the last key part. The ability to abandon ego, truly listen, and process this feedback is all part of the fairy tale magic.

(Goldilocks would like to thank Phil Hammel and Catalin Bridinel for always humoring her ridiculous conversation topics about design-and-the-like and offering incredible feedback, solutions, and insights. She’d also like to thank “Geert” wherever he may be.)

Would you like fries with that?



The Hamburger Icon.

Hamburger

The Hamburger Icon – it’s everywhere. All around you. In web apps, mobile sites, desktop sites, and computer software. The ubiquitous three-line icon is now so common it feels as if it is universally understood to mean "Navigation Menu." But is it?

Recently, discussion on the efficacy of the hamburger icon has reached new heights. Articles by prominent designers, and on several sites including The Atlantic, TechCrunch, The Next Web and the Nielsen Norman Group, have concluded that it is a UX antipattern: a trendy, simple-to-implement icon that regressed from simpler and more expressive alternatives. But antipattern or not, the use of the icon has ballooned, making it an almost ubiquitous fixture on most websites, particularly when presented on a small screen.

Given my position as a designer on the m.booking.com team, and our usage of this icon with its accompanying slide-out "drawer" menu, I decided to investigate the subject. I started by examining the origin of the icon to try to understand its path to infamy.

“The hamburger icon is a classic. Even if you don't know it by that name, its three black bars are as familiar as your mouse's cursor—a constant companion on your cyber journey since the day you got your first computer" - Gizmodo

That sounds promising. But even though the icon is a classic and has been around forever, web designers have been less than consistent in their usage of it. It has been used for lists, dragging and reordering, aligning, and plenty more. Perhaps this misuse goes some way towards explaining the criticism of it as a menu icon. Maybe through this widespread and varied usage it has lost the ability to convey a single metaphor and, in turn, leaves users confused.

This whole line of questioning started me wondering: “Are we wrong, and is everyone else right? Is this hurting our users? Do people actually understand what these little three lines are supposed to represent on our website?”

Long-time readers of this blog will not be surprised to learn that our next step was to test these questions in the form of an A/B test. Like everything else, the hamburger icon was about to be exposed to our vast customer base, which would tell us through its engagement with the menu whether the icon was the best solution. By this time, I had read enough articles and data points to be sure that no consensus existed, and that even if it did, it had not been derived from the behaviour of the customers for whom I design. I decided to follow the method outlined by James Foster and referenced by many, including one of our favourite mobile experts, Luke Wroblewski.

We had previously tested several placements and styles of the icon (with a border, without a border, with an icon, different colors, and so on), but had never tested the word ‘menu’ – a proposition complicated by our desire to test in all of our 41 supported languages. Nevertheless, we moved forward, sourced translations from our team of in-house language experts and set up the test:

Base: Our original hamburger menu icon on the left of the header and its right-aligned white border.

The original and infamous Hamburger

Variant: The word ‘menu’ inside a rounded square also aligned to the left.

The Menu in full textual and contextual glory

We ran that experiment against our whole user base, and given the prominence and omnipresence of this UI element, it didn’t take long for this change to be exposed to millions of our customers all over the world, in every supported language and across a multitude of devices.

So what was the final outcome? Would words outdo fast food as it did in James Foster’s experiment, or would the bun-and-patty win out?

Findings

In our experiment, changing the icon to the word "menu" had no significant impact on the behaviour of our users. With our very large user base, we are able to state with a very high confidence that, specifically for Booking.com users, the hamburger icon performs just as well as the more descriptive version.

Of course, we measure “all the things”, and that being the case, we can extrapolate some nice data about which countries or languages, and on which devices, this would have worked best or worst; but on a global scale we can conclude that the much derided hamburger is, on the whole, just as recognisable to our users as the word ‘menu’. In the spirit of data-driven design we should perhaps consider our options and maybe try adding cheese, fries, and a slice of bacon to our hamburger icon, but for now we are happy to report that our three-lined friend is sticking around. Its actual placement, shape, size, position, and color are of course a subject for future tests.

There is a lesson here for all of us on the nature of A/B testing. You are never solely testing a UI element, pattern, or feature. You are testing these things against a very specific user base in a very specific scenario. What works for Booking.com may not work for you and your users. This is the reason we A/B test in the first place, because the findings of others, be they expert opinion, data from other websites, or hypotheses dreamt up in the pub while eating a hamburger, are all unproven until they’ve been tested against our customers, on our platform.

Not to get lost in our own metaphor here, but like a good hamburger recipe, even if you write down all of the ingredients in the exact way I did, you will end up with a completely different burger – this is, of course, influenced by the market, the quality of meat available, the flour used on the bread, and a thousand other factors. In our view, a good idea is only a good idea for Booking.com if we can replicate it exactly on our site and if it works for all our customers.

Our opinion: It’s always worth testing your ideas and seeing for yourself what the data tells you and what questions arise. My advice? Take a bite, see what happens.


Better Crash-safe replication for MySQL


At Booking.com, we use MySQL replication extensively for read scaling. However, because we have so many slaves (we have thousands), sooner or later one of them will fail. In MySQL 5.5 and earlier, replication was not crash-safe. MySQL 5.6 provides crash-safe replication, but only if you change the default configuration. Even then, this safety comes with several downsides. Read on to find out more about crash-safe replication in MySQL 5.6 and learn how to mitigate its downsides.

MySQL replication involves two tasks on a slave: the IO thread and the SQL thread:

  • The IO thread downloads events from the master binary logs and appends them to the relay logs.
  • The SQL thread reads the events from the relay logs and executes them on the local database.

To keep track of its progress, each thread needs to store a state:

  • The IO thread keeps track of the next event to download from the master binary logs using two variables: a filename and a position (Master_Log_File and Read_Master_Log_Pos from SHOW SLAVE STATUS).
  • The SQL thread keeps track of the next event to execute using four variables: the filename and position in the slave relay logs and the two corresponding values in the master binary logs (respectively Relay_Log_File, Relay_Log_Pos, Relay_Master_Log_File and Exec_Master_Log_Pos from SHOW SLAVE STATUS).
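
Both sets of coordinates are visible in the output of SHOW SLAVE STATUS. As a minimal sketch (a Perl monitoring script using DBI; the connection details are placeholders, not our actual setup), they can be read like this:

use strict;
use warnings;
use DBI;

# Placeholder connection details for the slave we want to inspect.
my $dbh = DBI->connect('DBI:mysql:host=127.0.0.1', 'monitor', 'secret',
                       { RaiseError => 1 });

my $status = $dbh->selectrow_hashref('SHOW SLAVE STATUS');

# IO thread state: next event to download from the master binary logs.
printf "IO thread at %s:%d\n",
    $status->{Master_Log_File}, $status->{Read_Master_Log_Pos};

# SQL thread state: next event to execute, in master binary log coordinates.
printf "SQL thread at %s:%d\n",
    $status->{Relay_Master_Log_File}, $status->{Exec_Master_Log_Pos};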

In MySQL 5.5 and earlier, those states are persisted in files (see the master-info-file and relay-log-info-file options for more information about those files). However, the IO thread and the SQL thread cannot make progress and update their state files at the same time (atomically). This means that inconsistencies can be introduced by a crash:

  1. If the IO thread has appended data to the relay logs but has not updated its state file when MySQL crashes, the same event is re-appended to the relay logs after restart.
  2. If the SQL thread has committed a transaction but has not updated its state file when MySQL crashes, the same transaction is re-run after restart.

Those two inconsistencies are the cause of the infamous "Duplicate entry for key PRIMARY" errors when restarting a slave after a crash. This error happens when running an INSERT twice. Other errors can also happen on UPDATE and DELETE statements depending on many factors, the list of which is too long and complex to give here. In the worst cases silent data corruption will occur.

To solve inconsistency #2 above, MySQL 5.6 introduces the option to persist the state of the SQL thread in the database. By setting relay-log-info-repository to TABLE (this is not the default option), the SQL thread saves its state at the same time as it commits a transaction. If the storage engine is atomic (like InnoDB), inconsistency in the state of the SQL thread is prevented.

However, it is not possible to avoid inconsistency #1. The IO thread has no way to synchronize its progress (appending events to the relay logs) and its state (saving data in a FILE or a TABLE). The consequence is that it is impossible, once a crash has occurred, to trust both the content of the relay log and the state of the IO thread.

If we cannot trust those, how can MySQL 5.6 provide crash-safe replication? The implemented solution is to:

  • Initialize the state of the IO thread to the position of the SQL thread (which can be trusted when relay-log-info-repository is set to TABLE and the storage engine is atomic), and
  • Advance the position of the SQL thread to the tail of the relay logs.

In other words, all the existing relay logs are considered unfit for SQL thread consumption. This behaviour is not enabled by default. To benefit from it, the option relay-log-recovery must be set to 1 (the default value is 0).
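
Conceptually, the recovery boils down to something like the following Perl-flavoured sketch (a paraphrase of the two bullet points above, not MySQL source code; the coordinates are made-up placeholders):

# Conceptual sketch only: what relay-log-recovery=1 does at startup.
my %sql_thread = (                                  # trusted: stored in TABLE
    relay_master_log_file => 'mysql-bin.000042',    # placeholder value
    exec_master_log_pos   => 1234,                  # placeholder value
);
my %io_thread;

# 1. Restart the IO thread from the SQL thread's master coordinates.
$io_thread{master_log_file} = $sql_thread{relay_master_log_file};
$io_thread{master_log_pos}  = $sql_thread{exec_master_log_pos};

# 2. Consider the existing relay logs unusable: the SQL thread will only
#    execute events downloaded from the master from this point on.
my @usable_relay_logs = ();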

In summary, to obtain crash-safe replication in MySQL 5.6, one must:

  • Set relay-log-info-repository to TABLE,
  • Set relay-log-recovery to 1, and
  • Use only atomic storage engines.
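
As a minimal sketch, the corresponding my.cnf fragment could look like this (the option names come straight from the list above; everything else is illustrative):

[mysqld]
# Persist the SQL thread state in a table, committed together with the
# replicated transactions (requires an atomic engine such as InnoDB).
relay-log-info-repository = TABLE

# At startup, rebuild the replication state from the trusted SQL thread
# position and discard the existing relay logs.
relay-log-recovery = 1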

More information about MySQL 5.6 crash-safe replication can be found in the manual.

The problems with MySQL 5.6 Crash-safe replication

So far, everything looks good, so what are the downsides? We identified a number of problems:

  1. When setting relay-log-recovery to 1 in the configuration file, the relay logs are considered unfit for SQL thread consumption every time MySQL is started.
  2. After a restart or a crash, if the IO thread had been ahead of the SQL thread, unnecessary traffic caused by downloading binary logs will be incurred on both the master and on the network.
  3. After a restart or a crash, if the master is not available, no progress can be made by the SQL thread, even if there are plenty of unexecuted relay logs on the slave.
  4. After a restart or a crash, if critical binary logs have been purged from the master, the slave will not be able to resume replication.
  5. If we want to keep a copy of the relay logs on the slave by setting relay-log-purge to 0, a restart or a crash could corrupt the stored relay logs.

Problem #1 means that all restarts must be done with extra care. If a lagging slave is restarted, it can trigger problems #2, #3, #4, and #5. Having those problems after a crash could be acceptable, but having those after a simple restart is harder to accept.

Problem #2 is the one that worries us most at Booking.com. As presented in a previous post, the network load on our master is a concern for us. The idea of a master network interface being overloaded as a result of a restart/crash of a slave is hard to accept.

A special case of problem #2 is delayed replication. In the case of a restart/crash of a delayed slave, the amount of binary log that needs to be re-downloaded is proportional to the delay of the slave. We have delayed slaves that store tens of GBs of unexecuted relay logs. A restart/crash of one of those would overload the network interface of the master for minutes. This is unbearable: we must find something better.

Again on problem #2, the restarted/crashed slave could be remote from the master. If the bandwidth of the WAN link is limited, it can be saturated by the slave downloading binary logs from the master. This is also an undesirable effect.

Problem #3 is less of a concern for us, but still not an enjoyable proposition (it might be more of a concern for others).

Problem #4 means that before purging binary logs, a DBA must make sure that they have been executed by all slaves. This is an additional task that DBAs could do without.

Problem #5 is more subtle since it is dependent upon the preference of our DBAs to keep a copy of the relay logs on the slave (setting relay-log-purge to 0). With relay-log-recovery=1, this copy of the relay logs cannot be trusted after a restart or a crash:

  • If, before a restart or a crash, there are unexecuted events in the relay logs, they are appended again to the relay logs after the restart, resulting in duplicate events in the relay logs (though they are not executed twice).
  • If just before an operating system crash some events are executed by the SQL thread but not synced to the relay logs by the IO thread, those events are missing from the relay logs after the restart.

The following bugs/feature requests have been logged to trace those problems:

  • Bug # 74089: Report Relay_Log_File and Relay_Log_Pos on relay-log-recovery.
  • Bug # 74321: Execute relay-log-recovery only when needed.
  • Bug # 74323: Avoid overloading the master NIC on crash recovery of a lagging slave.
  • Bug # 74324: Make keeping relay logs (relay_log_purge=0) crash safe.

Our wishes for MySQL Crash-safe replication

Do not get us wrong: having a way to make replication crash-safe is a great step forward from MySQL 5.5. Still, a few more small steps are needed to reach ideal crash-safe replication. For us, ideal crash-safe replication should:

  1. Not take destructive measures if no crash has happened
  2. Minimize the number of binary logs downloaded from the master on crash recovery
  3. Have consistent relay logs after crash recovery when relay_log_purge=0.

Our wish #2 above can be implemented by adding four variables to the IO thread state. Those variables would indicate the last trusted (synced) position in the relay logs and the corresponding position in the master binary logs. When restarting MySQL, the relay logs would be truncated at that last synced position and the IO thread would be initialized at the corresponding position in the master binary logs. (Note that if the IO thread was stopped correctly, nothing is truncated, which satisfies our wish #1 above.) After the truncation, if the position of the SQL thread is at or before the truncation point (the IO thread ahead of, or at the same position as, the SQL thread), nothing special needs to be done. If the position of the SQL thread is after the truncation point, the behaviour will depend on relay_log_purge:

  • If relay_log_purge is 1, all the previous relay logs are dropped and the IO thread position is set to the SQL position.
  • If relay_log_purge is 0, more work must be done to honour the DBA's wish to keep the relay logs. To do that, the IO thread must be started UNTIL the position of the SQL thread. When the IO thread reaches this position, the SQL thread can be started after updating its corresponding relay log position. This satisfies our wish #3 above.

Our workaround

Until ideal crash-safe replication is implemented in a future version of MySQL, we need to mitigate the problems caused by the current implementation (for our deployment at Booking.com, the biggest problem is unnecessarily downloading binary logs from the master). The solution we are currently investigating is to deploy a Binlog Server on our slaves:

  • The Binlog Server downloads binary logs from the master
  • The slave downloads the binary logs from the local Binlog Server
  • If the slave is restarted (after a crash or a planned restart), it will not put undesired load on the master because the binary logs will be downloaded locally

This way, we avoid the downsides of MySQL relay log recovery by pushing the problem to the Binlog Server. And because making the Binlog Server crash-safe is trivial, we successfully work-around the downsides of MySQL relay log recovery.

But if we are not careful with this approach, we could end up consuming twice the disk space that is necessary by storing binary logs in the Binlog Server directory, and storing the relay logs in the MySQL directory.

The relay-log-space-limit option can be used to solve this problem. By setting this option to twice the max-relay-log-size (as recommended in the documentation), we put an upper limit on the extra space that can be used. But when using this option, care must be taken to set it right, as changing the value will necessitate restarting MySQL [1]. There is a feature request (bug #68236) to make the relay-log-space-limit option dynamic. Hopefully this will be implemented soon.
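
With purely illustrative sizes (not a recommendation for any particular workload), that rule of thumb translates to a my.cnf fragment like:

[mysqld]
max-relay-log-size    = 1G
relay-log-space-limit = 2G   # twice max-relay-log-size, as the documentation recommends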

Conclusion

As promised in our last post, we presented another case for using the Binlog Server. Hopefully, this will be a temporary use that will become obsolete. When a better relay log recovery is implemented in MySQL, we will not need to deploy additional software on our MySQL servers.

We have yet another instance where using the Binlog Server is beneficial. I will present this at Percona Live London in my talk: High Availability, Disaster Recovery and Extreme Read Scaling using Binlog Servers.

[1]: When restarting MySQL to change the option, care must be taken not to restart MySQL with relay-log-recovery=1 as this could cause the unnecessary re-downloading of binary logs. With a local Binlog Server, this becomes less of a concern.

B.Smashing – A Front End evening with Booking.com and Smashing Magazine


On Friday 10th October, Booking.com hosted the first-ever combined Booking.com and Smashing Magazine event, B.Smashing, at our head office in Amsterdam.

For years we have been attending Smashing Magazine events and sponsoring conferences together. However, this was the first time we invited them to our head office with a bunch of web enthusiasts. For us at Booking.com this was a particularly special event.

The event kicked off with some drinks, nibbles, and a meet and greet, after which we enjoyed Zoe Gillenwater’s presentation on “CSS Lessons Learned The Hard Way.” Zoe spoke about the power of failing in order to learn something thoroughly, and how this enables professionals to have an in-depth knowledge of their craft.

“I’ve made plenty of CSS mistakes, but I’ve learned so much from them and think I’m a better web designer because of it. I truly think we all need to take risks and make mistakes in our work from time to time, so that we can learn and grow.”

Zoe’s presentation was followed by Vitaly Friedman, co-founder and chief editor at Smashing Magazine, who spoke about “Responsive Web Design: Clever Tips and Techniques.” His presentation offered insight into his experiences, as well as tips and tools for web design and development in today’s multi-device world. Vitaly also emphasised the importance of having a well-performing website, since responsive designs are often viewed on mobile devices, where data connections are not always the best.

Vitaly Friedman talking about responsive web design

“Responsive design is just a tool, but it’s a very good and powerful tool to help us cope with this multi-dimensional web that we have today.”

We love putting on these events and connecting with our industry peers — especially when we can offer something back to a community we’ve all benefited from so much in our own careers. We will be hosting the Sketch & Framer group in November and are always on the lookout for more opportunities to host such events. Watch this space!

Thanks to Ivan Babko, you can also see a small video about the event!

Distributed Top-N Similarity Join with Hive and Perl Part I


At Booking.com we have a lot of data to manage. Most of the time MySQL does the job, but for very large datasets and heavy analytics we use Hadoop and Hive. Most of the time Hive serves our analytical needs well, but there are some cases when the built-in features are not enough. One such case is similarity analysis.

Similarity Analysis

In similarity analysis we try to quantify the similarity between different objects. This is usually done with a similarity function, which compares the attributes of two objects and scores the similarity between them. There are different ways to define a similarity function, but most of them define similarity as an inverse function of the distance between two objects, where each object is represented as a vector of attributes. That way the similarity problem is reduced to the calculation of distances between vectors in N-dimensional space.

The problem arises when the number of objects we want to compare is very large, because explicitly comparing all possible vector pairs requires a nested loop, which scales as O(n^2). When n is very large, this becomes very slow. That is a problem when you have a large dataset of vectors, need to do a similarity analysis, and need it to run fast.
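
To make the scaling problem concrete, here is a minimal Perl sketch of the brute-force comparison (the toy vectors, the Euclidean distance, and the inverse-distance similarity score are only illustrative placeholders):

my @vectors = ( [1, 0, 3], [2, 2, 0], [0, 1, 1] );   # toy attribute vectors

sub distance {                                        # Euclidean, as an example
    my ($x, $y) = @_;
    my $sum = 0;
    $sum += ($x->[$_] - $y->[$_]) ** 2 for 0 .. $#$x;
    return sqrt($sum);
}

# Brute-force pairwise comparison: O(n^2) calls to distance().
my @similarities;
for my $i (0 .. $#vectors) {
    for my $j ($i + 1 .. $#vectors) {
        my $d = distance($vectors[$i], $vectors[$j]);
        # one common choice: turn the distance into a similarity score
        push @similarities, [ $i, $j, 1 / (1 + $d) ];
    }
}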

There is no canonical solution for making similarity analysis fast. It all depends on the specific case, and every implementation takes advantage of that specific case's properties. The most common approach is to bucket the "likely similar" candidates in a clever way so that the number of comparisons is reduced. The algorithm for sampling candidates is not universal and also depends on the properties of a specific case, so, naturally, there are many different bucketing algorithms, the most common being LSH schemes like MinHash and more recent ones like DISCO and DIMSUM. Furthermore, for many common cases there are open source libraries, like Mahout, Spark, and MLlib, which implement common bucketing schemes and common similarity functions like cosine similarity.
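
As a small illustration of such a similarity function, cosine similarity over two attribute vectors is only a few lines of Perl (the textbook formula, not code taken from any of the libraries mentioned above):

sub cosine_similarity {
    my ($vec_a, $vec_b) = @_;    # two array refs of equal length
    my ($dot, $norm_a, $norm_b) = (0, 0, 0);
    for my $i (0 .. $#$vec_a) {
        $dot    += $vec_a->[$i] * $vec_b->[$i];
        $norm_a += $vec_a->[$i] ** 2;
        $norm_b += $vec_b->[$i] ** 2;
    }
    return 0 unless $norm_a && $norm_b;               # avoid division by zero
    return $dot / (sqrt($norm_a) * sqrt($norm_b));
}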

If your use case is a common one, like collaborative filtering, or document similarity matching based on term co-occurrence, you can use some of the mentioned solutions. However, if your use case is uncommon, then you have to implement the solution yourself.

Our Case

At Booking.com we deal with hotels. Naturally, we do similarity analysis on hotels by comparing vectors of their attributes. An attribute is some characteristic of a Hotel relevant for analysis, like for example geographical location, presence of a pool, or Wi-Fi. Depending on the type of analysis we do, there are different groups we want to compare and different similarity functions we want to use.

For example, for revenue analysis it does not make sense to compare a small Bed & Breakfast to a large hotel with 200 rooms. In this case, we can use the number of rooms as a factor in bucketing hotels to "likely similar" groups. However, for customer experience analysis, the hotel size may not matter that much. Those two properties may be similar in terms of having a low number of complaints, high review scores, good hiking options, and historical ruins near by. Therefore, depending on type of similarity we need a different, custom-made similarity function and a different bucketing scheme. In addition, we need to be able to compare fairly large buckets. In some cases it even makes sense to compare every possible combination. We want our analysts and data scientists to have that freedom when looking for data patterns. This means using a brute force approach in combination with candidate bucketing with freedom to choose the brute force ratio we want to have for a given case.

In the worst-case scenario, we deal with the comparison of every pair, for 500,000 hotels that is 250 billion vector pairs to process. We want that done in a short amount of time so it seems more like an ad-hoc query instead of a long running job.

Considering the power needed, the natural choice for this adventure was our big Hadoop cluster.

The requirements summed up in a short list are:

  1. In order to compare everything to everything, we need to be able to deal with O(n^2).
  2. Since we want the flexibility of comparing different attributes and experiment with different similarity functions without restriction, we need to be able to deal with heavy similarity functions.
  3. Given O(n^2) and a heavy similarity function, the best way currently to make this fast is with massive parallel processing.
  4. We need the solution to be easy to use. Ideally, we want to be able to say: join these two datasets and here is the similarity function to use.
  5. From a pragmatic perspective, since we already have two large Hadoop clusters we want to take advantage of those resources.
  6. Since we have a Hive data warehouse, the similarity join should work on Hive tables out of the box.

The hardest part in this list is making O(n^2) play nice with Hadoop MapReduce, since MapReduce does not fit well into this type of a problem. While we were aware that this is not an ideal situation, there were still major benefits of being able to use the existing infrastructure and tools we already have.

Our Solution

The first thing to realize is that Hive joins are key-based, as they translate to MapReduce jobs, which are also, by paradigm, key-based. A similarity join is not key-based. We are not pairing up records based on key matching; instead we have a custom similarity function, which scores the similarity of every pair of records based on different attributes. Then, in the end result, we want to get the top N similar records. This discrepancy between the nature of a similarity join and a key-based join means that, in theory, Hive and MapReduce are not an ideal choice for this type of problem. However, in practice, our data warehouse is stored in Hive, our analysts and data scientists are using Hive, and the SQL-like syntax of Hive has a large usability advantage over writing custom Java/Scala code. All of this means that it would be very valuable for us to be able to solve this problem in Hive.

Therefore, we wanted to try and see if Hive can be made to play nice with this type of a problem.

These are the different approaches we experimented with.

I. Naive approach:

Try to do a cross join in Hive and use similarity function in a select transform.

It turns out that this approach is dreadfully slow, and the main reason is that Hive by default directs all cross join record pairs to a single reducer, which cannot handle this load by itself.

Hive is designed to work with key based equality joins and the principle is that records which have a key match go to the same reducer. That makes perfect sense when we have a key condition, but in case of a cross join there is no key condition and therefore all possible combinations are a valid join-match. This means they all end up on one reducer. Hive is aware of this and warns you a cross join is happening. The problem is that we really need the cross join.

II. Brute force approach:

Since Hive needs keys, what about making an intermediate table with the IDs of all the record pairs?

It’s going to be a lot of them, but after all it's Hadoop. It should be able to handle it. In return you can join tables with a key match and you can have multiple reducers. While this can actually work, it's very slow and it's really an abuse of MapReduce. If you have one million records, then the join mid-table will have one trillion keys. You will need to maintain (or create every time) one trillion records just to cross join two tables. If you want to join some other tables in addition to the original two you need to generate a new join table. In short, don’t do this.

III. Hive performance tuning approach:

What if there were a way to somehow make Hive not force all the record combinations onto a single reducer?

Well, it turns out that it is possible. You can specify the number of reducers explicitly and then use DISTRIBUTE BY to distribute map output to multiple reducers so your similarity function can be run in parallel on multiple reducers. But, before you get too excited, there is a catch: even though Hive will distribute the join pairs to multiple reducers for further processing, Hive will first have to do the cross join to get all of the combinations that will be sent to the reducers. That process can be very slow because the number of mappers that will do this work is determined dynamically based on the size of the larger table. But with a similarity join the larger table is not that large by Hive standards. Even with one million records, which is large for a similarity join, it is still small by Hive standards and will give you only several mappers. Obviously, several mappers are way below what we need. As the data size increases, the number of mappers becomes a bottleneck. To deal with this you can tweak the split sizes to increase the number of mappers, but you will have to set very low limits compared to the Hive defaults. It will, at the very least, be an unconventional usage of Hive and MapReduce. If you don't mind this tweaking and a lot of nested queries, you will end up with something like this:

set mapred.reduce.tasks=100;
set mapred.min.split.size=100000;
set mapred.max.split.size=200000;
set mapred.min.split.size.per.node=100000;
set mapred.max.split.size.per.node=200000;
set mapred.min.split.size.per.rack=100000;

ADD FILE ./distance.pl;
ADD FILE ./filter.pl;

SELECT TRANSFORM (id_a, id_b, distance)
       USING "/usr/local/bin/perl filter.pl"
       AS (id_a, id_b, distance)
FROM (
    -- numerically sorted distances:
    SELECT id_a, id_b, cast(distance as double) as distance
    FROM (
        -- distances:
        SELECT TRANSFORM (id_a, id_b,
                          attribute_a_1, ..., attribute_a_N,
                          attribute_b_1, ..., attribute_b_N)
               USING "/usr/local/bin/perl distance.pl"
               AS (id_a, id_b, distance)
        FROM (
            -- All combos distributed
            SELECT id_a, id_b,
                   attribute_a_1, ..., attribute_a_N,
                   attribute_b_1, ..., attribute_b_N
            -- all pairs:
            FROM (
                SELECT A.id as id_a,
                       B.id as id_b,
                       A.attribute_1 as attribute_a_1, ..., A.attribute_N as attribute_a_N,
                       B.attribute_1 as attribute_b_1, ..., B.attribute_N as attribute_b_N
                FROM (SELECT id, attribute_1, ..., attribute_N FROM table_A) A
                CROSS JOIN (SELECT id, attribute_1, ..., attribute_N FROM table_B) B
            ) ALL_COMBINATIONS
            CLUSTER BY id_a
        ) ALL_COMBINATIONS_DISTRIBUTED
    ) DISTANCES
    DISTRIBUTE BY id_a
    SORT BY id_a, distance
) DISTANCES_DISTRIBUTED_SORTED;

The above query does the following:

  1. Sets the max split size to 200KB so we get a lot of mappers.
  2. Does a cross join with multiple mappers.
  3. Distributes all pairs to multiple reducers.
  4. On the reduce side it uses our "distance.pl" reducer to score similarity of each pair.
  5. At the end it uses our "filter.pl" script to filter out pairs below the desired similarity threshold.

So, we have a pure MapReduce way to run similarity joins. It works, and you can distribute distance calculation load over multiple reducers and have as many mappers as you want. However, there are some major problems with it:

  1. Although we can spawn as many mappers as we want, they are really not doing any calculation work. They are just getting all the pairs that need to be processed. In addition, the forwarding of all those pairs to reducers takes too long. In fact, only 5-10% of the time is spent on the reducer side processing the data, and 90-95% of the time is spent on Hive trying to get the data to reducers. Basically, if you want to use Hive the wrong way, there is no better candidate than the cross join.
  2. As if the cross join issue was not enough, after the calculation of the similarity scores for all pairs, there is an additional step for filtering. This is very inefficient because for each record A it has to sort all similarity scores between record A and all of the records in table B. Aside from being a good lesson in inefficiency, there is nothing good about that. One way to do it efficiently is to remove the sort step and second reducer, and maintain a priority queue in the first reducer so we have our top N matches right after the first reducer without the need for sorting and filtering. But that would add an additional usability problem because every time you need a similarity join you have to deal with the plumbing instead of just focusing on the similarity function.
  3. Even if we didn't have the above mentioned problems, still from the usability point of view, large nested queries like these are cumbersome and error prone in terms of developing and maintaining their code. It is quite the opposite of easy-to-use, which was also an important requirement.

In order to bring this to a satisfactory level, we needed to remove the slow useless parts (the all-pair forwarding step), keep the fast useful parts (reducers doing the actual work), and package it in a form that is easy to use.

IV. Perl steps in to help

Good packaging requires quality glue and, when it comes to glue, Perl has never let us down. We made a small Perl library that makes working with similarity joins a breeze. Also, with optimizations made, it is a lot faster than doing this with Hive alone. The optimizations made are:

  1. Direct data distribution to reducers:
    • The first table is streamed directly to multiple reducers.
    • The second table is cached in a distributed cache so it is available to reducers on all data nodes.
  2. Joining the cached second table with streaming records from the first table on multiple reducers in parallel.
  3. In each reducer the priority queue is maintained and returned when the streamed key has been processed, so we get the top N similar records for the key in sorted order directly from the reducer (no need for a sort step).
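
As a rough sketch of the priority-queue idea in step 3 (not the actual library internals; $bucket_size mirrors the bucket_size option in the example further below), a bounded top-N buffer per streamed key can be as simple as:

# Keep only the N closest matches seen so far for the current key.
# For a small N, a sorted array is a perfectly adequate priority queue.
my $bucket_size = 5;
my @top_n;

sub consider_match {
    my ($id, $distance) = @_;
    push @top_n, [ $id, $distance ];
    @top_n = sort { $a->[1] <=> $b->[1] } @top_n;            # closest first
    splice(@top_n, $bucket_size) if @top_n > $bucket_size;   # drop the rest
}

# Once the streamed key is exhausted, @top_n already holds the top-N matches
# in sorted order and can be emitted directly -- no separate sort step needed.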

Basically, this is a distributed reduce-side cross join coordinated from Perl, which uses Hive and Hadoop as a convenient mechanism for data flow and distributed caching. You use HQL to specify the data sources for the join, and code a distance function in Perl. In principle, you could do all of this without Hadoop and Hive if you are willing to manage all the data distribution and caching yourself, but it is much easier and more cost effective to take advantage of Hadoop Streaming and Distributed Cache for the data flow management during this process.

This is how the above example looks in Perl:

use strict;
use warnings;
use Hive::SimilarityJoin::Runner;

# Create a similarity join job
my $job = Hive::SimilarityJoin::Runner->new({
    # Performance:
    nr_reducers => 100,

    # IN:
    dataset_info => {
        main_dataset => {
            hql => q|
                SELECT
                    id,
                    attribute_1, ..., attribute_N
                FROM
                    table_A
            |,
        },
        reference_data => {
            hql => q|
                SELECT
                    id,
                    attribute_1, ..., attribute_N
                FROM
                    table_B
            |,
        },
    },

    # Top-N bucket size:
    bucket_size => 5,

    # OUT:
    out_info => {
        out_dir  => '/tmp/simjoin_out',
        out_file => 'simjoin_result.tsv',
    },
});

$job->run();

# Provide a similarity/distance function in the main script:
sub similarity_distance {
    my ($rec_a, $rec_b) = @_;
    my ($id_a, $attribute_a_1, ..., $attribute_a_N) = @$rec_a;
    my ($id_b, $attribute_b_1, ..., $attribute_b_N) = @$rec_b;
    my $distance;
    # ... do your calc
    return [$id_a, $id_b, $distance];
}

You can customize the distance calculation to suit your needs. For example, if we want to calculate geographical distances between cities, the similarity distance function becomes:

sub similarity_distance {
    my ($row_a, $row_b) = @_;
    my ($id_1, $lat_1, $long_1) = @$row_a;
    my ($id_2, $lat_2, $long_2) = @$row_b;
    return if $id_1 == $id_2;

    use Math::Trig qw(great_circle_distance :pi);

    my $earthRadius = 6371;
    my $degToRad    = pi / 180.0;

    # convert to radians
    my $long_1_radians = $long_1 * $degToRad || 0;
    my $lat_1_radians  = $lat_1  * $degToRad || 0;
    my $long_2_radians = $long_2 * $degToRad || 0;
    my $lat_2_radians  = $lat_2  * $degToRad || 0;

    # sphere earth approximation
    my $geo_distance = 0;
    $geo_distance = great_circle_distance(
        $long_1_radians, pi/2 - $lat_1_radians,
        $long_2_radians, pi/2 - $lat_2_radians,
        $earthRadius,
    );

    return [$id_1, $id_2, $geo_distance];
}

Also, you don’t need to have the library installed on the remote Hadoop data nodes. It is enough to have it on the box from where you run the script. All the HQL and reducer code will be generated at runtime and uploaded to the remote data nodes.

The source code is available on github.

Benchmark:

We used the above-mentioned geo distance between two cities to compare the performance of the pure MapReduce approach and the hybrid Hive/Perl approach. Depending on the dataset size and the number of reducers, our module performs 4x to 27x faster than the ordinary MapReduce approach.

Here is the summary of the benchmark results for these two approaches:

Hive with performance tuning:
+---------+----------+----------+---------+
| dataset | reducers | reducers | runtime |
|  size   | total    | parallel |         |
+---------+----------+----------+---------+
| 10.000  | 100      | 100      | 8min    |
+---------+----------+----------+---------+
| 25.000  | 500      | 500      | 54min   |
+---------+----------+----------+---------+

Perl-Hive hybrid:
+---------+----------+----------+---------+
| dataset | reducers | reducers | runtime |
|  size   | total    | parallel |         |
+---------+----------+----------+---------+
| 10.000  | 100      | 100      | 2min    |
+---------+----------+----------+---------+
| 25.000  | 500      | 500      | 2min    |
+---------+----------+----------+---------+
| 50.000  | 1.000    | 500      | 3min    |
+---------+----------+----------+---------+
| 200.000 | 5.000    | 500      | 17min   |
+---------+----------+----------+---------+
| 500.000 | 25.000   | 500      | 115min  |
+---------+----------+----------+---------+

What is interesting to note is that the performance difference is not that dramatic for small datasets (the Perl-Hive hybrid is only 4x faster). But as the data grows, the performance difference becomes much more significant: for 25.000 records, the performance increase is 27x in favor of the hybrid solution. For larger datasets the non-hybrid solution was so slow that we decided not to compare at all.

At the end of the day we can process 250 billion distances in less than two hours.

End remarks

For some interesting examples of what we found by analysing our hotels, stay tuned for part two of this post.

Cheers and happy data crunching!

A/B Testing - Concept != Execution


“We tested that, and it failed.”

This typical excuse is rampant in the world of A/B testing, but it can overlook the fact that a concept in and of itself is fundamentally different from the execution of a concept. Ideas often resurface over time. Ones that failed before tend to be labeled as failures, and they never make it out of the gate again.

Yeah, I thought of that years ago… Tried it though, and it didn’t work.

This quick shoot-down mentality can be harmful if not checked. Because Booking.com has been performing A/B tests for about a decade now, it sometimes seems that everything has already been tried. Though we've done a lot, failed many times and won a bit along the way too, there's still so much more we can improve for our customers and their experience on the site. That’s why my reaction to dismissive statements like this is typically, “OK, then… What exactly did you try and how long ago?"

Did their A/B test approach the overall idea in a similar way to the proposed new way? And if so, how long did it run? How did it affect user behavior? And are they 110% certain there were no bugs or usability issues introduced in their implementation?

There are far more ways to fail than there are to succeed.

I have a whole laundry list of questions I ask when I hear that a solid concept has failed. This list stems from my experience that there are far more ways to fail at something than there are to succeed.

That statement is rather pessimistic – and with good reason. I’ve done enough A/B tests (from generating the initial concept to implementing it from nuts-and-bolts technical perspective) to grasp the number of moving parts that could potentially lead to a good idea’s demise.

A seemingly “insignificant change” made off-the-cuff or a hard-to-identify design flaw could have an impact just negative enough to counteract whatever positive effect your change might be having.

Here are some concrete examples that can make good ideas fail:

Increase in page-load time due to a less-than-ideal technical implementation
Did you add some big images, heavy CSS, or some poorly performing JS? Have you kept an eye on any new errors that might have cropped up?


Keep a very close eye on all vital site statistics because changes “unseen” to users are just as impactful as content and visual changes.


Slightly wrong choice of color, typeface, or font size on key elements
Is the most important information eye-catching and legible?


Even something as seemingly insignificant as a serif font used in the wrong place can have a negative impact.


Improper size or placement of a feature in relationship to other content on the page
Does the thing you’re adding or redesigning take attention away from another key element on the page? Did you remove something else to “make space” for the new feature?


You can’t add, remove, or change anything on a page without it affecting how people interact with everything else. Designers tend to have laser-focus on the new stuff they’re doing and forget it changes how stuff around it is used, too.

Poor timing
Have you implemented a tooltip that disappears after a certain length of time? Are you adding content that only makes sense during a particular time of the year?


Showing certain types of information too soon, too late, or not giving people enough time to absorb content can sometimes have the opposite of your intended effect.

Bugs in edge-case scenarios
Is there a rendering issue in a lesser-used browser or device? Is it just as usable on a tablet as it is on a desktop? If the website is multilingual, is everything translated properly and correctly localised?


If enough of these edge cases combine, their cumulative effect could be negative. This, however, is one of the benefits of a comprehensive A/B testing framework! You can use analytics to see the user agents, browser types, and countries where a test is failing. Each user deserves to have a good experience on your website, and resolving the issues you identify in the data can push the results over the edge. A/B testing can help ensure a high-quality experience for everyone.

Making large, small and/or unnecessary changes not inherently linked to the raw concept, which may have unintended consequences
Did you change a line of copy, while also making the words bigger/smaller or adding new color?

If so, then you’ve tainted the concept you’re testing. I’ve seen a slight increase in font size while testing a color change cause a solid and repeatable concept to fail.


Noisy tracking
Are you testing a new flow or a different interaction? Is the content you’re adding or changing not immediately visible on the page?


Track users only when they are actually exposed to the visual change or interact with the element that contains the different behavior. If you’re tracking people who haven’t seen or used the concept you’re testing, then they become statistical noise and dilute the results. Enough noise in your tracking and you can’t hear what your users are trying to tell you.


Who’s the audience?
Who exactly sees the change? Is it exposed to all visitors? Did they come from an email that set a certain expectation of what they would see? Have they come from a paid ad or an organic search? Have they typed in the URL directly? Are they newcomers or returning visitors? Have they made a purchase before? 


A customer’s point of entry and their historical use of the site affects how they interact with content. The more you're able to target messaging to the most relevant users, the more likely you're able to create meaningful interactions that make metrics move.

Low traffic
Did your test include only a small group of users that were exposed to a very tiny change?


To pick up a significant result on a low-traffic website, your changes need to be bigger and bolder to see an effect. In other words, your idea might actually be working, but you just can’t see it in the numbers. The size of the changes you make needs to relate to the amount of traffic you have.


An idea ahead of its time?
When did you test this concept? Has a decent amount of time passed, but the problem still hasn’t been solved?


What doesn’t work today might work tomorrow, and what worked yesterday might be holding you back today.


People are dynamic, and their expectations change as they, and the world around them, evolve. That’s why the products we design have to be just as dynamic.


Designers can tend to be ahead of the curve because we keep our fingers on the pulse of what’s going on. Design trends, the newest HTML & CSS tricks, and fancy technology integration might seem cool to us, but most of the time what we expect is far different from what typical users feel comfortable with.


I always try to remind myself that I’m designing products for normal people to use TODAY.


People who weren’t used to swiping gestures last year might expect to see them this year. A fancy line of code that used to crash browsers could solve that problem a few years later as hardware becomes more powerful.

These are just a few of the most common issues I’ve stumbled upon during my time designing with data.

That’s why gaining an incredible depth of understanding of the high-level concept you’re testing—as well as having a solid grasp on the complexity of your system coupled with a flawless implementation of the solution—is imperative to an idea’s final success (or failure).

Here are some things to keep in mind for when you set up your next A/B test:

  1. Remember the importance of carefully navigating through complexity to cleanly test your concept.

  2. The results, be they positive, negative, or neutral, can help form future iterations of the same concept or can offer you insights into new hypotheses to be tested.

  3. Hold yourself to a high standard of quality—even when in an easy-win situation.

  4. Every A/B test, in spite of its size or scope, should get the same amount of care.

  5. Understand that a negative or neutral result doesn’t necessarily mean “no.” These results can also possibly mean, “Not quite right” or “Not quite yet.” The more you test, the more you’ll be able to spot when “no” actually means “no.”

But wait! There’s more…

Sometimes, however, a concept is so strong that it can survive even the worst of executions. I’m sure you’ve experienced examples of features or functionalities on major websites that are incredibly useful but lack visual refinement and/or have some unfortunate usability issues. This often leads me to a moment of face-palming.

Sometimes a poor implementation keeps a good idea from succeeding. Conversely, a great idea can succeed in spite of a poor implementation.

So, it goes both ways. Sometimes a poor execution keeps a great idea from succeeding, but sometimes a great idea succeeds in spite of a haphazard implementation.

The difference between “average” and “exceptional” data-driven designers is that they realize that a “concept != an execution.”

Autonomous Peer Learning at Booking.com and How You Can Do it in Your Organization


Continuous learning on the job is hard. We all see things we want to improve, but maybe we’re missing a few skills to really make an impact. With most days filled with emails and meetings, there’s often not much time left for learning, no matter how much we want to develop our skills.

Although many organizations try to remedy this issue by employing external companies to handle training, they rarely follow up to ensure such training is actually value for money. Not only that, employees are often left to figure out how their new skills can be applied to daily work, and sometimes they are even left wondering if the training taught them anything useful at all.

I work at Booking.com as an engineering manager, and in my job I wanted to learn about a topic for which there was no formal training. I ended up creating a study group that became the blueprint for autonomous peer learning in our Technology department. It’s an initiative that has been scaled to 50 Peer-to-Peer (or P2P) learning groups over the last 18 months.

The premise of P2P groups is that participants take the time to think about what they want to learn and why. This means their learning is tailored from the very beginning, ensuring that it is both relevant to their work and beneficial to their organization.

What does it feel like to participate in a P2P group?

P2P groups are autonomous peer-learning study groups. Up to six participants meet every week to exchange ideas and learn about their chosen topic. There is no trainer, and participants steer the group in terms of both form and content.

I started and facilitated the very first P2P group at Booking.com, which looked at Distributed Systems. Each week, participants would agree to discuss a topic within that main field, and prepare by researching a publication, video presentation, or open source system tied into that week’s topic. Participants would present what they had learned and answer questions from others, with a vote at the end of each session to decide the topic for the following week.

Of course, there is only so much you can do in six weeks. The P2P format tends to be good for a broader overview, but less useful for in-depth learning. Nevertheless, I have seen very good outcomes from our P2P groups.

One of the participants in my very first group joined as he was already researching even better ways to store our accommodation availability data. He went on to help implement a totally new system that is now used in production, and ended up using his experience with the group to delve deeper into distributed storage challenges.

In another P2P group on the topic of Short-Format Writing, participants would write drafts for articles and give each other feedback on their writing each week. As a participant of that group, I worked on multiple drafts about autonomous learning at Booking.com, which eventually turned into the very article that you’re reading right now!

The key ideas of P2P groups

After keeping a close eye on 50 groups running over 18 months, I’ve condensed the core concepts of P2P learning into six key ideas:

  1. Make it relevant to your organization. By shaping your learning goals into something beneficial to the success of your organization, you will contextualize what you learn and create new connections with current knowledge and skills. Ask yourself what result you want to achieve, and how your newly developed skills will help you accomplish that.

  2. There is no trainer. Instead, everybody teaches and everybody learns. If there is a teacher in the room, the group will revert to classroom learning. If you think you need a teacher that’s fine, but then you’re doing regular training, not P2P learning.

  3. One facilitator who is not an expert. The facilitator knows as much as any other participant about the topic, and is principally there to help with logistics, group communication, and gathering learning materials.

  4. Keep the group small. Four to six participants maximum, including the facilitator. This guarantees that everybody has an opportunity to share.

  5. Be ready to commit time. The group meets for two hours per week for six weeks, during working hours. Participants must be ready to spend one to three hours per week studying at home. Having a finite duration of six weeks makes participants feel more comfortable about investing their time.

  6. Do a kick-off meeting to align expectations. Before starting the weekly sessions, the group should do a kick-off so that everyone can express their personal goals and agree on their expectations about what they want to learn, and how they want to learn it.

Of course, these key ideas are just guidelines, and you should adjust them to the needs and structure of your organization. P2P groups are all about autonomy, and are only as good as you make them. The more time and effort you put in, the more you’ll get out. The diagram below shows what a typical P2P weekly cycle looks like:

Diagram of a typical P2P weekly cycle

Among the P2P groups at Booking.com so far, many have adjusted those guidelines to best fit their individual learning styles. One group, for example, decided to refresh their algorithms skills. As such, they didn’t do any studying at home, but instead picked a complex algorithmic problem every week for some hands-on programming. This enabled them to compare solutions and learn from one another.

Another group wanted to study Deep Learning. They decided they wanted to have more structure, so they followed an online class together. Another group decided that the best way for them to learn was to build something, so they started working on a side project together. This meant that each of them could pick a part of the project that focused on the skills they most wanted to learn.

Before you start: The kick-off meeting

It’s a good idea to schedule a kick-off meeting before you start the weekly sessions. This is an opportunity to find alignment among participants and clear up common logistical problems.

The meeting should include:

  • A round table to give all participants a chance to express their personal expectations.
  • A discussion on how you will make your learning relevant to the success of your organization.
  • An agreement on what the group will learn, and how they want to learn it.

In most kick-off meetings, potential participants are meeting for the first time and thus tend to be more accommodating with each other. There are, however, a few areas where compromise should be avoided:

  • If people have vacation plans that are incompatible with the level of involvement expected of the group, then it’s best they don’t participate at all. They will have a chance to join another P2P group on the same topic later.
  • For some topics, having beginners and more experienced participants in the same group can cause difficulties. A possible solution is to split the group by skill level.
  • If the group is too big, split it up. Six participants is generally the recommended maximum.

Finally, if you get the feeling that alignment was not reached at the end of your kick-off meeting, don't hesitate to do another one.

How can we learn without teachers?

By now you may be wondering, “How is learning supposed to happen without a teacher in the room?”

Most of us suffer from what I call “Classroom Stockholm Syndrome”. We have been conditioned to think that the optimal way to learn is a course led by a teacher, and P2P groups challenge this belief. In the conventional classroom setup, students may refrain from asking questions due to the fear of sounding stupid. Most of the time is spent passively listening to the teacher, and communication often happens through the teacher, and not between students.

The regularity of weekly P2P sessions creates a positive peer pressure loop that encourages participants to study at home and avoid procrastination. By having to repackage content and explain it to others during weekly sessions, participants are actively involved with the knowledge and skills they are trying to acquire.

Giving feedback in a P2P group on the topic of Short-Format Writing

Introducing autonomous learning in your organization

P2P groups have worked well at Booking.com, but that doesn’t mean they’ll work everywhere. To start off, you’ll need to take small steps and treat P2P groups as an experiment. Your goal should be to do a single pilot P2P group in order to gather data and see if that format is a good fit for your organization. If you are happy with the results, then you can begin thinking about scaling it up.

Before you start, talk to peers, your direct manager and your team to explain what P2P groups are about. Ask them what friction points they foresee and how they would solve them. Take plenty of notes when it comes to feedback and input – you never know when that extra info might prove useful.

You’ll need to get buy-in from your manager to make sure they agree with you spending time on P2P learning. The number one rebuttal you might face is scope, so make it clear that you’re only experimenting with a single group at this stage. The decision to expand to more groups can be left for later, therefore in these early stages the cost is low.

Another likely rebuttal is time investment. Why should your organization allow unsupervised training during working hours? First, if you pick a work-related topic, you will be able to invest what you learn back into your organization. Second, people who may not often interact with each other will create new bonds, making the organization stronger.

Once you have the feeling that you have enough buy-in from your manager, you can move forward with your pilot group.

Cherry-pick the participants of your pilot group

Is there a topic you really want to learn that would benefit your organization, and for which there is no formal training? Great, make it the topic of your pilot P2P group!

The mix of people you choose for your pilot group is critical. I recommend that you recruit people you consider to be “work friends”, as comfortable communication with them will be one less source of uncertainty.

Go and chat with them individually to explain the concept of P2P groups and the topic you have in mind. Once you are done with your pitch, tell them on the spot that you want them to be a part of it. Remember to mention that they should inform their own managers too.

Arrange your kick-off meeting as soon as you have a group of four to six people, including yourself. Once the group has its expectations aligned, you can start your six weeks of weekly sessions. Track your progress every week, taking notes of what goes well and what goes wrong; this will be invaluable when you reflect on your experiences.

Share your results with your organization

After reaching the end of your six-week pilot group, you should think about your learning on two levels. First, have you learned something about the topic you were interested in? Second, have you learned something about running P2P groups in the context of your organization? Spend some time asking participants for their feedback too.

If you’ve learned that P2P groups are not the right thing for you, it’s no big deal. It’s already great that you had the energy and drive to get this far. If you’ve learned that it works and if the participants ask for more, you already have an opportunity to help more people to learn.

At this stage, organize a meeting with the participants from your pilot group. As they have already been in a P2P group, they’re the perfect ambassadors to expand the initiative to more groups and topics. Ask them if there are other topics they would like to learn using the P2P format, and if they would like to be facilitators themselves.

Depending on your organization, asking for help from whoever is in charge of regular training is almost certainly a good idea. They already have communication channels in place that you can use, both to increase awareness of P2P groups and recruit participants.

Participants in a P2P group about Distributed Systems listening to a presentation

Challenges of autonomous peer learning

Of course, autonomous peer learning isn’t perfect. Over the last 18 months, our Technology department has seen 50 P2P groups on technical and non-technical topics alike. Internal feedback surveys suggest that around 20% of those groups had an ‘unsuccessful’ outcome, meaning they either failed to reach the end of their six week timeframe, or stopped due to participant dissatisfaction. The most common reasons for the initiative stalling are a lack of structure, lack of time commitment, and incompatible expectations between participants.

Although people can be motivated when they initially sign up, other work obligations may arise that prevent them from joining weekly sessions. It is common for a group of six to have only four or five participants at any given session. The best way for a group to deal with this is to meet anyway and help the absent participants catch up later. If this happens too frequently, however, it can completely derail a group.

Finally, there is the challenge of finding high quality learning materials that are a good fit with the skill level of the participants. The P2P facilitator generally ends up doing this, so facilitating a group can be daunting and time consuming. If you want a group to succeed, you need a facilitator that is determined and proactive.

Now go experiment with autonomous peer learning!

There are many ways to approach autonomous peer learning! I hope that this article has given you a glimpse into how it has worked at Booking.com, and inspired you to look into autonomous learning in your own organization.

Write a comment below if you have questions about P2P groups, or if you want to share your own experience with autonomous learning!

Thanks to Jonathan Cooper for his help editing this article, and to Manuel Ceron, Stefanie Kolle, Vladimir Smirnov, and Oleg Komarov for their feedback on early drafts.

Named Entity Classification


abstract: The customers of Booking.com communicate with us through different mediums. They perform queries on our search engine, provide us with reviews about their stay and describe their opinion about different destinations. All this communication creates an abundance of textual information — and it's a key part of our job to understand it. The first step towards this goal is the recognition of the named entities (the sequences of words in the text which correspond to categories such as cities, accommodation, facilities, etc.). In this blog post, we display a comparison of different approaches that can be used in order to tackle such a Named Entity Classification task.

1. Introduction

Booking.com customers are in constant communication with our website and provide us with a plethora of different textual information in this process. Our customers "talk" to us in all steps of their journey; from the start of their experience while posing queries to our search engine, to long after they've returned from their trip, where they provide us with feedback about their stay and information about the place they visited.

All these interactions create a vast amount of structured and semi-structured information in textual format that contains valuable information about their experience on our website, their accommodation, and the place they visited. It is of the utmost importance for us to be able to understand the information in this large set of data, as it is an essential building block for everything we do at Booking.com. First and foremost, we should recognise the entities in the text. One can treat this problem as a Named Entity Classification task.

This blog post describes three prototype solutions for the task of Named Entity Classification in the context of Booking.com. The aim is to present different approaches to the classification task, analyse their implementation and compare them in a small scale prototype use case. Sample code in Python is also provided in the following sections for each model described.

Fig. 1: Searching in Booking.com

Fig. 2: Reviews in Booking.com

2. Models

Three approaches were followed in order to tackle the problem of Named Entity Classification. The first approach uses Structural SVM, the second Recurrent Neural Networks with Word Embeddings, and the third Learning2Search.

For the SVM approach, MITIE is used. MITIE is an open-source natural language processing library from MIT, focused on information extraction. The library uses state-of-the-art statistical machine learning.

The second approach follows a different path and utilizes RNNs with Word Embeddings. This approach was proven successful for the Slot Filling task by Mesnil et al. [5], in a collaboration between the University of Montreal and Microsoft Research. Slot filling is a Spoken Language Understanding task whose aim is to assign a label to each word of a given sentence.

Last but definitely not least, Learning2Search (L2S) [1] was used. It was created by the team that built Vowpal Wabbit and was presented by Langford and Daumé at ICML 2015. L2S follows a sequential decision-making strategy, and its usage for Named Entity Classification was presented in the tutorial.

Before discussing each approach further, it is useful to give some insight into the task and the data used by the models.

3. Problem and data

In this blog post, we analyze different state-of-the-art approaches and compare them in a small Named Entity Classification task.

The goal of this example task is to recognize the following labels in textual strings:

  • destinations (dest)
  • property types (prop_type)
  • facilities (fac)

The model should be able to process queries like the following and assign labels to the words they consist of. For instance, in the query “hotels amsterdam wifi” we would like to have the following labeling:

  • hotels:property type
  • Amsterdam:destination
  • wifi:facility

We build our synthetic training data by creating combinations and permutations of the words from our corpora of destinations, facilities, property types, etc. Since this work is a prototype to explore potential models for performing Named Entity Classification, we focused only on the English language.
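As an illustration of what that generation step can look like, here is a minimal Python sketch; the tiny corpora and the way the words are permuted are placeholders, not our production vocabularies or generation code.

# Minimal sketch: generating labeled synthetic queries from small corpora.
# The corpora below are hypothetical placeholders, not our production data.
import itertools
import random

destinations = ["amsterdam", "paris", "london"]
property_types = ["hotel", "hostel", "apartment"]
facilities = ["wifi", "parking", "pool"]

def make_examples():
    examples = []
    for dest, prop, fac in itertools.product(destinations, property_types, facilities):
        tokens = [(prop, "prop_type"), (dest, "dest"), (fac, "fac")]
        random.shuffle(tokens)              # permute word order to add variety
        words = [word for word, _ in tokens]
        labels = [label for _, label in tokens]
        examples.append((words, labels))
    return examples

# e.g. (['hotel', 'amsterdam', 'wifi'], ['prop_type', 'dest', 'fac'])
print(make_examples()[0])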

4. Structural SVM approach

The first approach uses Structural SVM. MITIE (https://github.com/mit-nlp/MITIE) uses Structural SVM to perform named entity classification. It is a C++ library that provides APIs in C, C++, Java, R and Python (2.7).

It is open source and has proven to be on par with Stanford NLP on the Named Entity Recognition task using the CoNLL 2003 corpus (testb): MITIE achieved an F1 score of 88.10%, versus 86.31% for Stanford NLP (https://github.com/mit-nlp/MITIE/wiki/Evaluation).

It is also fast in comparison to other models that attempt to solve the task of named entity recognition.

Fig. 3: Speed of MITIE VS other approaches

4.1. Under the hood

Before we move to the implementation details of the model, it is useful to describe how MITIE works. The library is a bit tricky to read, and the examples (https://github.com/mit-nlp/MITIE/tree/master/examples) do not clearly display what is happening “under the hood”.

MITIE chunks each sentence into entities, and each entity is labeled by a multi-class classifier. In order to classify each chunk, MITIE creates a 500K-dimensional vector which is the input to the multi-class classifier. The classifier learns one linear function for each class, plus one for the “not an entity” class. The feature extraction source code can be found in the ner_feature_extraction.cpp file (https://github.com/mit-nlp/MITIE/blob/master/mitielib/src/ner_feature_extraction.cpp). It uses the Dlib toolkit (https://dlib.net/), a C++ machine learning library.

Some of the features are the following:

  • Does it contain numbers?
  • Does it contain letters?
  • Does it contain letters and numbers?
  • Does it contain hyphens?
  • Does it have alternating capital letters in the middle?

If one has defined N classes, the classifier has 500K*(N+1) values to learn. For our prototype with N=3 labels, that amounts to roughly two million parameters.

4.2. Code for MITIE

For the prototype, the Python binding was used, together with the example code on how to perform named entity recognition that can be found on GitHub (https://github.com/mit-nlp/MITIE/blob/master/examples/python/ner.py).

The library is language dependent and needs to learn the characteristics of a language in order to operate. It comes with pre-built English and Spanish models. The English one was used for the prototype (ner/mitie/MITIE-models/english/total_word_feature_extractor.dat). Using an internal textual dataset from Booking.com data could be beneficial to the model.

If one would like to build such a feature extractor, the wordrep tool can be used with a simple statement like:

wordrep -e a_folder_containing_only_text_files

Since all of the above is easier to grasp with some actual code, the following script shows how to use the model. The construction of the input to the model is simple: we have to define the range of each token within the sentence and the label that is assigned to it.

For instance, let’s assume that we have the example “hotel amsterdam wifi”. The code below displays how to add it to the trainer.
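The snippet below is a minimal sketch based on MITIE's bundled Python training example (ner_train.py); the feature extractor path and the way our three labels are mapped onto token ranges are illustrative, not our exact prototype code.

# Minimal sketch following MITIE's Python training example (Python 2.7,
# as used by the MITIE binding). Paths and labels are illustrative.
from mitie import ner_trainer, ner_training_instance

trainer = ner_trainer("MITIE-models/english/total_word_feature_extractor.dat")

# The query "hotel amsterdam wifi": each entity is a half-open token range.
sample = ner_training_instance(["hotel", "amsterdam", "wifi"])
sample.add_entity(xrange(0, 1), "prop_type")
sample.add_entity(xrange(1, 2), "dest")
sample.add_entity(xrange(2, 3), "fac")
trainer.add(sample)

# ... add the rest of the synthetic examples in the same way ...
trainer.num_threads = 4
ner = trainer.train()
ner.save_to_disk("booking_ner_model.dat")

Each call to add_entity marks a token range with a label; the trained model is then saved to disk and can be loaded later to extract entities from new queries.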

5. RNNs with Word Embeddings

Mesnil et al. [5] have demonstrated the performance of Recurrent Neural Networks with Word Embeddings on one of the major spoken language understanding (SLU) problems: slot filling. They implemented the Elman-type RNN [2] and the Jordan-type RNN [3].

5.1. The slot filling task

A lot of research has been conducted on semantic parsing in SLU. It comprises three well-defined tasks: domain detection, intent determination, and slot filling. The majority of approaches to the slot filling task attempt to perform sequence classification. Approaches based on Conditional Random Fields [4] have proven successful at this task.

A classic benchmark for this task is the ATIS (Airline Travel Information System) dataset, collected by DARPA. The dataset follows the Inside-Outside-Beginning (IOB) representation (https://en.wikipedia.org/wiki/Inside_Outside_Beginning).

An example from this dataset is the following.

  • Input: show flights from Boston to New York today
  • Output
    • show: null
    • flights: null
    • from: null
    • Boston: B-dept
    • to: null
    • New: B-arr
    • York: I-arr
    • today: B-date

5.2. Word Embeddings

Word Embeddings have recently been receiving more publicity since Google’s Word2Vec (https://code.google.com/archive/p/word2vec/) uses them. Words are mapped to real-valued embedding vectors using corpora/vocabularies of text in an unsupervised way.

5.3. Context window

A word-context window is used in order to capture short-term temporal dependencies. The context window is used because there is no temporal feedback.

The context-window consists of the ordered concatenation of word embedding vectors. For instance, the following is an example for a context window of size of 3: $$w(t)=[hotel,\textbf{amsterdam},wifi]$$ $$\textbf{`amsterdam'} \rightarrow x_{amsterdam} \in R^d$$ $$w(t) \rightarrow x(t) = [x_{hotel},x_{\textbf{amsterdam}},x_{wifi}] \in R^{3d}$$

where $w(t)$ is the 3-word context window around the i-th word ‘amsterdam’, $x_{\textbf{amsterdam}}$ is the embedding vector of ‘amsterdam’, and $d$ is the dimension of the embedding vector.

$x(t)$ is the ordered concatenation of the embedding vectors of the words in $w(t)$.
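To make the notation concrete, here is a small Python sketch that builds $x(t)$ for a context window of size 3; the vocabulary and the randomly initialised embedding matrix are toy placeholders (and padding at sentence boundaries is ignored).

# Illustrative only: build the context-window vector x(t) described above.
import numpy as np

d = 50                                          # embedding dimension
vocab = {"hotel": 0, "amsterdam": 1, "wifi": 2}
embeddings = np.random.randn(len(vocab), d)     # one row per word (toy values)

def context_window_vector(words, i, size=3):
    half = size // 2
    # indices of the words around position i (no boundary padding here)
    idx = [vocab[w] for w in words[i - half:i + half + 1]]
    return np.concatenate([embeddings[j] for j in idx])   # shape: (size * d,)

x_t = context_window_vector(["hotel", "amsterdam", "wifi"], i=1)
print(x_t.shape)   # (150,) == 3 * d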

5.4. Two types of RNNs

Two variants of RNNs for modeling the slot sequences were used in the paper by Mesnil et al. [5]. In Elman-type RNN the output from the hidden layer at time t-1 is kept and fed back to the hidden layer at time t; this adds some kind of “virtual context nodes” in the process and enables the network to maintain and learn a summary of the past inputs. This enables the network to perform sequence-prediction that the standard feed-forward neural network cannot do.

Jordan-type RNNs are similar to Elman-type, with the difference being in the use of context nodes. In the Jordan-type, the context nodes are fed from the output layer and not from the hidden layer as in the Elman-type.

5.5. Results on Slot Filling

Mesnil et al. compared their approach against Logistic Regression models, CRF, and a Multilayer Perceptron. The major points of their results are the following:

  • Models that use the sequential dependency outperform the models that do not.
  • RNN models perform consistently better than the CRF model.
  • Elman-type RNNs that use past information perform very well, but the Elman-type RNN that uses future information does not, despite the two being symmetric. This can be explained by the format of the ATIS dataset, which has most of the information in the second half of the sentences.
  • For the Elman-type RNN the best window size was 3 for the forward model and 13 for the backward model.
  • Jordan-type RNNs proved to be more robust, especially the bi-directional version.

Table 1: RNN results on Slot Filling task

5.6. Code for RNN

Microsoft Research released some code related to this work on GitHub (https://github.com/mesnilgr/is13). In order to run, the code needs Theano (https://deeplearning.net/software/theano/). For this prototype, the Elman-type RNN was used, following the sample code from the repository (https://github.com/mesnilgr/is13/blob/master/rnn/elman.py).

A similar process for building the training data as in MITIE has to be followed. The following script provides sample code on how to prepare the data and train the RNN.
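Since the exact interface of the is13 code may have changed, the sketch below only shows the data preparation side in Python: words and labels are mapped to integer indices, and a hypothetical ElmanRNN wrapper (standing in for the repository's model class) would be trained one sentence at a time.

# Rough sketch of data preparation for the Elman-type RNN. `ElmanRNN` is a
# hypothetical stand-in for the model class in the is13 repository.
import numpy as np

word2idx = {"hotel": 0, "amsterdam": 1, "wifi": 2}
label2idx = {"prop_type": 0, "dest": 1, "fac": 2}

train_set = [(["hotel", "amsterdam", "wifi"], ["prop_type", "dest", "fac"])]

def encode(words, labels):
    # one integer per word and per label, as the RNN expects index sequences
    x = np.array([word2idx[w] for w in words], dtype="int32")
    y = np.array([label2idx[l] for l in labels], dtype="int32")
    return x, y

# rnn = ElmanRNN(nh=100, nc=len(label2idx), ne=len(word2idx), de=50, cs=3)
for epoch in range(25):
    for words, labels in train_set:
        x, y = encode(words, labels)
        # rnn.train(x, y, learning_rate=0.01)   # one sentence per update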

6. VW Learning2Search

John Langford (Microsoft Research) and Hal Daumé III (University of Maryland) presented the Learning2Search [1] approach in their “Advances in Structured Prediction” tutorial at ICML 2015 (https://hunch.net/~l2s/merged.pdf).

Learning to search is a method for solving complex joint prediction problems, based on learning to search through a problem-defined search space.

The major difference between Learning2Search (L2S) and the rest of the state-of-the-art models is in the way it approaches the task of structured prediction. The majority of state-of-the-art approaches can be characterised as “global models”; they have the advantage of clean underlying semantics and the disadvantage of being computationally costly and difficult to implement. L2S, on the other hand, treats the problem as a sequential decision-making process.

Sequential decision making approaches have been recently used in dependency parsing and a few toolkits for NLP have been published such as nlp4j (https://github.com/emorynlp/nlp4j) from Emory University, and MaltParser (https://www.maltparser.org/) from Växjö University and Uppsala University.

6.1. Learning2Search vs Other Approaches

The goal for Learning2Search was to create a model that has the following characteristics:

  • Lower programming complexity
  • Good prediction accuracy
  • Efficiency in terms of both train and test speed

The following graph displays a comparison in terms of lines of code between Conditional Random Field approaches (CRFSGD, CRF++), Structured SVM (S-SVM), and Learning2Search.

Fig. 4: Programming Complexity of L2S VS State-of-the-art

The following graph displays a comparison in terms of accuracy and training time between Conditional Random Field approaches (CRFSGD, CRF++), Structured Perceptron (https://en.wikipedia.org/wiki/Structured_prediction), Structured SVM (S-SVM), and Learning2Search.

Fig. 5: Training time and accuracy of L2S VS State-of-the-art

The following graph displays a comparison in terms of prediction time between Conditional Random Field approaches (CRFSGD, CRF++), Structured Perceptron (https://en.wikipedia.org/wiki/Structured_prediction), Structured SVM (S-SVM), and Learning2Search.

Fig. 6: Prediction time of L2S VS State-of-the-art

6.2. Code for VW L2S

Along with the ICML 2015 tutorial, an iPython Notebook for L2S was released. Be aware that one has to remove the --audit command in line 22, because it crashes the program.

Following is sample code that one can use to employ Learning2Search for our Named Entity Classification problem.
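The sketch below follows the sequence-labelling example from the Vowpal Wabbit Python tutorial notebook, adapted to our three labels; module layout and option names can differ between VW versions, so treat it as an outline rather than the exact code we ran.

# Sequence labelling with VW Learning2Search, adapted from the VW Python
# tutorial's SequenceLabeler example. Labels: 1=dest, 2=prop_type, 3=fac
# (VW actions are 1-based).
from vowpalwabbit import pyvw

class SequenceLabeler(pyvw.SearchTask):
    def __init__(self, vw, sch, num_actions):
        pyvw.SearchTask.__init__(self, vw, sch, num_actions)
        sch.set_options(sch.AUTO_HAMMING_LOSS | sch.AUTO_CONDITION_FEATURES)

    def _run(self, sentence):
        output = []
        for n in range(len(sentence)):
            label, word = sentence[n]
            with self.vw.example({'w': [word]}) as ex:
                pred = self.sch.predict(examples=ex, my_tag=n + 1,
                                        oracle=label,
                                        condition=[(n, 'p'), (n - 1, 'q')])
                output.append(pred)
        return output

training_data = [[(2, 'hotels'), (1, 'amsterdam'), (3, 'wifi')]]

vw = pyvw.vw("--search 3 --search_task hook --ring_size 1024 --quiet")
task = vw.init_search_task(SequenceLabeler)
for _ in range(5):                               # a few passes over the data
    task.learn(training_data)
print(task.predict([(1, 'hostel'), (1, 'paris'), (1, 'parking')]))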

7. Results

As described earlier, a small use case was built for the comparison of the three approaches. The top 10% of clicked destinations were used to build a sample dataset for the prototype models presented. The task for the models was to recognize the following labels:

  • destination (dest)
  • facility (fac)
  • property type (prop_type)

The whole dataset, created using the different combinations of destinations, facilities, and property types, was around 200,000 rows, and 20% of it was used as a test set to evaluate the models. Table 2 displays the results of all three approaches on our Named Entity Classification task. The major points from the comparison are the following:

  • L2S is by far the best model.
  • Structural SVM (MITIE) performed better than the Elman RNN.

It is also worth mentioning that the resource demands of the three approaches were vastly different. L2S was by far the least demanding, and it was also the fastest model in terms of training time. Both the Structural SVM and the RNN required close to 20 hours of training using almost 100GB of memory, while L2S ran on a MacBook Pro in around 15 minutes using 1GB of memory.

Table 2: Comparison of MITIE, Elman and L2S

8. Discussion and Conclusion

In this blog post, we presented three different approaches to Named Entity Classification. An example problem was used in order to compare them. There are different use cases inside Booking.com to which such models can be applied, and the models can easily be adjusted to recognise different kinds of labels.

It is important to note that the models solve the classification problem, and not the problem of mapping the recognised terms to the Booking.com inventory.

Summing up, one thing is clear: Named Entity Classification could really help tackle this problem by providing a better understanding of the various textual inputs of our customers, and it fits well with the top priority of applied Data Science at Booking.com - to enhance the experience and satisfaction of our customers.

References

[1] Chang, K.-W., He, H., Daumé III, H., and Langford, J. Learning to search for dependencies. arXiv preprint arXiv:1503.05615 (2015).

[2] Elman, J. L. Finding structure in time. Cognitive science 14, 2 (1990), 179–211.

[3] Jordan, M. I. Serial order: A parallel distributed processing approach. Advances in psychology 121 (1997), 471–495.

[4] Lafferty, J., McCallum, A., and Pereira, F. C. Conditional random fields: Probabilistic models for segmenting and labeling sequence data.

[5] Mesnil, G., He, X., Deng, L., and Bengio, Y. Investigation of recurrent-neural-network architectures and learning methods for spoken language understanding. In INTERSPEECH (2013), pp. 3771–3775.

On Diversity


It has become more and more common to vocally declare oneself in favor of diversity, notably in tech. I do not think, however, that what diversity actually brings is well understood across the industry. Diversity of all types and an open, dynamic workplace has been a pillar at Booking.com since day one, and something we continue to strongly believe in and build today.

At one level, having a diverse team makes it easier to build products for a larger and more diverse audience from all over the world. But the concept of diversity is far from simply contributing to the effectiveness of a business organization.

The individuals

Diversity and embracing diversity are much more profound. First and foremost, it is important for each person to be accepted. In a workplace where diversity is recognized and highly regarded there is less emphasis on conforming to a supposedly standard model of behavior. People are freer to just be themselves. Conversely, playing a role and trying to blend in can take a large, underestimated toll on our cognitive resources. Remove that and people will be more open, more creative, and more willing to bring in ideas and critiques, instead of trying to disappear into the background for fear of being singled out as not fitting the expectations of the loudest, or of falling foul of the dark matter of unconscious or unspoken prejudice.

The teams

This is not all. In a diverse team people naturally become more attuned to the nuances of communication. What is left untold can no longer be assumed; it must be made explicit, because the shared cultural background is much smaller. Therefore, the risk of teams becoming dysfunctional because of miscommunication decreases.

The communication

Speaking of communication: valuing diversity also means taking active steps to ensure that our work environment is safe for everyone. It is far too easy—easier than many realize—to create a toxic work environment through disrespectful communication and lack of empathy.

By definition, any work environment is a small, closed group from which it’s difficult to walk away. Inside such a group, everyone should have an equal opportunity to be part of the group conversation, with nobody feeling diminished or a less worthy contributor. Freedom of expression, often used as a justification to say anything, even something offensive, can nevertheless also be the cover for a fraction of the group dominating all expression, even unintentionally.

For example, not everyone feels empowered to respond to a remark that shocks them. If you do feel empowered to respond, it's likely that you are already part of that more vocal, dominant fraction of the group (which is not always the majority). Therefore, issuing a shocking or hurtful remark can effectively mean cornering other colleagues into silence, even if unwillingly. (And remember that locker-room banter always comes out eventually.) Still, this is OK; failing is acceptable as long as we fail fast and learn (one of our core mantras at Booking.com). In other words, if someone speaks up denouncing an affront, a fast acknowledgement and apology reinforces the message that speaking up is welcome and valued by all of us.

The community

Finally, diversity is valuable for our own, personal sake, apart from being in the interest of the company we work for. We spend a large part of our lives at work. Why should work be, at all, about interacting with rows of faceless clones, all alike? Why wouldn't you choose colleagues that will teach you, challenge you, enrich you, and make you better instead? Why wouldn't we encourage you to contribute in the same ways?

Be humble, open, and friendly and remember that our diversity gives us strength. After all, this is one of our cornerstone values at Booking.com.


Using Riak as Events Storage - Part 4


We saw in the previous parts of this blog series how to gather, aggregate and store events in Riak, and how to fetch them for external processing. In this post, we'll see how to use post-commit hooks to apply transformations to the events data stored in Riak, without using MapReduce jobs.

If you missed Part 3...

We strongly recommend that you read Using Riak as Events Storage - Part 3 of this blog series to understand why MapReduce doesn't fit our needs. The previous parts explain how Booking.com collects and stores events from its backend into a central storage, and how we use it to do events analysis.

Strategy and Features

The previous parts introduced the need for data processing of the events blobs that are stored in Riak in real-time, and the strategy of bringing the code to the data:

instead of bringing the data to the processing code, let's bring the code to the data

Using MapReduce for computing on-demand data processing worked fine but didn't scale to many users (see part 3).

Finding an alternative to MapReduce for server-side real-time data processing requires listing the required features of the system and the compromises that can be made:

Real-time isolated data transformation

As seen in the previous parts of this blog series, we need to be able to perform transformations on the incoming events with as little delay as possible. We don't want any lag induced by a large batch processing. Luckily, these transformations are usually small and fast. Moreover, they are isolated: the real-time processing may involve multiple types and subtypes of events data, but should not depend on knowledge of previous events. Cross-epoch data processing can be implemented by reusing the MapReduce concept: a Map-like transformation is computed on each events blob independently, leaving the Reduce phase up to the consumer.

Performance and scalability

The data processing should have a very limited bandwidth usage and reasonable CPU usage. However, we also need the CPU usage not to be affected by the number of clients using the processed data. This is where the previous attempt using MapReduce showed its limits. Of course, horizontal scalability has to be ensured, to be able to scale with the Riak cluster.

One way of achieving this is to perform the data processing continuously and upfront, for every datum that reaches Riak. That way, client requests only query the results of the processing, and do not trigger computation at query time.

No back-processing

The data processing will have to be performed on real-time data, but no back-processing will be done. When a data processing implementation changes, it will be effective on future events only. If old data is changed or added (usually as a result of reprocessing), data processing will be applied, but using the latest version of processing jobs. We don't want to maintain any history of data processing, nor any migration of processed data.

Only fast transformations

To avoid putting too much pressure on the Riak cluster, we only allow data transformations that produce a small result (to limit storage and bandwidth footprint) and that run quickly, with a strong timeout on execution time. Back-pressure management is very important, and we have a specific strategy to handle it (see "Back-pressure management strategy" below).

The solution: Substreams

Substreams, a simplified overview

With these features and compromises listed, it is now possible to describe the data processing layer that we ended up implementing at Booking.com.

This system is called Substreams. Every second, the list of keys of the data that has just been stored is sent to a companion app - a home-made daemon - running on every Riak node. The companion app fetches the data, decompresses it, runs a list of data transformation code on it, and stores the results back into Riak, using the same key name but in a different namespace. Users can then fetch the processed data.

A data transformation code is called a substream because most of the time the data transformation is more about cherry-picking exactly the needed fields and values out of the full stream, rather than performing complex operations.

The companion app is actually a simple pre-forking daemon with a REST API. It's installed on all nodes of the cluster, with around 10 forks. The REST API is used to send it the list of keys and to wait for the process completion. The events data doesn't transit via this API; the daemon fetches the key values from Riak itself, and stores the substreams (the results of data transformation) back into Riak.

The main purpose of this system is to drastically reduce the size of data transferred to the end user by enabling the cherry-picking of specific branches or leaves of the events structures, and also to perform preliminary data processing on the events. Usually, clients are fetching these substreams to perform more complex and broader aggregations and computations (for instance as a data source for Machine Learning).

Unlike MapReduce, this system has multiple benefits:

Data decompressed only once

Deserialisation and decompression are done once, for many data processing jobs

A given binary blob of events (at most 500K of compressed data) is handled by one instance of the companion app, which decompresses it once, then runs all the data processing jobs on the decompressed data structure in RAM. This is a big improvement compared to MapReduce, where the most CPU-intensive task is actually to decompress and deserialise the data, not to transform it. Here we have the guarantee that data is decompressed only once in its lifetime.

Transformation at write time, not at query time

Data is created once and for all

Unlike MapReduce, once a transformation code is set up and enabled, it'll be computed for every epoch, even if nobody uses the result. However, the computation will happen only once, even if multiple users request the result later on. Data transformation is already done when users want to fetch the result. That way, the cluster is protected against simultaneous requests from a big number of users. It's also easier to predict the performance of the substream creations.

Hard timeout - open platform

Data decompression and transformation by the companion app is performed under a global timeout that would kill the processing if it takes too long. It's easy to come up with a realistic timeout value given the average size of event blobs, the number of companion instances, and the total number of nodes. The hard timeout makes sure that data processing is not using too many resources, ensuring that Riak KV works smoothly.

This mechanism allows the cluster to be an open platform: any developer in the company can create a new substream transformation and quickly get it up and running on the cluster on their own, without asking for permission. There is no critical risk for the business, as substream runs are capped by a global timeout. This approach is a good illustration of the flexible and agile spirit in IT that we have at Booking.com.

Implementation using a Riak commit hook

detailed picture with the commit hook

In this diagram we can see where the Riak commit hook kicks in. We can also see that when the companion requests data from the Riak service, there is a high chance that the data is not on the current node and Riak has to get it from other nodes. This is done transparently by Riak, but it consumes bandwidth. In the next section we'll see how to reduce this bandwidth usage and have full data locality. But for now, let's focus on the commit hook.

Commit hooks are a feature of Riak that allow the Riak cluster to execute a provided callback just before or just after a value is written, using respectively pre-commit and post-commit hooks. The commit hook is executed on the node that coordinated the write.

We set up a post-commit hook on the metadata bucket (the epochs bucket). We implemented the commit hook callback, which is executed each time a key is stored to that metadata bucket. In part 2 of this series, we explained that the metadata is stored in the following way:

  • the key is <epoch>-<datacenter_id>, for example: 1413813813-1
  • the value is the list of data keys (for instance 1413813813:2:type3::0)

The post-commit hook callback is quite simple: for each metadata key, it gets the value (the list of data keys), and sends it over HTTP in async mode to the companion app. Proper timeouts are set so that the execution of the callback is capped and can't impact the Riak cluster performance.

Hook implementation

First, let's write the post commit hook code:

metadata_stored_hook(RiakObject) ->
    Key = riak_object:key(RiakObject),
    Bucket = riak_object:bucket(RiakObject),
    [Epoch, DC] = binary:split(Key, <<"-">>),
    MetaData = riak_object:get_value(RiakObject),
    DataKeys = binary:split(MetaData, <<"|">>, [global]),
    send_to_REST(Epoch, Hostname, DataKeys),
    ok.

send_to_REST(Epoch, Hostname, DataKeys) ->
    Method = post,
    URL = "https://" ++ binary_to_list(Hostname)
          ++ ":5000?epoch=" ++ binary_to_list(Epoch),
    HTTPOptions = [{timeout, 4000}],
    Options = [{body_format, string},
               {sync, false},
               {receiver, fun(ReplyInfo) -> ok end}],
    Body = iolist_to_binary(mochijson2:encode(DataKeys)),
    httpc:request(Method, {URL, [], "application/json", Body},
                  HTTPOptions, Options),
    ok.

These two Erlang functions (simplified here; they would probably not compile) are the main part of the hook. The function metadata_stored_hook is the entry point of the commit hook when a metadata key is stored. It receives the key and value that were stored, via the RiakObject, and uses the value to extract the list of data keys. This list is then sent to the companion daemon over HTTP using send_to_REST.

The second step is to get the code compiled and Riak set up to use it properly. This is described in the documentation about custom code.

Enabling the Hook

Finally, the commit hook has to be added to a Riak bucket-type:

riak-admin bucket-type create metadata_with_post_commit \
    '{"props":{"postcommit":["metadata_stored_hook"]}}'

Then the type is activated:

riak-admin bucket-type activate metadata_with_post_commit

Now, anything sent to Riak to be stored with a key within a bucket whose bucket-type is metadata_with_post_commit will trigger our callback metadata_stored_hook.

The hook is executed on the coordinator node, that is, the node that received the write request from the client. It is not necessarily the node where this metadata will be stored.

The companion app

The companion app is a REST service running on all Riak nodes, listening on port 5000, ready to receive a JSON blob: the list of data keys that Riak has just stored. The daemon fetches these keys from Riak, decompresses their values, deserialises them, and runs the data transformation code on them. The results are then stored back to Riak.

There is little point showing the code of this piece of software here, as it's trivial to write. We implemented it in Perl using a PSGI preforking web server (Starman). Using a Perl based web server allowed us to also have the data transformation code in Perl, making it easy for anyone in the IT department to write some of their own.
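Purely for illustration, the core loop of such a daemon could look like the following Python sketch. This is not the Booking.com implementation (which is Perl/Starman); the bucket names, the transformation registry, and the assumption of JSON-encoded events are simplifications made for this sketch.

# Illustration only, not production code. Assumes the official Python riak
# client and JSON-decodable event blobs, a simplification of the real
# (compressed, serialised) format.
import riak

TRANSFORMATIONS = {}                      # substream name -> callable

def substream(name):
    def register(fn):
        TRANSFORMATIONS[name] = fn
        return fn
    return register

@substream("events_per_type")
def events_per_type(events):
    counts = {}
    for event in events:
        event_type = event.get("type", "unknown")
        counts[event_type] = counts.get(event_type, 0) + 1
    return counts

def process_keys(data_keys):
    # connection parameters are placeholders
    client = riak.RiakClient(host="127.0.0.1", pb_port=8087)
    events_bucket = client.bucket("events")
    substreams_bucket = client.bucket("substreams")
    for key in data_keys:
        events = events_bucket.get(key).data      # decompress/deserialise in reality
        for name, transform in TRANSFORMATIONS.items():
            result = transform(events)
            substreams_bucket.new("%s/%s" % (name, key), data=result).store()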

Optimising intra-cluster network usage

As we saw earlier, if the commit hook simply sends the request to the local companion app on the same Riak node, additional bandwidth is consumed to fetch data from other Riak nodes. As the full stream of events is quite big (around 150 MB per second), this bandwidth usage is significant.

In an effort to optimise the network usage, we have changed the post-commit hook callback to group the keys by the node that is responsible for their values. The keys are then sent to the companion apps running on the associated nodes. That way, a companion app will always receive event keys for which data are on the node they are running on. Hence, fetching events value will not use any network bandwidth. We have effectively implemented 100% data locality when computing substreams.

Better implementation where metadata is sent to the Riak node that contains the data

This optimisation is implemented by using Riak's internal API, which gives the list of primary nodes responsible for storing the value of a given key. More precisely, Riak Core's API provides the preflist() function (see the API here) that is used to map the result of the hashed key to its primary nodes.

The result is a dramatic reduction of network usage: data processing takes place on one of the nodes that store the given data, and only the metadata (a very small footprint) and the results (a tiny fraction of the data) travel on the wire.

Back-pressure management strategy

For a fun and easy-to-read description of what back-pressure is and how to react to it, you can read this great post by Fred Hebert (@mononcqc): Queues Don't Fix Overload.

What if there are too many substreams? What if one substream is buggy and performs very costly computations (especially as we allow developers to easily write their own substreams)? What if, all of a sudden, the events full stream changes, one type becomes huge, and a previously working substream now takes 10 times longer to compute?

One way of dealing with that is to allow back-pressure: the substream creation system informs the stream storage (Riak) that it cannot keep up, so that it reduces the pace at which it stores events. This is, however, not practical here. Doing back-pressure that way would slow the storage down and transmit the back-pressure upward in the pipeline. However, events can't be "slowed down": applications send events at a given pace, and if the pipeline can't keep up, events are simply lost. So propagating back-pressure upstream would actually lead to load-shedding of events.

The other typical alternative is applied here: doing load-shedding straight away. If a substream computation is too costly in CPU time, wallclock time, disk IO or space, the data processing is simply aborted. This protects the Riak cluster from slowing down events storage - which after all, is its main and critical job.

That leaves the substream consumers downstream with missing data: substream creation is not guaranteed anymore. However, we used a trick to mitigate the issue. We implemented a dedicated feature in the common consumer library code; when a substream is unavailable, the full stream is fetched instead, and the data transformation is performed on the client side.
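The gist of that client-side fallback can be sketched as follows; the function is deliberately generic and the fetcher callbacks are hypothetical, not the internal library API.

# Hypothetical sketch of the consumer-library fallback.
def fetch_substream_or_fallback(epoch, substream_name, transform,
                                fetch_substream, fetch_full_stream):
    """Return the substream result for one epoch, recomputing it on the
    client from the full stream when the pre-computed result is missing."""
    result = fetch_substream(epoch, substream_name)   # small, cheap request
    if result is not None:
        return result
    # The cluster shed load and skipped this epoch: do the work locally.
    events = fetch_full_stream(epoch)                 # large, expensive request
    return transform(events)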

This fallback effectively pushes the overloading issue down to the consumers, who can react appropriately, depending on the guarantees they have to fulfill and on their properties.

  • Some consumers are part of a cluster of hosts that are capable of sustaining the added bandwidth and CPU usage for some time.
  • Some other systems are fine with delivering their results later on, so the consumers will simply be very slow and lag behind real-time.
  • Finally, some less critical consumers will be rendered useless because they cannot catch up with real-time.

However, this multitude of ways of dealing with the absence of substreams, concentrated at the end of the pipeline, is a very safe yet flexible approach. In practice, it is not so rare that a substream result for one epoch is missing (one blob every couple of days), and such blips have no impact on the consumers, allowing for a very conservative behaviour of the Riak cluster regarding substreams: "when in doubt, stop processing substreams".

Conclusion

This data processing mechanism proved to be very reliable and well suited to our needs. The implementation required a surprisingly small amount of code, leveraging features of Riak that proved to be flexible and easy to hack on.

This blog post ends the series about using Riak for event storing and processing at Booking.com. We hope you liked it!

Design Critique — The Booking way


It’s Thursday afternoon. For two hours, a group of designers from different teams get together in one of the meeting rooms in Booking.com’s Amsterdam office to furrow their brows and drink phenomenal amounts of coffee. Once the caffeine kicks in, they get down to the meeting’s true purpose: to run a critical exploration of a presented design coming from one or two of our designers, discussing their work rigorously to explore what works, and what might be improved.

This is the ‘Design Lab,’ the essential critiquing phase of our design process at Booking.com.

These sessions allow us to give thoughtful and actionable feedback. They help designers seeking feedback to receive it in a focused way, providing them with a clear path forward. Through this critical exploration, designers uncover new insights that can help direct their design work.

At Booking.com, we’ve designed these critique sessions so that they tackle the following:

  1. Identify problems in our products and address these problems early on, well before it becomes too difficult to approach these problems in a different way.
  2. Provide designers with insights that can help them change the course of their design.
  3. Provide support for each other across the entire Booking.com design community.
  4. Improve the quality of our work over time through these sessions.

In this article, I’ll share how you can use our design critique method to achieve the same objectives in your design team or your own design work.

Make your intention known

Start by sending out a message of your intention to run a design critique session to your colleagues. Our design community has an email list, so we usually do this by sending out an email. If it is the first time you are going to have a critique session, it’s important to include the following in your communication:

  1. What is a design critique?
  2. Why are design critiques important?
  3. What will you cover in the design critique session?
  4. What is the expected outcome?

Design critiques are not limited to designers only; you could extend the invite to other relevant stakeholders, such as engineers, product managers, copywriters, etc.

Key stakeholders

The key stakeholders in a design critique session are the presenter, the moderator, and the note master:

  • The presenter: This is the designer offering up their work to be critiqued. They present the problem they are trying to solve, and their proposed design solution. The presenter should come with a clear vision of what they want feedback on.
  • The moderator: This is usually (but not restricted to) a senior designer who is responsible for leading and driving the critique sessions. This role requires a deep understanding of UX methods, strategy, facilitation and negotiation, and is more about facilitation than dictation.
  • The note master: Since the presenter should focus on the feedback and the conversation going on during the critique session, you need the note master. Their main role is to take notes on behalf of the presenter.

Design critique is very much a collaborative effort and everyone should feel comfortable making sure that feedback is being given in the right way, and that the rules of engagement (see below) are followed. It’s worthwhile to note that you don’t want to leave it all up to the moderator to lead and facilitate. It’s up to everyone in the design critique session to call each other out when they aren’t giving the right feedback.

Find the right space

Find a day that works well for you and your colleagues. Make sure there are at least 5 (and not more than around 15) participants for a two-hour critique session. Get a venue that can accommodate all the participants comfortably. We run our critique sessions for two hours, and therefore we strive to have an ambience that allows participants to stay engaged with as few distractions as possible.

On the day of the design critique, remember to throw in some snacks and refreshments — you want the participants attentive at all times. This helps minimise movement in and out of the room unless necessary.

Rules of Engagement

We have guiding rules for our design critique sessions. We call them “rules of engagement.” The rules of engagement should be shared well in advance. This prepares the participants, helps them get into the right mindset and helps manage expectations. These rules are carefully curated to enhance the smooth running of the design critique sessions and are designed to ensure we get the best out of all our time.

Rules of engagement:

  1. Leave opinions at the door (we’ll explain this below).
  2. Prescriptive solutions or ideas are avenues of thought, not direction to be taken by the designer offering up their work to be critiqued.
  3. All phones on silent mode, no laptops allowed.
  4. Candour is both welcomed and encouraged.

Run the critique session

Before starting the session, everyone in the room should agree that they will not use their mobile phones or laptops during the critique session. They should commit to staying in the room for the entirety of the session. Listening, asking questions, and concentrating cannot be accomplished if one is constantly distracted by checking a phone or replying to email. Keeping phones away and laptops closed provides a space for discussion and ensures focus in the room.

The moderator then takes charge of leading and driving the session. They start by reiterating the “rules of engagement” and stressing the importance of staying within the rules. Throughout the session, the moderator will make sure that all the participants adhere to the rules — but always remember that the moderator is only there to help facilitate the session, not to dictate what comes out of it.

The presenter then takes over. The main task of the presenter at this point is to present their work while touching on these two main points:

What, who, and why

  • What problem is the design trying to solve?
  • Who is the target user?
  • Why is it important to solve this problem for the user?
  • Why is it important for the business to solve this problem?

Objective

This is a clear vision of what the presenter wants feedback on. Without a vision, everyone may be working from different assumptions, and the session may turn into more of a brainstorm than a design critique.

An example of a good objective: Obtain specific kinds of feedback from participants about a set of different design approaches for a feature/area of your website.

The participants are then allowed to run a critical exploration of the presented design by asking questions, while strictly adhering to rule number one of engagement — leave opinions at the door.

The intent of design critique sessions is discussion and exploration of the work presented. For this reason, we discourage opinions as they often close down discussion by limiting the presenter’s responses instead of opening them up.

Opinions include statements that start with “I don’t think…” or “I don’t like…” along with their counterparts, “I think…” , “I like…”, “I feel..”.

Here are some opinionated examples:

  • “I don’t think you need that icon next to the second item on the left menu” or “That arrow icon on the ‘inbox’ menu item does not look good. Use an envelope icon instead”.

On the other hand, we encourage open-ended questions to open up exploration of design. Open-ended questions provide the participants with an opportunity to gain insight on the design process and the thinking behind the design being presented.

With this in mind, instead of saying:

  • “I don’t think you need that icon next to the second item on the left menu”

one could rephrase this:

  • “Can I ask why the icon is placed on the second item on the left menu?”

And instead of:

  • “That arrow down icon on the ‘inbox’ menu item does not look good. Use an envelope icon instead”

one could rephrase:

  • “Have you thought about using an envelope icon instead of the arrow down on the ‘inbox’ menu item?”

During the critique session, it’s best to encourage standing up. Also, participants should not be afraid of getting up close and personal with the work being presented.

While the exploration goes on, the note master writes down all the points raised by the participants. They are allowed (where necessary) to ask for clarifications. Notes should be clear enough and easy to understand.

At the end of the critique session, the presenter is given the final set of notes and returns to work with their team on moving the design forward.

The note master’s board. Notes should be legible and straight to the point. Presenters may prefer to take a photo of the notes with them.

Lastly, while exploration can often include offering up ways to solve a design problem, it’s expected that all solutions will come from the presenter, since they are the one ultimately responsible for the work. Therefore, at the end of the critique session, all prescriptive solutions or ideas that lead to specific solutions will be taken only as avenues of thought, not direction to be taken by the designer offering up their work to be critiqued.


If you run your design critique according to these guidelines, you’ll probably find that you get through it quickly, efficiently, and everyone — the presenter especially — will feel satisfied about the outcome. Taking your critiques seriously means you’re taking design seriously. And that’s always a good thing, both for your team, and your product.

Troubleshooting: A journey into the unknown


Troubleshooting is a journey. It’s a long, unpredictable trek, one where you know the start and the end points but have zero knowledge about the actual path you need to take in order to get there, to find the root cause of the problem. In your backpack you have knowledge, past experience, and various troubleshooting techniques. For a systems engineer, the enjoyable task of finding the root cause of a problem often feels exactly like this journey into the unknown.

This particular journey relates to an issue we had on servers in our Distributed Load Balancing (DLB) service. The issue itself had nothing to do with load balancing, but the applicable knowledge gained was invaluable. The journey was worthwhile, as always.

The start

Here is the start point of our journey. What you see is that there is a sudden drop of incoming traffic to a single member of the DLB group. Other members instantly picked up the traffic in an equal manner thanks to the Equal-cost Multi-Path routing we have deployed.

Drop in Requests

The rate of incoming traffic dropped to zero within a sampling interval (we pull and store metrics every 10 seconds) and traffic recovered after ~50 seconds. In our systems there are five possible reasons for these kinds of traffic drops:

  1. Switches on the north and south sides stop selecting the server as the next hop for incoming traffic due to a configuration change
  2. The Bird Internet Routing Daemon stops running on DLB server
  3. anycast-healthchecker withdraws routes for all services as they fail their health check.
  4. The BFD protocol on switch side detects that the server isn’t sending hello messages and stops routing traffic to it
  5. Network cable or network card glitch

We examined the log of anycast-healthchecker and found out that all services were successfully responding to health checks. We then looked at the bird log and found the following:

08:35:42.981081+01:00 lb-101 bird: **BGP1: Received: Other configuration change**
08:35:42.981362+01:00 lb-101 bird: BGP1: BGP session closed
08:35:42.981474+01:00 lb-101 bird: BGP1: State changed to stop
08:35:42.991510+01:00 lb-101 bird: BGP1: Down
08:35:42.991775+01:00 lb-101 bird: bfd1: Session to 10.248.16.254 removed
08:35:42.991883+01:00 lb-101 bird: BGP1: State changed to down
08:35:42.991985+01:00 lb-101 bird: BGP1: Starting
08:35:42.992090+01:00 lb-101 bird: BGP1: State changed to start
08:35:42.992191+01:00 lb-101 bird: bfd1: Session to 10.248.16.254 added
08:35:42.992299+01:00 lb-101 bird: BGP1: Started
08:35:42.992399+01:00 lb-101 bird: BGP1: Connect delayed by 5 seconds
08:35:42.992502+01:00 lb-101 bird: BGP2: **Received: Other configuration change**
.......

All DLB servers are dual-home and they establish BGP peering with the switches on the north and south sides. According to RFC4486, the messages in bold indicate that Bird daemon received a BGP message to reset the BGP peering due to a configuration change on the switch side.

We looked at the Bird code and switch logs, and we found out that the switch asked to reset the BGP peering due to three consecutive missing BFD hello messages. Such messages are exchanged over UDP with an interval of 400 milliseconds and a tolerance of no more than three missed packets (after which the session is declared down).

The DLB server hadn’t sent BFD hello messages for a period of 1.2 seconds! The most interesting part of the above log is that the failure happened concurrently on both BGP peerings, which are established over two different network cards to different physical switches.

This made us believe that something on the host caused the loss of 3 consecutive BFD messages; it’s very unlikely to have hardware issues at the same time on two different network cards, cables, or switches.

Several occurrences of the issue

The exact same issue was happening on multiple servers at random times throughout the day. In all occurrences we saw the same lines in the bird log. So, we knew the end of our journey; we just needed to find what made the system fail to send three consecutive UDP packets at a 400 millisecond interval. We store logs in Elasticsearch, so we created a Kibana dashboard to visualise those errors and started investigating each occurrence.

Our servers are directly connected to the Internet, therefore we looked at possible short-duration attacks at the TCP layer. UDP traffic is not allowed, thus we excluded the possibility of an attack with UDP 80/443 traffic. We didn’t notice any sudden increase of incoming TCP, ICMP, or HTTP traffic before the occurrence of the issue.

We also looked at the haproxy log for possible SSL attacks, but we didn’t notice any unusual traffic pattern. So we knew that there was nothing external to the system that could explain the problem.

The first interesting find

The next stage of our journey was haproxy itself. We use collectd for collecting system statistics and haproxystats for haproxy statistics. Both tools help us gather a lot of performance metrics about haproxy and the system as well. Furthermore, haproxy emits log messages which contain very useful information that can help figure out what is going on in the server.

haproxy exposes CPU usage per process (we run 10 processes): we noticed a spike to 100% utilization around the same time Bird received the messages to reset the BGP peering. In the following graph we can see that all of the haproxy processes had 100% CPU utilization for at least 1 data point.

haproxy CPU usage

The sudden increase of CPU usage wasn’t always followed by BGP peering resets. In some cases, BFD issues were reported by Bird before those CPU spikes. Nevertheless, we continued to investigate the CPU spikes as they were very suspicious.

The CPU utilization of a process is the sum of User Level and System Level CPU usage. Thus, we needed to know if haproxy was spending all this CPU power for performing its tasks (SSL computation, business logic processing, etc.) or for asking the system to do something like dispatching data to various TCP sockets or handling incoming/outgoing connections. The two graphs below suggest that haproxy was spending CPU cycles at the system level.

haproxy CPU user level

haproxy CPU system level

This gave us a really good starting point for doing more in-depth analysis on what was causing those CPU spikes. We reviewed the haproxy configuration several times and there was nothing suspicious there. haproxy software hadn’t been upgraded recently, so we excluded a possible software regression which could have caused this behaviour.

We contacted HAProxy Technologies, Inc. for assistance. They asked us to collect more information about sessions and TCP connections, as there was a bug that could cause a high number of TCP connections in CLOSE-WAIT state – but, according to them, that specific bug couldn’t cause CPU spikes.

We also looked at memory utilization of haproxy and there wasn’t anything suspicious there either. But, in all the occurrences we saw a sudden increase of free memory. The system freed ~600MB of memory around the same time as we’d been seeing those CPU spikes.

Free memory

It wasn’t very clear to us if those two observations (CPU spikes and the sudden increase of free memory) were the cause or the symptom of our issue. Moreover, this sudden increase of free memory could be related to a garbage collector being invoked by some other service. So, more digging was required to clear up the fog in our path.

(s)Tracing the unknown factor

We run many daemons and cron jobs on our servers. In some occurrences of our problem we saw a puppet run happening at the same time. We decided to look at what was executed on every puppet run.

We set up some scripts that were running pidstat against puppet and a few other daemons. Since the issue was happening at random times across a lot of servers, we had to pick a few servers to run those scripts on and wait for the problem to appear.

After a few days of waiting, we had several traces to analyze. Puppet really loves CPU and it can easily lock a CPU for 4-5 seconds. But it wasn’t causing our problem. Other daemons were hungry for memory and CPU resources at a level that couldn’t explain the sudden increase of free memory.

The HAProxy support department suggested deploying a script which would run strace and dump sessions when haproxy CPU usage at the system level went beyond 30%. The script below was deployed on a single server and was manually invoked for all haproxy processes.

#! /bin/bash
#
# hapee_tracing.sh

KILL_FILE="/tmp/kill_hapee_tracing"
BASE_DIR="/var/log/pidstats/"
SOCKET_DIR="/run/lb_engine/"
PROCESS_NUMBER="$1"
PIDS=($(cat /run/hapee-lb.pid))
PID_INDEX=$(($PROCESS_NUMBER-1))
PID=${PIDS[${PID_INDEX}]}

mkdir -p "${BASE_DIR}"

while true; do
    if [ -f "${KILL_FILE}" ]; then
        exit 0
    fi
    timeout 60 pidstat -u -p "${PID}" 1 | stdbuf -i0 -o0 -e0 egrep 'hapee-lb' | stdbuf -i0 -o0 -e0 awk '{print $6}' | while read line
    do
        if [ -f "${KILL_FILE}" ]; then
            exit 0
        fi
        system_cpu=$(echo "${line}" | awk -F. '{print $1}')
        if [ "${system_cpu}" -gt 30 ]; then
            echo 'show sess all' | socat ${SOCKET_DIR}process-${PROCESS_NUMBER}.sock stdio > ${BASE_DIR}sessions_$(date +%F:%H:%M:%S)_${PROCESS_NUMBER}_${PID} &
            timeout 5 strace -ttvs200 -p "${PID}" -o ${BASE_DIR}strace_$(date +%F:%H:%M:%S)_${PROCESS_NUMBER}_${PID}
        fi
    done
    PIDS=($(cat /run/hapee-lb.pid))
    PID=${PIDS[${PID_INDEX}]}
done

We also deployed a script to dump the number of connections using the ss tool, and the sar tool was adjusted to capture CPU, memory, and network statistics. All those tools were gathering information every second. Since it only takes 1.2 seconds for a BFD session to be detected as down, we had to gather information at such a small interval.

While we were waiting for the problem to appear on the target machine, we decided to move the Bird daemon to a CPU core which wasn’t used by haproxy. The 10 haproxy processes are pinned to the last 10 CPUs of the system, so we pinned the Bird daemon to CPU 1 and assigned a nice level of -17 to it. We did that in order to make sure it had enough resources to process BFD messages while haproxy was spinning at 100% CPU utilization. We also changed the CPU priority of the puppet agent so it would utilise less CPU resources.

Light at the end of the tunnel

The issue appeared on the target server and our tracing tools ended up collecting 150MB of data (something very close to 10 million lines to read!). We analysed the pidstat, sar, ss, and strace outputs and made the following observations together with Willy Tarreau, the author of HAProxy and a Linux kernel developer (note that the high CPU utilization started at 12:41:21 and lasted until 12:41:30):

No change in incoming requests per second prior to the problem:

12:38:00 PM  active/s passive/s    iseg/s    oseg/s
12:41:18 PM   4403.00    836.00  48416.00  66313.00
12:41:19 PM   4115.00    819.00  48401.00  67910.00
12:41:20 PM   1417.00    786.00  43005.00  57608.00
12:41:21 PM   4225.00    824.00  35247.00  49883.00
12:41:22 PM   1198.00    814.00  21580.00  25604.00
12:41:23 PM   3446.00    768.00  24229.00  33893.00
12:41:24 PM   4269.00    773.00  30462.00  46604.00
12:41:25 PM   2259.00    821.00  24347.00  33772.00
12:41:26 PM    994.06    880.20  13207.92  15813.86
12:41:27 PM   4878.00    802.00  32787.00  50708.00
12:41:28 PM   2988.00    816.00  36008.00  53809.00
12:41:29 PM   3865.00    883.00  34822.00  53514.00

haproxy stopped for ~500 milliseconds in the middle of some operations, indicating that it was interrupted:

12:41:21.913124 read(932, "\25\3\3\0\32", 5) = 5
12:41:21.913140 read(932, "\0\0\0\0\0\0\0\1\244=\241\234Jw\316\370\330\246\276\220N\225\315\2333w", 26) = 26
12:41:21.913171 sendto(55, "<a href=\"/general.sv.html?label=gen173nr-1FCAEoggJCAlhYSDNiBW5vcmVmaAKIAQGYAS64
12:41:21.913199 sendto(55, ";form.append(input).append(check);}});}</script>\n<script src=\"https://r-e
12:41:22.413476 recvfrom(1143, "T[16],r.MB,r.MN,null),T[17],r.ME(T[18],r.MB,r.MN,null),T[12],r.ME(T[19
12:41:22.413512 sendto(55, "T[16], r.MB, r.MN, null), T[17], r.ME(T[18], r.MB, r.MN, null), T[12], r.ME(T[19], r
12:41:22.413539 connect(2665, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("10.198.156.38")}, 16)

High rate of page free / cache free activity, system was freeing 600MB of RAM per second for a period of ~6 seconds:

12:38:00 PM    frmpg/s   bufpg/s    campg/s
12:41:16 PM   -5423.00      0.00    -754.00
12:41:17 PM   -1784.00      1.00    1504.00
12:41:18 PM   -1868.00      1.00     337.00
12:41:19 PM   -1110.00      2.00     416.00
12:41:20 PM   16308.00    -27.00  -10383.00
12:41:21 PM   77274.00    -56.00  -71772.00
12:41:22 PM  154106.00   -147.00 -121659.00
12:41:23 PM  121624.00   -253.00  -93271.00
12:41:24 PM  109223.00   -238.00  -84747.00
12:41:25 PM  140841.00   -384.00 -116015.00
12:41:26 PM  142842.57   -573.27 -121333.66
12:41:27 PM   83102.00   -263.00  -59726.00
12:41:28 PM  118361.00  -1185.00  -80489.00
12:41:29 PM  168908.00   -558.00 -103072.00

The system had 92% of its memory already allocated prior to the issue and started freeing memory one second after the issue began:

12:38:00 PM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
12:41:13 PM   7417424  91456184     92.50    238480  60751344   9223012      9.23  34595324  28501704     72000
12:41:14 PM   7409968  91463640     92.51    238488  60752448   9223012      9.23  34595720  28501756     72956
12:41:15 PM   7382684  91490924     92.53    238488  60762264   9236604      9.24  34620324  28501852     82740
12:41:16 PM   7360992  91512616     92.56    238488  60759248   9255500      9.26  34638704  28501984     79724
12:41:17 PM   7353856  91519752     92.56    238492  60765264   9255500      9.26  34640664  28502168     85748
12:41:18 PM   7346384  91527224     92.57    238496  60766612   9252824      9.26  34642028  28502248     72148
12:41:19 PM   7341944  91531664     92.57    238504  60768276   9254356      9.26  34644128  28502328     73528
12:41:20 PM   7407176  91466432     92.51    238396  60726744   9252828      9.26  34593160  28509368     79052
12:41:21 PM   7716272  91157336     92.20    238172  60439656   9258516      9.27  34253456  28564120     80248
12:41:22 PM   8332696  90540912     91.57    237584  59953020   9269996      9.28  33694136  28646036     81072
12:41:23 PM   8819192  90054416     91.08    236572  59579936   9286824      9.29  33269068  28698896     84388
12:41:24 PM   9256084  89617524     90.64    235620  59240948   9290336      9.30  32890520  28744820     94256
12:41:25 PM   9819448  89054160     90.07    234084  58776888  10491136     10.50  32403472  28796428    100308
12:41:26 PM  10396532  88477076     89.49    231768  58286700   9364872      9.37  31895556  28838756    103444
12:41:27 PM  10728940  88144668     89.15    230716  58047796   9387052      9.39  31647348  28850604    121412
12:41:28 PM  11202384  87671224     88.67    225976  57725840   9358996      9.37  31286796  28871824    127252
12:41:29 PM  11878016  86995592     87.99    223744  57313552   9360844      9.37  30955468  28800072    138596
12:41:30 PM  12599780  86273828     87.26    221836  56961620   9350056      9.36  30688424  28713648    153488
12:41:31 PM  12638924  86234684     87.22    221840  56983328   9345268      9.35  30706596  28714976    175972
12:41:32 PM  12606460  86267148     87.25    221840  57016040   9350112      9.36  30737556  28721456    208684
12:41:33 PM  12591284  86282324     87.27    221840  57034704   9345904      9.35  30751396  28724348    217184

CPU saturation at system level while the system was freeing memory:

12:38:00 PM  CPU   %user   %nice %system %iowait  %steal   %idle
12:41:20 PM    0   11.22    0.00    8.16    0.00    0.00   80.61
12:41:20 PM    1   14.29    0.00    6.12    0.00    0.00   79.59
12:41:20 PM    2   18.00    0.00   11.00    0.00    0.00   71.00
12:41:20 PM    3   16.00    0.00   12.00    0.00    0.00   72.00
12:41:20 PM    4   30.61    0.00   15.31    0.00    0.00   54.08
12:41:20 PM    5   11.34    0.00    3.09    0.00    0.00   85.57
12:41:20 PM    6   28.57    0.00    9.18    0.00    0.00   62.24
12:41:20 PM    7   16.16    0.00   12.12    0.00    0.00   71.72
12:41:20 PM    8   20.83    0.00    6.25    0.00    0.00   72.92
12:41:20 PM    9   19.39    0.00    3.06    0.00    0.00   77.55
12:41:20 PM   10   14.29    0.00   12.24    0.00    0.00   73.47
12:41:20 PM   11   16.16    0.00    4.04    0.00    0.00   79.80
12:41:21 PM  all   13.29    0.00   41.88    0.00    0.00   44.83
12:41:21 PM    0    6.06    0.00   17.17    0.00    0.00   76.77
12:41:21 PM    1   14.14    0.00   22.22    0.00    0.00   63.64
12:41:21 PM    2   14.29    0.00   45.92    0.00    0.00   39.80
12:41:21 PM    3   18.18    0.00   46.46    0.00    0.00   35.35
12:41:21 PM    4   10.00    0.00   51.00    0.00    0.00   39.00
12:41:21 PM    5   14.14    0.00   46.46    0.00    0.00   39.39
12:41:21 PM    6   20.00    0.00   41.00    0.00    0.00   39.00
12:41:21 PM    7   15.31    0.00   38.78    0.00    0.00   45.92
12:41:21 PM    8   14.14    0.00   45.45    0.00    0.00   40.40
12:41:21 PM    9   11.00    0.00   47.00    0.00    0.00   42.00
12:41:21 PM   10   10.00    0.00   52.00    0.00    0.00   38.00
12:41:21 PM   11   11.11    0.00   50.51    0.00    0.00   38.38
12:41:22 PM  all    9.58    0.00   84.18    0.00    0.00    6.24
12:41:22 PM    0    4.08    0.00   70.41    1.02    0.00   24.49
12:41:22 PM    1    2.02    0.00   62.63    0.00    0.00   35.35
12:41:22 PM    2   16.00    0.00   77.00    0.00    0.00    7.00
12:41:22 PM    3   14.00    0.00   86.00    0.00    0.00    0.00
12:41:22 PM    4    7.00    0.00   93.00    0.00    0.00    0.00
12:41:22 PM    5    3.00    0.00   97.00    0.00    0.00    0.00
12:41:22 PM    6   12.00    0.00   85.00    0.00    0.00    3.00
12:41:22 PM    7   15.00    0.00   83.00    0.00    0.00    2.00
12:41:22 PM    8   13.86    0.00   84.16    0.00    0.00    1.98
12:41:22 PM    9    9.09    0.00   90.91    0.00    0.00    0.00
12:41:22 PM   10   10.00    0.00   90.00    0.00    0.00    0.00
12:41:22 PM   11    9.00    0.00   91.00    0.00    0.00    0.00
12:41:23 PM  all   17.73    0.00   75.75    0.00    0.00    6.52
12:41:23 PM    0   24.00    0.00   67.00    0.00    0.00    9.00
12:41:23 PM    1    5.05    0.00   55.56    0.00    0.00   39.39
12:41:23 PM    2   14.14    0.00   80.81    0.00    0.00    5.05
12:41:23 PM    3   26.73    0.00   73.27    0.00    0.00    0.00
12:41:23 PM    4   14.00    0.00   86.00    0.00    0.00    0.00
12:41:23 PM    5   24.00    0.00   75.00    0.00    0.00    1.00
12:41:23 PM    6   16.00    0.00   76.00    0.00    0.00    8.00
12:41:23 PM    7   13.27    0.00   79.59    0.00    0.00    7.14
12:41:23 PM    8   18.00    0.00   75.00    0.00    0.00    7.00
12:41:23 PM    9   19.61    0.00   79.41    0.00    0.00    0.98
12:41:23 PM   10   16.00    0.00   83.00    0.00    0.00    1.00
12:41:23 PM   11   21.00    0.00   78.00    0.00    0.00    1.00
12:41:24 PM  all   16.99    0.00   70.14    0.08    0.00   12.78
12:41:24 PM    0   11.34    0.00   49.48    0.00    0.00   39.18
12:41:24 PM    1   13.13    0.00   45.45    0.00    0.00   41.41
12:41:24 PM    2   19.00    0.00   66.00    0.00    0.00   15.00
12:41:24 PM    3   20.41    0.00   71.43    0.00    0.00    8.16
12:41:24 PM    4   19.00    0.00   79.00    0.00    0.00    2.00
12:41:24 PM    5   17.17    0.00   79.80    0.00    0.00    3.03
12:41:24 PM    6   21.21    0.00   67.68    0.00    0.00   11.11
12:41:24 PM    7   20.20    0.00   67.68    0.00    0.00   12.12
12:41:24 PM    8   19.39    0.00   63.27    0.00    0.00   17.35
12:41:24 PM    9    7.22    0.00   90.72    0.00    0.00    2.06
12:41:24 PM   10   14.14    0.00   83.84    0.00    0.00    2.02
12:41:24 PM   11   20.00    0.00   79.00    0.00    0.00    1.00
12:41:26 PM    9   21.78    0.00   78.22    0.00    0.00    0.00

Some low-rate activity for page swapping:

12:39:40 PM  pswpin/s pswpout/s
12:39:43 PM      0.00      0.00
12:39:44 PM      0.00     78.22
12:39:45 PM      0.00     75.00
12:39:46 PM      0.00      0.00
(...)
12:41:20 PM      0.00      9.00
12:41:21 PM      0.00     43.00
12:41:22 PM      0.00     84.00
12:41:23 PM      0.00     70.00
12:41:24 PM      0.00     53.00
12:41:25 PM      0.00     74.00

haproxy was writing data, which is odd: it shouldn’t be able to, since when the service starts it closes all file descriptors that could cause I/O operations to the filesystem:

12:41:21 PM   UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
12:41:22 PM   498     28849      0.00      0.00      0.00  hapee-lb
12:41:22 PM   498     28892      0.00   5964.00      0.00  hapee-lb
12:41:22 PM   498     28894      0.00      0.00      0.00  hapee-lb
12:41:22 PM   498     28895      0.00      0.00      0.00  hapee-lb
12:41:22 PM   498     28896      0.00      0.00      0.00  hapee-lb
12:41:22 PM   498     28897      0.00   6276.00      0.00  hapee-lb
12:41:22 PM   498     28899      0.00     20.00      0.00  hapee-lb
12:41:22 PM   498     28901      0.00      0.00      0.00  hapee-lb
12:41:22 PM   498     28902      0.00      0.00      0.00  hapee-lb
12:41:22 PM   498     28904      0.00      0.00      0.00  hapee-lb
12:41:22 PM   498     28905      0.00      0.00      0.00  hapee-lb

All the haproxy processes started to produce minor page faults: they touched free memory areas for the first time since those areas were last reclaimed:

12:41:20 PM   UID       PID  minflt/s  majflt/s     VSZ     RSS  %MEM  Command
12:41:21 PM     0      5206      0.00      0.00   46340    1824  0.00  hapee-lb-system
12:41:21 PM   498     28849      0.00      0.00  177540   85296  0.09  hapee-lb
12:41:21 PM   498     28892    102.00      0.00  213204  121164  0.12  hapee-lb
12:41:21 PM   498     28894    179.00      0.00  216592  124324  0.13  hapee-lb
12:41:21 PM   498     28895    116.00      0.00  213360  122676  0.12  hapee-lb
12:41:21 PM   498     28896    153.00      0.00  211840  122544  0.12  hapee-lb
12:41:21 PM   498     28897    106.00      0.00  210236  121816  0.12  hapee-lb
12:41:21 PM   498     28899     55.00      0.00  210196  118000  0.12  hapee-lb
12:41:21 PM   498     28901    140.00      0.00  212192  120288  0.12  hapee-lb
12:41:21 PM   498     28902    125.00      0.00  214616  123212  0.12  hapee-lb
12:41:21 PM   498     28904     81.00      0.00  215988  117196  0.12  hapee-lb
12:41:21 PM   498     28905    110.00      0.00  211984  112692  0.11  hapee-lb

Memory usage of the haproxy processes remained stable one second later, showing that they were just touching memory that had been aggressively reclaimed by the system:

12:41:21 PM   UID       PID  minflt/s  majflt/s     VSZ     RSS  %MEM  Command
12:41:22 PM     0      5206      0.00      0.00   46340    1824  0.00  hapee-lb-system
12:41:22 PM   498     28849      0.00      0.00  177540   85296  0.09  hapee-lb
12:41:22 PM   498     28892    284.00      0.00  213356  121660  0.12  hapee-lb
12:41:22 PM   498     28894    231.00      0.00  217144  124900  0.13  hapee-lb
12:41:22 PM   498     28895    231.00      0.00  213936  122992  0.12  hapee-lb
12:41:22 PM   498     28896      8.00      0.00  211840  122544  0.12  hapee-lb
12:41:22 PM   498     28897    223.00      0.00  210380  122132  0.12  hapee-lb
12:41:22 PM   498     28899    311.00      0.00  212492  118752  0.12  hapee-lb
12:41:22 PM   498     28901    223.00      0.00  212460  120640  0.12  hapee-lb
12:41:22 PM   498     28902    214.00      0.00  214616  123516  0.12  hapee-lb
12:41:22 PM   498     28904    219.00      0.00  215988  117404  0.12  hapee-lb
12:41:22 PM   498     28905      2.00      0.00  211984  112692  0.11  hapee-lb

Willy Tarreau also inspected the session information dumped from haproxy memory and didn’t find anything unusual. He finished his investigation with the following:

  1. virtual machines using memory ballooning to steal memory from the processes and assign it to other VMs. But from what I remember you don't run on VMs (which tends to be confirmed by the fact that %steal is always 0)
  2. batched log rotation and uploading. I used to see a case where logs were uploaded via an HTTP POST using curl which would read the entire file in memory before starting to send, that would completely flush the cache and force the machine to swap, resulting in random pauses between syscalls like above, and even packet losses due to shortage of TCP buffers.

Given the huge amount of cache thrashing we're seeing (600 MB/s), I tend to think we could be witnessing something like this. The fact that haproxy magically pauses between syscalls like this can be explained by the fact that it touches unmapped memory areas and that these ones take time to be allocated or worse, swapped in. And given that we're doing this from userspace without any syscall but consecutive to a page fault instead, it's accounted as user CPU time.

I also imagined that one process could be occasionally issuing an fsync() (after a log rotation, for example), paralyzing everything by forcing huge amounts of dirty blocks to the disks; that didn't seem to be the case, and there wasn't ever any %iowait in the sar reports, implying that we weren't facing a situation where a parasitic load was bogging us down in parallel.

Another point fueling the theory of memory shortage is sar's output (again), showing that memory was almost exhausted (92%, including the cache) and that it started getting better at the exact same second the incident happened.

To sum up: a memory shortage led to a sudden, high-rate freeing of memory which locked all CPUs for ~8 seconds. We now knew that the high CPU usage from haproxy was the symptom and not the cause.

Finding the memory eater(s)

What triggered our system to free memory at such high rate (600MB/s) and why was this so painful for our system? Why did the kernel use so much memory (~92%) for caches while active memory was always below ~8GB? There were many questions to answer, which brought us back to tracing mode.

Willy suggested issuing echo 1 > /proc/sys/vm/drop_caches upon log rotation, which we did on all servers. We also issued echo 2 > /proc/sys/vm/drop_caches once in two DLB groups. Both of these actions calmed the issue down, but only for a short while.
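For context, the different drop_caches values free different kinds of caches (this is documented kernel behaviour and needs root):

sync                               # flush dirty pages to disk first
echo 1 > /proc/sys/vm/drop_caches  # free the page cache
echo 2 > /proc/sys/vm/drop_caches  # free reclaimable slab objects (dentries and inodes)
echo 3 > /proc/sys/vm/drop_caches  # free both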

From the many processes running on our servers, we picked the 5 with the highest resident memory (RSZ) and started monitoring them very closely with pidstat. We also started monitoring memory activity, and noticed a high number of dentry objects in the cache:

Active / Total Objects (% used)        : 388963608 / 415847504 (93.5%)
Active / Total Slabs (% used)          : 19781213 / 19781213 (100.0%)
Active / Total Caches (% used)         : 69 / 101 (68.3%)
Active / Total Size (% used)           : 73098890.88K / 78163097.48K (93.5%)
Minimum / Average / Maximum Object : 0.01K / 0.19K / 15.88K


OBJS      ACTIVE     USE  OBJ SIZE  SLABS     OBJ/SLAB CACHE SIZE  NAME
414578178 387795876   0%  0.19K     19741818  21       78967272K   dentry
244998    244998    100%  0.10K     6282      39       25128K      buffer_head
160344    158020     98%  0.64K     6681      24       106896K     proc_inode_cache
158781    149401     94%  0.19K     7561      21       30244K      kmalloc-192
119744    94951      79%  0.06K     1871      64       7484K       kmalloc-64
59616     48444      81%  0.25K     1863      32       14904K      kmalloc-256

atop was also reporting ~100% of SLAB memory as reclaimable memory:

MEM | tot        94.3G |  free        9.2G | cache   5.4G | dirty  18.1M  | buff   15.1M | slab   75.7G |  slrec  75.5G | shmem   4.1G | shrss   0.0M  | shswp   0.0M |

The output of the tracing tools we had put in production didn’t provide many useful indicators about which process(es) could be causing that high memory consumption for caches. The haproxy log (which is rotated every hour) held ~3.5GB of data, and dropping page caches upon log rotation excluded rsyslogd from the investigation as well.

We started to read documentation about memory management and realized that our system might not be tuned correctly, considering that our servers have 96GB of total memory, only ~8GB of active memory, and the following memory settings in place:

  • vm.vfs_cache_pressure set at 100
  • vm.dirty_background_ratio set at 3
  • vm.dirty_ratio set at 10

So, the system had a lot of free memory to use for caches, which it did; but it wasn't aggressively reclaiming memory from those caches even when dentry objects occupied 80GB of the cache. In some cases that left the system with only around 800MB of free memory.

We changed vm.vfs_cache_pressure to 200 and freed reclaimable slab objects (which include dentries and inodes) by issuing echo 2 > /proc/sys/vm/drop_caches. After 2 days we started to see more free memory available (~7GB), and we then increased vm.vfs_cache_pressure to 1000. That made the system reclaim memory more aggressively, and the issue was almost entirely resolved.
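The tuning itself is a one-liner per setting; roughly what we applied looks like this (the sysctl.d file name below is just an example, and persisting the setting is left to configuration management):

sysctl -w vm.vfs_cache_pressure=1000   # reclaim dentries/inodes more aggressively (run as root)
echo 2 > /proc/sys/vm/drop_caches      # one-off: free reclaimable slab objects
# example path for persisting the setting across reboots:
echo 'vm.vfs_cache_pressure = 1000' > /etc/sysctl.d/99-vfs-cache-pressure.conf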

We continued our investigation in the area of dentry caches and found this bug report for the curl tool. The bug report states that when curl makes an HTTPS request, it issues many access system calls for files that don’t exist, have random names, and are unique per invocation:

me at node1 in ~
strace -fc -e trace=access curl 'https://foobar.booking.com/' > /dev/null
Process 2390 attached
% time         seconds  usecs/call         calls        errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00        0.003180               0          6647          6643 access
------ ----------- ----------- --------- --------- ----------------
100.00        0.003180                      6647          6643 total


me at node1 in ~
strace -f -e trace=access curl 'https://foobar.booking.com/' 2>&1 |head -10
(...)
access("/etc/pki/nssdb", W_OK)              = -1 EACCES (Permission denied)
access("/home/me/.pki/nssdb/.3219811409_dOeSnotExist_.db", F_OK) = -1 ENOENT (No such file or directory)
access("/home/me/.pki/nssdb/.3219811410_dOeSnotExist_.db", F_OK) = -1 ENOENT (No such file or directory)
access("/home/me/.pki/nssdb/.3219811411_dOeSnotExist_.db", F_OK) = -1 ENOENT (No such file or directory)
access("/home/me/.pki/nssdb/.3219811412_dOeSnotExist_.db", F_OK) = -1 ENOENT (No such file or directory)
access("/home/me/.pki/nssdb/.3219811413_dOeSnotExist_.db", F_OK) = -1 ENOENT (No such file or directory)

We knew that we use curl in the check_cmd of each service check in the anycast-healthchecker daemon, and that each check runs every 10 seconds for ~10 services. So, we fired up a one-liner to plot the number of dentry objects in the cache per second:

while (true); do
    echo "$graphite_name_space.$(hostname | sed -e 's/\./_/g').dentry $(sudo slabtop -o | egrep 'dentry' | awk '{print $1}') $(date '+%s')" | nc 127.0.0.1 3002
    sleep 0.9
done

In the following graph we can see that the number of dentry objects was increasing at a high and constant rate:

dentries

Bingo! We had found the tool that was polluting the dentry cache. Finally, we could see our destination; time to prepare the cake.

The fix was very easy – just setting the environment variable NSS_SDB_USE_CACHE to YES was enough:

me at node1 in ~
NSS_SDB_USE_CACHE=YES strace -fc -e trace=access curl 'https://foobar.booking.com/' > /dev/null
Process 14247 attached
% time         seconds  usecs/call         calls        errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00        0.000009               0            32            30 access
------ ----------- ----------- --------- --------- ----------------
100.00        0.000009                        32            30 total
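Applying the fix to a health check is then just a matter of setting the variable in the check command; a hypothetical example (the real check_cmd entries, curl options and URLs differ):

# before (sketch): every invocation pollutes the dentry cache
curl -fsS -o /dev/null https://foobar.booking.com/
# after (sketch): NSS caches its database lookups and the bogus access() calls disappear
env NSS_SDB_USE_CACHE=YES curl -fsS -o /dev/null https://foobar.booking.com/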

We adjusted the check_cmd of each service check in anycast-healthchecker that used the curl tool, and we also modified a cron job which was running curl many times against an HTTPS site. In the following graph, we can clearly see that the pollution stopped: the number of dentry objects in the cache is no longer increasing:

fix curl

Conclusions

  • Your sampling intervals for statistics may hide problems. In our case, we collect metrics every 10 seconds and we were (eventually) able to see the issue clearly. If you collect metrics every minute on systems that receive ~50K requests per second, you won’t be able to see problems that last less than a minute. In other words, you are flying blind. Choose the metrics to collect and pick the intervals very wisely.
  • Abnormal system behaviors must be investigated and the root cause must be found. This secures the stability of your system.
  • Reshuffling TCP connections when a single member disappears from and reappears in an ECMP group didn’t impact our traffic as badly as we initially thought it would.

I would like to thank Marcin Deranek, Carlo Rengo, Willy Tarreau and Ralf Ertzinger for their support in this journey.

An unconventional way of learning a new programming language


There are more than 500 programming languages, and it’s likely you want to start learning a new one today. You might know C++ and Java, but at work you use Python. Perhaps you’re well-versed in Python but need to pick up some Java for a new project, or maybe you want to learn that cool language to expand your expertise.

So what are some popular options if you want to learn a new programming language?

  • Learn from an online tutorial or
  • Learn from an online course

Some may argue that the best way of learning a new language is:

  • To learn syntax of the new programming language, then
  • Build a personal project using that language

Fair enough! Building a project makes sure that you apply the knowledge you gain from learning the syntax.

I have developed more than 20 small projects while learning different languages. When I write code for my personal project, weekend project or even a quick overnight hack, I write code to get something done. All I care about is — “Does my code work?”. I hardly care about the quality of the code while learning.

This is my story about how I took an unconventional way of learning a new language efficiently.


I work at Booking.com as a Graduate Software Developer. As part of my role, I switch teams every four months to get a good understanding of different aspects of the business.

In the office, we mainly work with Java and Perl. However, there are a lot of other languages being used for different purposes, so we use Python for data analysis, Go for micro-services, and so on.

Last year, I got an offer for a full-time job from Booking.com and I knew that I’d be working with Perl. As soon as I finished university, I started learning Perl to prepare myself for the job. I began by learning syntax of Perl and understanding some of the common patterns of the language.

Now, I really wanted to build something using Perl so that I could practice various concepts of the language. While I was looking for ideas, I came across DuckDuckGo’s open source organization on GitHub. I started working on beginner-friendly issues. While doing so, I always used to look at the existing modules. I used to notice the patterns being used in Perl. Slowly, I started picking up good practices in my own code and it helped me in learning how to write good code in Perl.

In December, I moved to a new team in Core Infrastructure, something I’ve always been curious about. One of the best things about the Booking.com graduate program is that you have flexibility in choosing your next team. I was really excited to learn from this opportunity.

Two weeks into my new team, I picked up a task of adding a new feature to one of the services written in Go. This was a conversation I had with my teammate:

Me: I really like this task. I want to work on it. What do you think?
Teammate: Yes, it’s an interesting one. However, it requires knowledge of Go. Do you know Go?
Me: No.
Teammate: Do you want to learn Go?
Me: Yes!
Teammate: *smiles* There you go!

So there I was, on the verge of learning a new programming language: Go!

I started learning Go’s syntax and found an awesome beginner’s language tour on their official website. It was good enough to get me acquainted with the basic concepts of the language.

That evening, out of curiosity, I started looking for open source projects in Go which had some “beginner” or “easy-fix” issues. I found a project — go-github by Google which is basically a Go wrapper for GitHub’s REST API.

I had my first pull request on that project two days after I started learning Go.

My contribution graph for 2016 (https://github.com/sahildua2305)

How Does Open Source Help?

So, now you may be wondering how open source contributions can help in learning the good practices of a language. Here are the lessons I learned along the way.

Code Quality

Most of the good open source projects have strict coding guidelines you have to adhere to in order to get your code merged. This helped me adapt to those guidelines and hence learn to write good quality code even though I was just learning the language. Not only that, I also had the chance to look at the rest of the code base and see how good code is written in that language.

Code Reviews

The best part about open source contributions is code reviewing. When you push your code, you get feedback from the experts associated with that project. It gives you a chance to improve your understanding of a language. It’s like getting free personal guidance about how to write good code!

Appreciation

First comment on my first pull request in Go

Everyone likes to be appreciated for their work, and the open source community is pretty generous with its praise. In my whole experience with open source contributions, I have never received a single comment that was insulting or demotivating. Everyone is super encouraging and helpful.

Another comment by MariagraziaAlastra from DuckDuckGo

Even though I was learning a new language to work on some task at work, I managed to enhance my expertise at the same time and contribute to the open source project. A perfect win-win-win situation!

So, next time you want to learn a new language, go ahead and take the plunge! Find an open source project to contribute to and march ahead on the path to learn that language and its nuances.

PyData Amsterdam 2017 at Booking.com: Deep Learning, Statistical Models and NLP


In April, Booking.com hosted PyData Amsterdam 2017. The Booking.com headquarters was filled with 330 Python developers and data scientists from all over Europe, who gathered for a weekend full of talks and discussions all about using and evolving Python for Data Science applications. The atmosphere was wonderful, with interesting presentations, people meeting others from the PyData community, sharing experiences, problems and solutions, discussing future developments, and everything in between. As the Dutch would say: gezellig!

Booking.com
PyData Amsterdam 2017

We had 32 talks at the conference covering a wide array of PyData-related subjects, from Deep Learning, to Data Visualization, to the Ethics of Machine Learning. Booking.com itself contributed three talks to the conference, which were similarly diverse: applying Deep Learning in production, how to diagnose statistical models, and on using NLP for song lyrics.

Deep Learning

Deep Learning is currently a hot topic, so it was no surprise that it was featured in almost a third of all the talks at this year’s PyData conference.

Representing Booking.com, Emrah Tasli and Stas Girkin dived into the complex problem of image understanding. Emrah showed how Booking.com’s unique corpus of millions of tagged photos enables us to train a deep convolutional neural network specialised to output image labels that are relevant to our exact problem. Stas took us through the technical details of scaling this to work for our millions of users daily, and how to test the direct benefits to our customers via A/B testing.

The range of Deep Learning topics covered at the PyData conference was very broad, and really gave us a sense of just how powerful this tool can be. For example, Mark-Jan Harte talked us through an application in the medical domain with his inspiring talk on “Training a TensorFlow model to detect lung nodules on CT scans”. Dafne van Kuppevelt covered a wide range of applications in her talk “Deep learning for time series made easy”, namely ecology/classifying bird activity, movement sensing/classifying human activity, and classifying epilepsy from EEG. As the title of the talk suggests, it was certainly refreshing to see a more beginner-friendly talk on the subject.

Diagnosing statistical models

Ever wonder why all your coefficients in your linear model turn up insignificant? Wonder no more! Lucas Bernardi shared some of his pragmatic Data Science tricks to diagnose statistical models in a clear-cut way. He elaborated on one of the possible reasons for the insignificance of coefficients: features that are not independent of each other (multicollinearity). As Lucas stressed, this problem should be tackled especially when the main goal is understanding and interpreting the model, rather than focusing on accurate predictions. He explained how to use a clustered correlation plot to find and deal with the multicollinearity of features in explanatory models.

The second topic Lucas covered was monitoring and diagnosing a classification model that is used in production. As an example, he chose the “Business vs Leisure” model used on the Booking.com website. In short: when a user does not indicate in the search box whether they are travelling for leisure or business, we still want to predict the probability that they are a business booker. In order to optimise the user experience, we might show different versions of our website depending on what this model predicts. The challenge in this live environment is that our data could be:

  • incomplete (not all the data ends up labelled, so there’s no way to evaluate all data against a ground truth);
  • delayed (the visitor might book only some time later);
  • dynamic (the label and feature space distributions change over time).

In this real world scenario, how can we monitor model performance and diagnose any trouble? Lucas advocated the use of “Response Distribution Analysis”, which means looking at the probability distribution of the model output over all of the presented examples. You could also call this the distribution of the predicted probabilities of being in the positive class. Ideally, we want this to be a bimodal distribution, and we use the “valley” between the peaks as the threshold value. To learn about the interpretation of more patterns in the response distribution, watch the recording of the talk.

NLP on heavy metal lyrics

In a full schedule of 2 days of talks from 9 to 6, we were happy to have a few lightweight, fun and far-out talks too! Jon Paton showed us how English looks to non-English people in his talk on character level Markov models, “Simulate your language. ish.” Another talk in this line was Rogier van der Geer’s “Risk Analysis”. Contrary to what any financial analysts in the audience might have hoped for, we learned from Rogier’s talk how to win the board game Risk using genetic algorithms. Even one of the keynotes had a fun edge; in his presentation “Python versus Orangutan”, Dirk Gorissen shared his experiences with using Python to train drones to find orangutans in the rainforests of Borneo.

For Booking.com, Iain Barr showed that our Data Scientists don’t just care about holiday travel. Iain explained how he applied NLP to the song lyrics of metal bands. We’ll never forget his definitions of “metal-ness” of a word:

metal-ness definition

It turns out that using this simple idea - of comparing each word’s frequency in metal lyrics to its frequency in normal English - gives a pretty good measure of what we’d intuitively mean by “metal-ness”. So: the most metal word in the English language is ‘burn’, closely followed by ‘cries’, ‘veins’, and ‘eternity’. Want to know the least metal words? If you’re particularly (hint) interested then you can relatively (hint) easily check out Iain’s talk:

More PyData

You can find recordings of all talks from PyData Amsterdam 2017 here. Overall, the PyData Amsterdam 2017 conference was a great success and a learning experience for us. We learned a lot about Data Science and Python, hosting and organising a conference - and we had a lot of fun too. Here’s to PyData Amsterdam 2018!

Further reading:

Automated image tagging at Booking.com


Abstract: Understanding the content of an image is a reasonably easy task - at least for us humans. For computers however, it’s a very different story. At Booking.com we have a huge collection (>150M) of hotel and user-generated images in our database. As images are a very important source of information while selecting a hotel, we would like to understand the content of the images to help our customers make better informed decisions. We can do this by serving the images in the right context for each individual user, helping create a personalised web site and a unique overall experience. But it’s no mean feat.

Hotel image

1. Introduction

Tagging images (image classification) has long attracted interest from the research community. Conventional techniques have focused on creating dense, hand-crafted, low-level (pixel-based) image descriptors and then making mid- or higher-level decisions with different classification techniques. The last couple of years have seen some impactful changes in the area: deep convolutional neural networks (CNNs) are now used effectively in many different image-related tasks and outperform the conventional techniques. Therefore, we’ve taken that approach to create an automated image tagging solution at Booking.com using deep convolutional neural networks. There are commercial solutions that can automatically analyse and generate tags for each image in our database. However, we have a unique corpus with special needs that require domain expertise to explore the value of specific image tags, so we needed to create our own solution with a long list of photo tags for our internal image classification.

2. Convolutional Neural Networks (CNN):

Neural networks have been widely studied, mostly as a supervised learning technique, where a network of artificial neurons is taught to adjust its weights according to given ground-truth data. The adjustment is enforced by showing the network true examples of the task and forcing the model to give accurate predictions for these known examples. Given enough examples, the network adjusts its weights in such a manner that it can generalise to new unlabelled samples and produce the required output with the expected accuracy.

Convolutional neural networks were introduced in the 1990s. Despite some early remarkable results in optical character recognition [1], wider adoption was not possible, mostly due to their compute-intensive requirements and their failure, at the time, to extend to other visual tasks. Thanks to recent improvements in computer hardware, and especially graphics processing units (GPUs), it’s now possible to use larger networks and analyse bigger datasets under realistic time constraints (days to weeks). This has brought drastic improvements on many vision tasks and still attracts attention from both academia and industry. For a detailed explanation of CNNs and how they work, this tutorial is worth watching.

2.1. CNN Architecture

There are various deep network architectures proposed in the literature, with different network depths, widths, and sizes of convolutional kernels. More complex architectures have brought additional performance improvements for vision tasks. In our work, we compared different architectures with respect to the performance-computation tradeoff. GoogleNet [2] and its successor Inception-v3 [3] have been the two main architectures we conducted our experiments on. After many training runs and hyperparameter tunings, Inception-v3 consistently showed a slight (~1-2%) improvement in top-5 accuracy at the cost of increased computational load (~4 times).

Transfer learning

2.2. Transfer Learning

Transfer learning is a machine learning concept where a supervised learning technique is trained on domain A, and this learning is then applied to a different (but similar) domain B. This is a common and very useful approach in cases where we have a good amount of labelled samples on domain A but not as many on domain B. In our context, the labelled data we had came from two different sources: the first was a third party company, while the other was hoteliers tagging their own photos. From both sources we were able to get more than 10M tagged images, with different levels of noise in the tags. Therefore, we decided to train a network from scratch and, at the same time, compare its performance with the transfer learning approach, where we used a network pretrained on a different dataset (IMAGENET) for a similar image classification task.

IMAGENET is a crowdsourced dataset, collected over many years from many different locations around the world [4]. There are over 10M images tagged with 1,000 image labels. Though the task is similar, the nature of the images is different from our hotel corpus. The images in the IMAGENET dataset mostly contain a centrally located object or an animal, and only a single tag per image is supplied. In contrast, our hotel images contain randomly located objects and can easily carry several tags per image (sea-view, bed, TV, balcony, seating area, etc.).

During transfer learning, the weights of the pre-trained network are used to initialize the network (as compared to random initialisation during scratch training) and further tuned with our internal images and with the new tags that we created.

2.3. Performance Evaluation

Performance evaluation is an important part of any machine learning task. For this task, we created a dataset which had not been seen by the network before. The ground truth labels were generated by our internal employees: the same photo was shown to two different people, and a label was accepted only if both agreed on the tag. The collected dataset was used to compare model performance. In total, we trained more than 50 models (through hyperparameter search), evaluating them using precision and recall metrics. What we expect from the image tags:

  • Be accurate - If we tag one image as A, it should be correct (Precision)
  • Bring traffic - If there is an image with tag A, we should not miss it (Recall)

Since we can increase one metric at the cost of harming the other, we selected a sweet spot per category, favoring either recall (more traffic for experiments) or precision (we certainly don’t want to confuse a bedroom with a swimming pool). We also provided a confidence score for every tag, which indicates how confident the network is in the predicted tag. This way, we can select different levels of confidence for an experiment.

Looking at the results per tag, we can see that the final model is very good on some classes but performs poorly on others. Classes like Floor plan, Bathroom, Bed, and Swimming pool are predicted very accurately, whereas some classes are confused with each other a lot. Some examples of the confused classes: Sea view -> Balcony/Terrace, Lake view -> Sea view, River view -> Sea view, Breakfast -> Lunch/Dinner.

Sea view

Even if it sounds trivial, it’s actually very tricky to assign a label to an image. Think about the garden tag; where should you start calling it garden? Is some grass on the ground enough? Or do you need some plants? If so, how many? We saw a lot of ambiguity (disagreement) in assigning some of the classes, and that was a hard challenge to address. The confidence scores supplied for each tag are supposed to help with those confusions and should be used in experiment hypothesis creation.

2.4. Framework

In order to make this computationally intensive task feasible, we used GPUs; though even with high-performing GPUs it still takes up to 12 hours to fine-tune a network. When we trained from scratch, it took around 48 hours for the training loss to converge. All the experiments were done with TensorFlow, and we used TensorBoard for visualizing loss and accuracy during training.

3. Challenges & Learnings

The main objective of this project was to create an automated image tagging pipeline with high accuracy, and to implement this solution at scale so that it covers all existing and newly arriving images. In order to improve the accuracy, we have done extensive experiments with some well-known techniques.

3.1. Data augmentation

Deep networks are known to be data-hungry: they benefit from more and more data, which means that we need a lot of labelled data for our training dataset. However, getting high volumes of high quality data is always costly and sometimes not even possible. Therefore, data augmentation is used to obtain a higher number of labelled samples. Data augmentation refers to slightly distorting an image and using it multiple times with the same label during the training process; this also lets us get the most value from a single image. Some examples of commonly used distortions are mirroring, random cropping, affine transformation, aspect ratio distortion, color manipulation, and contrast enhancement. In our data augmentation phase, we randomly selected one or more of the listed distortions to increase the labelled data by 10 times. These distortions are applied in a preprocessing pipeline, on the fly during training; no additional images are saved during the process.

3.2. Image label hierarchy

Labelling images is a hard task, but deciding which labels to collect in the first place is even harder. We approached this issue with two things in mind. First, the business value: is there a known or potential use case for detecting an object in an image? This is a hard question, and we discussed it with many experienced stakeholders. Second, we tried to pick labels that are realistic to detect with an accuracy reliable enough to be used on the website. While doing that, we also constructed a visual hierarchy of the image labels, which can eventually give us more confidence in the final labels’ precision. The motivation is to be able to confidently detect that an image belongs to the water view class even when we are not very sure whether it is a river view or a lake view. Similarly, if we are not sure about the exact type of a room (lobby or restaurant), we can at least say it is an interior image.

The literature on using an image label hierarchy features two different approaches. In the first, the hierarchy is imposed as a post-processing step during inference, by trading off maximal hierarchy against maximum accuracy to increase the information gain, as described in [5]. This technique has the advantage that it can be applied without retraining the model and is hence easy to implement. Alternatively, it is also possible to incorporate this information during model training, using a multinomial logistic loss and a novel temporal sparsity penalty, as described in [6]. This technique has the advantage of learning the hierarchy during training. Another technique, which imposes a graph structure to explore the flexible relations between image labels, is also worth checking [7].

Due to implementational simplicity, we’ve tried the first approach and observed an average of 1% improvement in top-1 accuracy for both GoogleNet and Inception v3.

3.3. Multi labelled data

The images in our dataset naturally belong to multiple classes. The room photo could easily belong to classes like interior, bedroom, bed, city-view, TV, etc. In order to account for this, we tried two main approaches.

The first approach is to enable soft labels for the images, where we use floating point rather than binary labels. This idea was introduced by Hinton [8] with the aim of propagating more information into the network. The motivation is to let the network know about the similarities between different labels: confusing a bird with a chicken, for example, is less wrong than confusing the bird with a car. We accounted for this by imposing a non-zero fixed prior probability for the different classes during training.

The second approach is to resample the labels using the co-occurrence statistics. This can be seen as the data augmentation step where the non-zero priors are obtained by observing the co-occurrences of the classes in the validation set.

Neither approach brought any additional increase in the final top-1 and top-5 accuracy. Our hypothesis is that the co-occurrence labels on the validation set were too noisy to be effective.

3.4. Class imbalance

Another challenge we faced during the project was highly imbalanced classes. The hotel images mostly contain bedroom, bathroom and lobby photos. Moreover, hoteliers like to post photos of the facilities they provide; sauna, fitness, kid facilities, table tennis and golf are some of the facilities a hotel can have. Naturally, the numbers of photos we have for bedrooms and for table tennis are highly out of proportion. This is a challenge for the network, since it picks up the prior distribution of the classes as a bias and hence cannot be “fair” to the different classes of photos. To account for that, we over-sampled the under-represented classes, which showed a ~0.5% increase in top-1 and top-5 accuracy.

3.5. Stochastic optimization

Gradient descent is a technique to minimize a cost function by updating the network parameters in the direction opposite to the gradient. There are different optimization techniques proposed in the literature, and most of them are already implemented in the major deep learning toolboxes. In order to understand and benefit from different optimization techniques, we experimented with momentum SGD [9], Adam [10] and RMSProp [11] optimizers. Given enough iterations, there was no significant difference in the final loss between the three techniques. As a side note, decreasing the learning rate every 10K iterations proved very valuable, yielding a 0.7% increase in the final top-1 accuracy.

4. Application Examples

4.1. Personalization with images

The whole aim of creating image tags is to be able to personalise the website and give our customers a unique experience throughout the funnel. We hypothesise that by serving the right image in the right context for each individual user, we can create this personalised experience, helping our customers make even more informed decisions. In order to understand and personalise the experience, we’re interested in understanding the intent of the customer. We can do this using customer behaviour: if filters are used (a hotel with breakfast included, or a hotel on the beach front, etc.), we can show the relevant photo for that search. As an extended goal, we are investigating ways to obtain the optimal photo ranking per user within the given context.

4.2. Food World Map

We encourage our guests to upload their holiday photos after their stay. This way we can help our customers with real images from the previous guests. In this hackathon project we accumulated the photos that were tagged as food, and pinned them on the map for a nice visualisation. This way, we can identify spatial point clusters and suggest areas where travellers can experience different types of cuisine during their stay.

Food Map

4.3. Unsupervised clustering of image embeddings:

Deep neural networks have proven to be successful for supervised classification tasks. The improved accuracies are only possible because different layers in the network hierarchy respond to specific levels of detail. Lower layers learn to respond to low-level characteristics like edges and corners, while higher layers learn to respond to object parts; finally, the highest layer responds when we show a sample image containing one of the classes in the training set. The final decision layer has been used in our project to assign tags to the images. But when we remove the final layer, we end up with the raw activations of the highest-layer neurons, and these activations are a very good representation (embedding) of an image. When we run unsupervised clustering on this high-dimensional embedding and project it onto a 2D surface, we get the kind of visualisation seen below.

t-SNE

Images with similar semantic representations are grouped together: floor plan photos, for example, or bathroom or food photos. This is achieved using the well-known t-SNE dimensionality reduction technique [11].

5. Conclusion

In this project we aimed to automate and scale the image tagging pipeline and make it consistent and accurate. The current system has been implemented in the Booking.com infrastructure, and all existing and newly arriving images are analysed. As a future project, we intend to gather feedback on the generated image tags and use it to retrain the network and improve its accuracy.

This is a joint work with Stas Girkin. Credits to Antal Guszlev and Fedor Kovalev for the food world map and t-SNE projections. Also, check our video at PyData Amsterdam 2017!

References

[1] LeCun, Yann, et al. "Handwritten digit recognition with a back-propagation network." Advances in neural information processing systems. 1990.

[2] Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.

[3] Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.

[4] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems. 2012.

[5] Jia Deng et al. Hedging Your Bets: Optimizing Accuracy-Specificity Trade-offs in Large Scale Visual Recognition, CVPR 2012

[6] Zhicheng Yan et al. HD-CNN: Hierarchical Deep Convolutional Neural Network for Large Scale Visual Recognition, CVPR 2014

[7] Deng, Jia, et al. "Large-scale object classification using label relation graphs." European Conference on Computer Vision. Springer, Cham, 2014.

[8] Geoffrey Hinton et al. Distilling the Knowledge in a Neural Network, NIPS 2014 workshop

[9] Ning Qian, On the momentum term in gradient descent learning algorithms, Neural Networks : The Official Journal of the International Neural Network Society, 1999

[10] Hinton, Geoffrey, Nitish Srivastava, and Kevin Swersky. "Lecture 6a: overview of mini-batch gradient descent." Coursera lecture slides, https://class.coursera.org/neuralnets-2012-001/lecture (2012).

[11] Maaten, Laurens van der, and Geoffrey Hinton. "Visualizing data using t-SNE." Journal of Machine Learning Research 9.Nov (2008): 2579-2605.

Avoiding phase transitions in server configuration


It’s a well-known trope that if your code doesn’t run, it probably doesn’t work. This is particularly true of error handling, error recovery and failover code, and has led to the development of infrastructure (like Netflix’s Simian Army) in which operators make sure that their systems are constantly in failure mode, ensuring that these code paths and mechanisms are being exercised continuously. In addition, at the human level, it forces a "downstream" developer to deal with failure scenarios resiliently and gracefully.

In this post, we’ll show how we apply similar reasoning to the slightly different domain of (web) server tuning at Booking.com. We run our web servers on bare metal and, consequently, we need to decide how many processes to run on a single machine, how many requests each worker serves and how to control its memory use. In our postmortem process (modelled in part on Etsy’s) it turned out that instability in this configuration caused a few outages, including a user-facing one, and we set out to solve this for good.

The quirks of our system

Let’s have a look at the properties that define our system. Our web application servers listen for external requests on an nginx process, which forwards its requests over a unix socket to a Plack server managed by uWSGI. At server startup, the latter process loads code and some shared data, and then uses fork(2) to spawn processes that actually render http responses. We call these the worker processes.

The applications are written in Perl. The perl interpreter uses reference counting for memory management, so the accumulation of unavoidable accidental reference cycles leads to real memory leaks. In addition, some of our developers use singletons or global variables as local in-memory caches, and these usually have poor expiry/eviction policies1. For these reasons, we have to anticipate the worker processes continuously growing in memory footprint.

We cannot allow them to grow without restrictions; eventually the linux OOM killer would forcefully terminate the worker, which would likely be in mid-request2. Not only would that give a bad experience for our user, we would also lose the monitoring data that we send at graceful worker exit.

For this reason, we have two thresholds in place: max_requests controls how many requests are being served by a worker before it gracefully exits. The threshold max_vmsize controls how much a worker’s memory footprint is allowed to grow before it exits. We consider a system to be operating normally if all workers exit gracefully due to the max_requests threshold; when workers start failing because of max_vmsize, we see that as an early warning sign for memory pressure.

Phase transitions

Water turns to ice exactly when the temperature changes from positive to negative (on the Celsius scale): a very small change in an input parameter changes the large-scale behaviour of the material. This is known as a phase transition. The system I described above has a similar property.

The memory usage of our worker processes usually grows rather slowly: even when their initial size is close to the threshold, they may still serve ample requests before exiting. The initial size is a bit more volatile, as it may depend on new code being deployed, new data in tables that are being pre-loaded, or a number of other reasons.

The outages happened because a small change brought initial memory size to suddenly exceed max_vmsize. This brought the system from serving hundreds of requests to only a single request per worker. Before the change, the request throughput was bounded by CPU and by database roundtrips. After the change, the request throughput was bounded by forking and copy-on-write.

This is an example of phase transition3 in server configuration: the system’s behaviour is unstable under a small change in memory use. Not all phase transitions cause outages, but in this case the system performed incredibly poorly under the new conditions, making that previous comparison to "from fluid to frozen" even more apt.4

Recognizing phase transitions in your code

Our first action was to add better monitoring and alerting to memory usage and worker exit reasons, hoping to see these trends before they become an issue.

However, I’m now convinced that the root cause of the outage was not that we missed the gradual increase in memory usage, but that our way of controlling memory has an inherent phase transition. And it’s very easy to recognize: whenever your code contains

if ($process_parameter >= $TUNING_PARAMETER) {
    <take a management decision>
}

and you expect it not to trigger in normal operation, that’s probably a phase transition. You’ll notice such a phase transition in my description above:

We consider a system to be operating normally if all workers exit gracefully due to the max_requests threshold; when workers start failing because of max_vmsize, we see that as an early warning sign for memory pressure.

At first look, this description seems like a fine way to run your servers; but if you know about phase transitions, the issue is as clear as day. By means of a contrasting example, a stable system might have half of the workers exiting for one reason, and half of them for the other5. Any change in memory conditions will change that balance, but the system would most likely still be running smoothly. Meanwhile, the change can be addressed at leisure, instead of addressing it in the stressful situation where our customers cannot book their holiday.

Avoiding phase transitions in practice

So far, we haven’t provided a solution. We only know how to recognise this potential failure mode: any threshold that kicks in new behaviour in an exceptional circumstance is suspect.

With this in mind, we went back to the drawing board. The solution we rolled out was based on the following three insights:

  1. Workers shouldn’t run too long: worker terminations should happen continuously over the course of a business day.
  2. Workers shouldn’t run too briefly: if too many workers need to be spawned, the forking overhead will dominate throughput.
  3. If a server is under memory pressure, we should rotate through our workers more quickly. If it isn’t, we may as well run them for longer to reduce the processing capacity spent(/wasted) on forking overhead.

We made ballpark estimates for (1) and (2):

  1. If a worker runs for half an hour, there will be plenty of terminations in a business day even on services that don’t have many workers.
  2. If we spawn fewer than 1 process/second, forking overhead won’t dominate a server’s performance.

We then use a server’s current memory pressure (3) to linearly interpolate between (1) and (2). This yields a target lifetime T for a worker. If we know the time t for a single request, then R = T/t is how many requests a worker should serve. After every request, a worker exits with probability 1/R. There may be some variability in request processing time, and so these probabilities may be quite different after different requests, but because of the probabilistic nature, the average behaviour comes out the same. The actual code is a bit more complicated (a full listing can be found at the end of this article).
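To make that concrete with made-up numbers: if the interpolation yields a target lifetime T of 900 seconds and a request takes t = 0.1 seconds, then R = T/t = 9000, so after each request the worker exits with probability 1/9000 and serves, on average, 9000 requests over roughly 15 minutes of busy time.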

This solution has two additional properties that make it trustworthy:

  • We are tuning the behaviour of the ensemble of worker processes, not the behaviour of any given individual process.
  • Parameters are tuned by what they are supposed to achieve at a high level, not by how they achieve that end in practice.

Indeed, max_requests and max_vmsize needed to be tuned for every server role separately, and they may change over time. The maximum fork rate and maximum average lifetime are tuned the same way for our entire infrastructure and are expected to essentially never change.

Discussion

We now have a solution that manages average worker behaviour much more smoothly than what we had before. However, we have made no attempt at defining or bounding the variability. This may be particularly important if we expect to be close to high memory use, or if we care for the effect of forking on high-percentile (i.e. long tail) latencies.

We have also made no attempt at tuning processing cores, RAM, and workers dedicated to a particular service, which is still a manual process for our major services. Our current service-deployment-as-a-service efforts will take a stab at that.

In a wholly different direction, we could (1) use a garbage collector to find reference cycles and (2) enforce proper eviction policies for in-memory caches. This has the potential for rendering the entire problem moot: workers would only need to terminate when new code is deployed. I got my hands dirty for (1) but (2) would likely take a prohibitive effort compared to what I described so far.

Acknowledgements

This code couldn’t have been deployed without the valuable insights from Andrei Vereha, Ivan Paponov and Quim Rovira. The readability of this post has been greatly improved by the efforts and expertise of Steven Baguley. In addition, Steffen Mueller and Tom van der Woerdt are two people whose judgment I trust over my own, and their ideas and support were invaluable.

Code listing

sub worker_should_exit {
    my $MAX_AVERAGE_LIFETIME_S = 60*30; # half an hour
    my $MAX_AVERAGE_FORKS_PS   = 1;     # one fork / second
    my $UWSGI_PROCESSES        = UWSGI_NUMBER_OF_CHILDREN;

    # percentage of memory used on the entire box
    my $memory_pressure;
    my $meminfo = Bookings::Tools::MemInfo::get_system_memory_info;
    if (defined $meminfo->{MemFree} && defined $meminfo->{Cached} && defined $meminfo->{Buffers}) {
        # NOTE: As soon as we are running CentOS 7 everywhere, we can replace the above by the more accurate
        #
        #     my $memory_pressure = 1 - $meminfo->{MemAvailable}/$meminfo->{MemTotal};
        #
        # See https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=34e431b0ae398fc54ea69ff85ec700722c9da773
        $memory_pressure = 1 - ($meminfo->{MemFree} + $meminfo->{Cached} + $meminfo->{Buffers}) / $meminfo->{MemTotal};
    }
    if (!defined $memory_pressure || $memory_pressure < 0 || $memory_pressure > 1) {
        # safe default when readings are untrustworthy. I expect this to
        # never happen.
        $memory_pressure = 0.5;
        warn "Worker cannot obtain proper memory information. This is very unexpected; please contact Timo Kluck or the www.sre team";
        grumble "weird_meminfo" => $meminfo;
    }
    # consider 90% memory use or above as "full memory pressure"
    my $capped_memory_pressure = min(1, $memory_pressure / 0.9);

    my $request_processing_time_s = max($wall / 1000, 0.001);
    my $idle_time_s = $CHILD_TIME_SPENT_IDLE_SINCE_LAST_REQUEST // 0;

    # since we are tuning how many worker exits are happening in any given time window,
    # we also need to count any time that we spend idle because other workers are handling
    # the requests. However, we should *not* count any time we spend idle because no
    # requests are coming in. We achieve this by capping the total recorded time by
    #
    #     <#workers> x <time for a single request>.
    #
    my $idle_time_capped_s = min($idle_time_s, $request_processing_time_s * ($UWSGI_PROCESSES - 1));
    my $delta_s = $request_processing_time_s + $idle_time_capped_s;

    # probabilities to terminate computed based on either extremity
    my $reqs_per_worker_capped_by_lifetime = $MAX_AVERAGE_LIFETIME_S / $delta_s;
    my $reqs_per_worker_capped_by_forkrate = $UWSGI_PROCESSES / ($delta_s * $MAX_AVERAGE_FORKS_PS);

    # the closer we are to max memory use, the faster we want to cycle
    # through our workers.
    #
    # Note: in some cases, we have one mis-behaving worker that is causing
    # the memory use. Conceivably, we could find it and kill it. That does
    # need some tuning: when, exactly, is a worker misbehaving?
    #
    # Instead, it's probably easier and safer to just rely on the worker
    # being recycled within (on average)
    # $UWSGI_PROCESSES / $MAX_AVERAGE_FORKS_PS seconds.
    my $reqs_per_worker = $capped_memory_pressure       * $reqs_per_worker_capped_by_forkrate
                        + (1 - $capped_memory_pressure) * $reqs_per_worker_capped_by_lifetime;

    my $exit = (rand($reqs_per_worker) < 1) ? 1 : 0;
    return $exit;
}

  1. I'm certainly guilty of this myself. 

  2. It might even forcefully terminate something else on the system. 

  3. I found a couple of references to phase transitions in this context. 

  4. Interestingly, this failure mode is arguably worse (except maybe for debuggability) than the OOM killer we are trying to avoid, because the latter would likely only affect a few workers at a time. 

  5. In fact, the actual solution we chose did away with different exit reasons altogether (as you’ll see in a bit).