At Booking.com, we're big fans of monitoring. With the usual battery of system-level monitors, we keep an eye on the health of our servers. As you would expect from any commercial internet company, we also keep track of various business metrics. However, we also collect data specifically for monitoring the performance of our applications and subsystems.
We're practitioners of frequent and liberal deployment (our twist on continuous deployment), and as such we need to quickly pinpoint any problem introduced by a specific code roll-out, or by the activation of a new feature -- either plain run-time application errors, or noticeable performance problems. But real-time monitoring is not sufficient; production problems can't always be traced back to a single root cause, and performance degradation can happen in many small steps.
In order to ensure that our systems are not drifting unnoticed into sluggishness, we use the data we collect in production to better understand how our applications behave in the real world, and to guide our tuning, profiling and optimisation efforts.
Measure performance in production
To better explain the kind of insight we expect from this data, let's take a simplified example: the time needed to search for an available hotel room, for a specific day, in a specific city. The complexity and the performance of this query will depend on many factors, such as: are there many hotels in that city? Is this query run for a date popular with tourists, for which hotels might offer special deals (for example for a weekend stay) that wouldn't apply to other date ranges? Is there a lot of demand right now, meaning that the data relevant to building the page is more likely to be in hot caches, or are we on the contrary retrieving information about a small, uninteresting village in the center of France that apparently no-one much wants to visit? Was the destination advertised on Twitter five minutes ago? Has the user already booked hotels in this destination? All of these elements affect how many resources are needed to build the page.
We log various metrics, such as the total wallclock and CPU times used to generate a page, or the number of SQL queries and memcached requests made in doing so. From that raw data, it's possible to construct different, complementary pictures of the system.
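As an illustration, here is a minimal Perl sketch of how such per-request metrics could be accumulated and written out as one log line per page view. The field names (page_type, epoch, sql_queries, memcached_calls and so on) are hypothetical, not our actual logging schema:

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);
use JSON::PP qw(encode_json);

# Per-request counters; all field names here are illustrative.
my %metrics = (
    page_type       => 'hotel',
    epoch           => time,
    sql_queries     => 0,
    memcached_calls => 0,
);

my $t0   = [gettimeofday];
my @cpu0 = times;

# ... handle the request, incrementing $metrics{sql_queries} and
# $metrics{memcached_calls} inside the database and cache wrappers ...

my @cpu1 = times;
$metrics{wallclock_s} = tv_interval($t0);
$metrics{cpu_s}       = ($cpu1[0] + $cpu1[1]) - ($cpu0[0] + $cpu0[1]);

# One JSON line per page view, ready to be collected and queried later.
print STDERR encode_json(\%metrics), "\n";
```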
Time series
A first, natural way of visualizing this data is to draw percentiles of those metrics as time series. Note that these are not real-time analytics: we are not in the context of monitoring a live system; we're looking at the trends of its evolution over time.
Why percentiles rather than averages? Because averages won't let you detect outliers: they collapse the behaviour of the vast majority of pages and of the pathological few into a single number that tells you little about either. This is very apparent in the graph below, which shows percentiles of the number of memcached calls, per hour, over four days, for the main hotel page:
Here we can see that a small percentage of hotel pages were making a disproportionate number of memcached calls. Turning a new feature on for all users and then off again allowed us to pinpoint the problem, which was subsequently fixed.
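To make this concrete, here is a small sketch that buckets the log lines from the earlier example by hour and prints a few percentiles of the memcached-call count for the main hotel page. The field names, the 'hotel' page_type value and the simple index-based percentile calculation are all illustrative rather than our actual reporting pipeline:

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);
use POSIX qw(strftime);

# Bucket the memcached-call counts of hotel page views by hour.
my %by_hour;
while (my $line = <STDIN>) {
    my $m = decode_json($line);
    next unless $m->{page_type} eq 'hotel';
    my $hour = strftime('%Y-%m-%d %H:00', localtime($m->{epoch}));
    push @{ $by_hour{$hour} }, $m->{memcached_calls};
}

# Print p50/p95/p99 for each hourly bucket.
for my $hour (sort keys %by_hour) {
    my @sorted = sort { $a <=> $b } @{ $by_hour{$hour} };
    my @row    = ($hour);
    for my $p (50, 95, 99) {
        push @row, $sorted[ int($p / 100 * $#sorted + 0.5) ];
    }
    print join("\t", @row), "\n";
}
```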
Volume profiles
Time series are nice, but there are other ways to visualize the global behaviour of a system, avoiding the decomposition over time altogether. We can instead look at the system as a whole, viewing its complicated interconnections and dependencies in terms of how heavily they are used. The idea for this came up while looking at a map of traffic densities in Calcutta in 1913. Such a map does not show the evolution of traffic over time, but presents a synoptic view of the roads most frequently travelled, or where congestion is most likely to happen.
In our context, congestion typically happens between application servers and database or memcached servers. We have several kinds of application servers, the main types being for the http web site (www), where visitors search for hotels, and the https variety, where actual bookings take place. Those servers in turn connect to various kinds of database servers. We have several database schemas, depending on the nature of the data that is stored. For example, we have a database for hotel information (which is fairly static), another for room availability (which is much more volatile), a third one for actual reservations, another one for user information, and so on. The napkin drawing below gives a rough view of this architecture. The actual architecture is quite a bit more complicated, but this picture is sufficient to describe the simplified model on which the following measurements are based.
For each page view, we gather the number of SQL queries done on each database schema. With this information we can visualize the volume of queries done by each type of page to each type of database, as a weighted dependency graph.
For that kind of graph, we used Circos, a tool originally developed for biology applications, notably in the field of genome sequencing and comparison. Circos is free software, written in Perl. A simple Hive query lets us retrieve the number of DB calls per schema and per type of page for a whole day, along with whether those queries were made on a master or on a slave. Pages are then grouped per subsystem. In the resulting graph below there are two subsystems, the bulk application servers (app for short on the graph) and the secure booking servers (book), corresponding respectively to our http and https sites.
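The aggregation behind such a graph is simple. The sketch below stands in for the Hive query: it assumes each log line also carries a hypothetical sql field mapping schema names (suffixed with ro or rw for a read-only slave or a read-write master) to query counts, sums those counts per page type over a day of logs, and prints a tab-separated page-by-schema matrix that can then be turned into a ribbon plot:

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Sum query counts per (page type, schema) over a day of per-request log lines.
my (%count, %schemas);
while (my $line = <STDIN>) {
    my $m   = decode_json($line);
    my $sql = $m->{sql} || {};   # e.g. { 'hotels.ro' => 12, 'reservations.rw' => 1 }
    for my $schema (keys %$sql) {
        $count{ $m->{page_type} }{$schema} += $sql->{$schema};
        $schemas{$schema} = 1;
    }
}

# Emit a page-by-schema matrix, one row per page type.
my @schemas = sort keys %schemas;
print join("\t", 'page', @schemas), "\n";
for my $page (sort keys %count) {
    print join("\t", $page, map { $count{$page}{$_} || 0 } @schemas), "\n";
}
```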
The half-circle on the right lists the 10 most visited pages per role (the outer ring being blue for app, red for book), and the half-circle on the left lists the databases called by those pages (with a suffix of ro or rw indicating whether we're querying a read-only slave or a read-write master).
These kinds of graphs help us figure out the relative importance of the different types of pages, and the load they put on the database servers. The relative widths of the ribbons reflect the total number of queries.
Many variations on this principle are possible. This graph shows the logical relations between functionalities and databases, but one could also group by physical hosts and subnets, to detect potential network-related effects; instead of weighting the ribbons by the number of queries, you could use total wallclock time. There is no one-size-fits-all solution here; as we gain insight, we adjust our data sampling and collection to pinpoint the places where the most gains can be (or have been) achieved.
We've used graphs like this specifically to help us determine where our most critical failure points are, and to mitigate them either through redundancy or by strategically shuffling some dependencies around. All of these graphs are built by mining data gathered from our production systems. As we continue to grow our infrastructure and our applications get larger, we will absolutely need to slice the data we gather in interesting and varied ways to make sure we don't miss any potential problems or optimizations.