It’s a well-known trope that if your code doesn’t run, it probably doesn’t work. This is particularly true of error handling, error recovery and failover code, and has led to the development of infrastructure (like Netflix’s Simian Army) in which operators make sure that their systems are constantly in failure mode, ensuring that these code paths and mechanisms are exercised continuously. In addition, at the human level, it forces "downstream" developers to deal with failure scenarios resiliently and gracefully.
In this post, we’ll show how we apply similar reasoning to the slightly different domain of (web) server tuning at Booking.com. We run our web servers on bare metal and, consequently, we need to decide how many worker processes to run on a single machine, how many requests each worker serves, and how to control its memory use. Our postmortem process (modelled in part on Etsy’s) revealed that instability in this configuration had caused a few outages, including a user-facing one, and we set out to solve this for good.
The quirks of our system
Let’s have a look at the properties that define our system. Our web application servers listen for external requests on an nginx process, which forwards its requests over a unix socket to a Plack server managed by uWSGI. At server startup, the latter process loads code and some shared data, and then uses fork(2) to spawn the processes that actually render HTTP responses. We call these the worker processes.
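To make that startup sequence concrete, here is a minimal sketch of the preload-then-fork pattern; this is not our actual code (uWSGI and Plack do this for us), and all names in it are illustrative:

# Minimal sketch (illustrative, not production code) of preload-then-fork:
# the master loads code and shared data once, then forks workers that share
# those memory pages copy-on-write.
use strict;
use warnings;

my $num_workers = 8;                                          # illustrative; uWSGI manages this for us
my $shared_data = { templates => '...', settings => '...' };  # hypothetical data loaded once in the master

my @pids;
for my $i (1 .. $num_workers) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        # child: a worker; it shares the preloaded pages copy-on-write and
        # would render responses until it decides to exit
        # serve_requests($shared_data);                       # hypothetical request loop
        exit 0;
    }
    push @pids, $pid;
}
waitpid($_, 0) for @pids;                                     # master: wait for workers to exit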
The applications are written in Perl. The perl interpreter uses reference counting for memory management, so the accumulation of unavoidable accidental reference cycles leads to real memory leaks. In addition, some of our developers use singletons or global variables as local in-memory caches, and these usually have poor expiry/eviction policies1. For these reasons, we have to anticipate the worker processes continuously growing in memory footprint.
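To make this concrete, here is a minimal illustration (not taken from our codebase) of such an accidental cycle; under reference counting, neither structure is ever freed unless the cycle is broken, for example with Scalar::Util::weaken:

# Minimal illustration of an accidental reference cycle under Perl's
# reference counting: $parent and $child keep each other alive forever.
use strict;
use warnings;
use Scalar::Util qw(weaken);

{
    my $parent = { name => 'parent' };
    my $child  = { name => 'child', parent => $parent };
    $parent->{child} = $child;        # cycle: parent -> child -> parent
    # weaken($child->{parent});       # breaking the cycle would allow both to be freed
}
# without weaken(), both hashrefs leak when they go out of scope above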
We cannot allow them to grow without restriction; eventually the Linux OOM killer would forcefully terminate the worker, most likely in mid-request2. Not only would that give our user a bad experience, we would also lose the monitoring data that we send at graceful worker exit.
For this reason, we have two thresholds in place: max_requests controls how many requests a worker serves before it gracefully exits, and max_vmsize controls how much a worker’s memory footprint is allowed to grow before it exits. We consider a system to be operating normally if all workers exit gracefully due to the max_requests threshold; when workers start failing because of max_vmsize, we see that as an early warning sign for memory pressure.
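For illustration, per-request checks of this kind typically look something like the sketch below; the tuning values and helper are hypothetical, not our production configuration, and the memory size is read from /proc on Linux:

# Illustrative only: per-request threshold checks of the kind described above.
use strict;
use warnings;

my $MAX_REQUESTS    = 500;         # hypothetical tuning value
my $MAX_VMSIZE_KB   = 2_000_000;   # hypothetical tuning value, ~2 GB
my $requests_served = 0;

sub current_vmsize_kb {
    open my $fh, '<', "/proc/$$/status" or return undef;
    while (my $line = <$fh>) {
        return $1 if $line =~ /^VmSize:\s+(\d+)\s+kB/;
    }
    return undef;
}

sub exit_reason_after_request {
    $requests_served++;
    return 'max_requests' if $requests_served >= $MAX_REQUESTS;            # normal operation
    my $vmsize = current_vmsize_kb();
    return 'max_vmsize' if defined $vmsize && $vmsize >= $MAX_VMSIZE_KB;   # early warning of memory pressure
    return;                                                                # keep serving
}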
Phase transitions
Water turns to ice exactly when the temperature changes from positive to negative (on the Celsius scale): a very small change in an input parameter changes the large-scale behaviour of the material. This is known as a phase transition. The system I described above has a similar property.
The memory usage of our worker processes usually grows rather slowly: even when their initial size is close to the threshold, they may still serve many requests before exiting. The initial size is a bit more volatile, as it may depend on new code being deployed, new data in tables that are being pre-loaded, or a number of other reasons.
The outages happened because a small change caused the initial memory size to suddenly exceed max_vmsize. This took the system from serving hundreds of requests per worker to serving only a single request per worker. Before the change, request throughput was bounded by CPU and by database roundtrips. After the change, request throughput was bounded by forking and copy-on-write.
This is an example of a phase transition3 in server configuration: the system’s behaviour is unstable under a small change in memory use. Not all phase transitions cause outages, but in this case the system performed incredibly poorly under the new conditions, making that earlier comparison to "from fluid to frozen" even more apt.4
Recognizing phase transitions in your code
Our first action was to add better monitoring and alerting for memory usage and worker exit reasons, hoping to spot such trends before they become an issue.
However, I’m now convinced that the root cause of the outage was not that we missed the gradual increase in memory usage, but that our way of controlling memory has an inherent phase transition. And it’s very easy to recognize: whenever your code contains
if ($process_parameter >= $TUNING_PARAMETER) { <take a management decision> }
and you expect it not to trigger in normal operation, that’s probably a phase transition. You’ll notice such a phase transition in my description above:
We consider a system to be operating normally if all workers exit gracefully due to the max_requests threshold; when workers start failing because of max_vmsize, we see that as an early warning sign for memory pressure.
At first glance, this description seems like a fine way to run your servers; but if you know about phase transitions, the issue is as clear as day. As a contrasting example, a stable system might have half of its workers exiting for one reason and half for the other5. Any change in memory conditions will shift that balance, but the system would most likely still run smoothly. Meanwhile, the change can be addressed at leisure, instead of in the stressful situation where our customers cannot book their holiday.
Avoiding phase transitions in practice
So far, we haven’t provided a solution. We only know how to recognise this potential failure mode: any threshold that triggers new behaviour only in exceptional circumstances is suspect.
With this in mind, we went back to the drawing board. The solution we rolled out was based on the following three insights:
- Workers shouldn’t run too long: worker terminations should happen continuously over the course of a business day.
- Workers shouldn’t run too briefly: if too many workers need to be spawned, the forking overhead will dominate throughput.
- If a server is under memory pressure, we should rotate through our workers more quickly. If it isn’t, we may as well run them for longer to reduce the processing capacity spent (wasted, really) on forking overhead.
We made ballpark estimates for (1) and (2):
- If a worker runs for half an hour, there will be plenty of terminations in a business day even on services that don’t have many workers.
- If we spawn fewer than 1 process/second, forking overhead won’t dominate a server’s performance.
We then use a server’s current memory pressure (3) to linearly interpolate between (1) and (2). This yields a target lifetime T for a worker. If we know the time t for a single request, then R = T/t is how many requests a worker should serve. After every request, a worker exits with probability 1/R. There may be some variability in request processing time, so these probabilities may be quite different after different requests, but because of the probabilistic nature, the average behaviour comes out the same. The actual code is a bit more complicated (a full listing can be found at the end of this article).
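In condensed form, and with simplified names (the full listing at the end of this article is the authoritative version), the computation looks roughly like this:

# Condensed sketch of the interpolation; $t is the (amortized) time per request
# in seconds, $memory_pressure is in [0, 1], and the other names loosely mirror
# the full listing at the end of the article.
my $reqs_capped_by_lifetime = $MAX_AVERAGE_LIFETIME_S / $t;                  # cap (1): don't live too long
my $reqs_capped_by_forkrate = $NUM_WORKERS / ($t * $MAX_AVERAGE_FORKS_PS);   # cap (2): don't fork too often
my $R = $memory_pressure       * $reqs_capped_by_forkrate                    # (3): more pressure, faster cycling
      + (1 - $memory_pressure) * $reqs_capped_by_lifetime;
my $should_exit = rand($R) < 1;                                              # exit with probability 1/R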
This solution has two additional properties that make it trustworthy:
- We are tuning the behaviour of the ensemble of worker processes, not the behaviour of any given individual process.
- Parameters are tuned by what they are supposed to achieve at a high level, not by how they achieve that end in practice.
Indeed, max_requests and max_vmsize needed to be tuned for every server role separately, and they may change over time. The maximum fork rate and maximum average lifetime are tuned the same way for our entire infrastructure and are not expected to change.
Discussion
We now have a solution that manages average worker behaviour much more smoothly than what we had before. However, we have made no attempt at defining or bounding the variability. This may be particularly important if we expect to operate close to the memory limit, or if we care about the effect of forking on high-percentile (i.e. long-tail) latencies.
We have also made no attempt at tuning the number of processing cores, the amount of RAM, and the number of workers dedicated to a particular service, which is still a manual process for our major services. Our current service-deployment-as-a-service efforts will take a stab at that.
In a wholly different direction, we could (1) use a garbage collector to find reference cycles and (2) enforce proper eviction policies for in-memory caches. This has the potential to render the entire problem moot: workers would only need to terminate when new code is deployed. I got my hands dirty with (1), but (2) would likely take a prohibitive effort compared to what I described so far.
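For (1), one readily available starting point (an illustration, not necessarily what we use in practice) is the CPAN module Devel::Cycle, which walks a data structure and reports any reference cycles it finds:

# Illustration only: Devel::Cycle reports reference cycles in a data
# structure, which is a starting point for breaking them (e.g. with
# Scalar::Util::weaken).
use strict;
use warnings;
use Devel::Cycle;

my $parent = { name => 'parent' };
$parent->{child} = { name => 'child', parent => $parent };   # cycle: parent -> child -> parent

find_cycle($parent);   # prints a description of the cycle to STDOUT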
Acknowledgements
This code couldn’t have been deployed without the valuable insights from Andrei Vereha, Ivan Paponov and Quim Rovira. The readability of this post has been greatly improved by the efforts and expertise of Steven Baguley. In addition, Steffen Mueller and Tom van der Woerdt are two people whose judgment I trust over my own, and their ideas and support were invaluable.
Code listing
sub worker_should_exit {
    my $MAX_AVERAGE_LIFETIME_S = 60 * 30;                 # half an hour
    my $MAX_AVERAGE_FORKS_PS   = 1;                       # one fork / second
    my $UWSGI_PROCESSES        = UWSGI_NUMBER_OF_CHILDREN;

    # percentage of memory used on the entire box
    my $memory_pressure;
    my $meminfo = Bookings::Tools::MemInfo::get_system_memory_info;
    if (defined $meminfo->{MemFree} && defined $meminfo->{Cached} && defined $meminfo->{Buffers}) {
        # NOTE: As soon as we are running CentOS 7 everywhere, we can replace the above by the more accurate
        #
        # my $memory_pressure = 1 - $meminfo->{MemAvailable}/$meminfo->{MemTotal};
        #
        # See https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=34e431b0ae398fc54ea69ff85ec700722c9da773
        $memory_pressure = 1 - ($meminfo->{MemFree} + $meminfo->{Cached} + $meminfo->{Buffers}) / $meminfo->{MemTotal};
    }
    if (!defined $memory_pressure || $memory_pressure < 0 || $memory_pressure > 1) {
        # safe default when readings are untrustworthy. I expect this to
        # never happen.
        $memory_pressure = 0.5;
        warn "Worker cannot obtain proper memory information. This is very unexpected; please contact Timo Kluck or the www.sre team";
        grumble "weird_meminfo" => $meminfo;
    }
    # consider 90% memory use or above as "full memory pressure"
    my $capped_memory_pressure = min(1, $memory_pressure / 0.9);

    my $request_processing_time_s = max($wall / 1000, 0.001);
    my $idle_time_s = $CHILD_TIME_SPENT_IDLE_SINCE_LAST_REQUEST // 0;
    # since we are tuning how many worker exits are happening in any given time window,
    # we also need to count any time that we spend idle because other workers are handling
    # the requests. However, we should *not* count any time we spend idle because no
    # requests are coming in. We achieve this by capping the total recorded time by
    #
    #     <#workers> x <time for a single request>.
    #
    my $idle_time_capped_s = min($idle_time_s, $request_processing_time_s * ($UWSGI_PROCESSES - 1));
    my $delta_s = $request_processing_time_s + $idle_time_capped_s;

    # probabilities to terminate computed based on either extremity
    my $reqs_per_worker_capped_by_lifetime = $MAX_AVERAGE_LIFETIME_S / $delta_s;
    my $reqs_per_worker_capped_by_forkrate = $UWSGI_PROCESSES / ($delta_s * $MAX_AVERAGE_FORKS_PS);

    # the closer we are to max memory use, the faster we want to cycle
    # through our workers.
    #
    # Note: in some cases, we have one mis-behaving worker that is causing
    # the memory use. Conceivably, we could find it and kill it. That does
    # need some tuning: when, exactly, is a worker misbehaving?
    #
    # Instead, it's probably easier and safer to just rely on the worker
    # being recycled within (on average)
    # $UWSGI_PROCESSES / $MAX_AVERAGE_FORKS_PS seconds.
    my $reqs_per_worker = $capped_memory_pressure * $reqs_per_worker_capped_by_forkrate
                        + (1 - $capped_memory_pressure) * $reqs_per_worker_capped_by_lifetime;

    my $exit = (rand($reqs_per_worker) < 1) ? 1 : 0;
    return $exit;
}
I'm certainly guilty of this myself. ↩
It might even forcefully terminate something else on the system. ↩
I found a couple of references to phase transitions in this context. ↩
Interestingly, this failure mode is arguably worse (except maybe for debuggability) than the OOM killer we are trying to avoid, because the latter would likely only affect a few workers at a time. ↩
In fact, the actual solution we chose did away with different exit reasons altogether (as you’ll see in a bit). ↩