Booking.com uses the Perl programming language heavily across its entire technical infrastructure. At the size of our infrastructure, even small performance improvements can translate into hefty savings, not to mention the exciting new features that we will implement with those spare CPU cycles. For this reason, we are very happy to announce that we are now funding Dave Mitchell, prominently known for his many years of high-quality contributions to the Perl language implementation, to improve the Perl runtime's performance one small step at a time.
The initial work started in February. Dave is working at the pace of his choosing and sharing some details on his progress on a monthly basis. The summary for February has already hit the development mailing list. Dave's summaries are necessarily concise as they're intended for experts. I'll expand on some of the changes here.
Dereferencing Overloaded Objects
Perl's support for overloading many common operations has ramifications
throughout much of the runtime. Part of the strategy to keep
the overhead of proliferating overload checks down is to encode
the fact if there is any overloading into a flag on any given value which
is inexpensive to check. In recent versions of Perl, bug fixes have exacerbated
the performance impact in certain situations.
In a nutshell, when a reference carries overloading for one operation
(such as stringification) but not for another (such as dereferencing),
the latter operation should not be slowed down significantly.
Dave's work has improved the dereferencing of
objects (like $obj->{foo}
) which have overloading. The other
operations no longer have such a dramatic performance penalty.
This means that we can have the more faithful overload support of recent Perls and have it be efficient, too.
Faster Array Access (Sometimes)
Perl comes with an optimization that makes array accesses faster if done with constant indexes. That means this
$foo[3]
is faster than this.
$foo[$bar]
This likely comes as a surprise to few. Alas, the optimization strikes a run-time/space
trade-off in that it only has 8 bits to store constant indexes in. If you think outside the
box just a little bit, you'll realize that this is not a huge limitation. Code
that does $array[5913]
is rare. Code like $array[0]
, that is small constant indexes,
is the overwhelmingly most common case of constant-index accesses.
Dave's change is as simple as it is effective. Previously, indexes in the range between
0 and 255 were optimized. Dave changed this to be between -128 and 127. This is
a net win because again, $array[-1]
is common to get the last element, whereas
$array[200]
is very rarely seen with a literal constant.
Avoid Doing Meaningless Work in the Runtime
The output of the Perl compiler is an in-memory structure called
the OP-tree. It mostly resembles the syntax of the Perl code. In fact it does
so much that we can have tools like B::Deparse
that recreate something very close to the original source.
During a late compilation phase, Perl builds a separate linked list
of these OP structures that represents the order in which they will be
executed. The optimizer can determine that some of the OPs in this
so-called execution chain actually serve no purpose. Depending on their
type and location, it may choose to either patch them out of the chain
completely (great!) or to replace them with no-ops (OPs of type OP_NULL
). The latter happens
when the optimizer is not smart enough to figure out all entry points
that could lead to executing the soon-to-be-ex-OP.
Dave's work in this area means there are no more OPs of type OP_NULL
that
are ever executed. That is, any optimization now always
manages to apply the more thorough elimination of needless OPs.
This change still resides on a branch for testing.
Less Overhead on Common Hash Operations
In a nutshell, this work is going to make common initial
operations on hashes more efficient. These are of the "enumerate all entries" variety:
keys %hash
, values %hash
and @array = %hash
.
My colleague Yves Orton contributed a nice explanation of the work
that Dave has put into this. Dave and Yves have been cooperating on
this quite extensively:
Dave's keys()
optimization is much deeper than meets the eye.
Understanding what it does requires a little insight into how Perl's
hash data structures are organized internally and a bit of their
history.
Historically a hash was a fairly complex structure, which
included preallocated storage for iterator state. This meant they
were large. At some point it was realized that many hashes are
never iterated. This meant that we could lazily allocate the iterator
structure. The way we chose to do this was to embed it at the end
of the bucket array, forcing us to realloc()
the bucket array on
"first iteration". This has the advantage that hashes which are not
traversed are smaller in memory. The trade off is that the realloc()
can be slow.
Dave's patch takes advantage of the observation that
there is no need for per-hash iterator state when running keys()
and similar operations.
We traverse the whole hash in one go, so the iterator need not be
stored against the hash, and instead a preallocated structure can
be used by keys()
for its purposes.
Until you consider hash order randomization, things are really that
simple.
But it so happens that keys()
and each()
must produce the same
output order. The per-hash
randomization data is stored in the iterator structure, so there is
no way for the optimized keys()
to guarantee the same order as a subsequent
each()
would (vice versa is not a problem as each()
must create
the iterator struct). The solution is to figure out how to store the
randomization data in the base structure.
Unfortunately, this isn't so easy. All members of the existing structure appear fully used, so we need to figure out how to squeeze more bits of data in without expanding space usage needlessly. Multiple options are still on the table. The structure currently looks like this:
structxpvhv{HV*xmg_stash;/* class package */union_xmguxmg_u;STRLENxhv_keys;/* total keys, including placeholders */STRLENxhv_max;/* subscript of last element of xhv_array */};
At the most abstract what we want to do is add another STRLEN
(U32
) member called
xhv_rand
to the structure, but we want to do it without making the structure larger.
Now it turns out that xhv_max
contains the size of the bucket array, and that
is always a power of two. That in turn means it uses 32 bits to store 5 bits
worth of information (for performance). So at the cost of some arithmetic we can
squeeze out 27 valuable bits of space. xhv_keys
contains the total number of keys,
which in the current implementation will always be smaller than xhv_max
, so
hypothetically we could somehow store both in the same field. There is also the
observation that the random data need only be as long (in terms of bits) as
xhv_max
is. This means that if we
could ensure that any hash with more than K
buckets must have a preallocated
iterator structure, then for smaller hashes we can store the max (or keys) and random
data in the same struct member.
An entirely different solution would be to use a hash of the address of the data structure as
the initializer to xhv_rand()
, this however has the potential to reduce security.
What was worked on was the logic to guarantee that hashes of a certain size have a preallocated
iterator structure, plus, the logic to pack the keys and randomization data into the xhv_keys
structure
when the number of buckets is smaller than 2**16
. This also included an implementation of hashing
the data structure address as the initializer.
The changes discussed here will not land in the main development branch (blead in perl parlance) before the Perl 5.20 release.
More Compact Code for Leaving Scopes
When leaving a lexical scope, perl has to do a fair amount of work such
as reference count related book keeping on things that have gone out of
scope. This is a very hot code path (the leave_scope()
function in scope.c
).
Dave tweaked part of this function and it now has smaller object code (97 bytes less).
The new version skips a lot of tests for "simple" lexical variables going
out of scope.
Micro-Optimizations for Perl Function Calls
Dave worked on micro-optimizing the implementation of the entersub
OP,
that is the OP that executes the majority of logic related to Perl
function calls. The common code paths of this hot function have
been tweaked, and branch prediction hints added where appropriate. The
changes reduce the object size of the function by 11%.
Summary
Most of the above changes all small tweaks, but an excellent start into the project. I'm incredibly happy that Dave is willing to work on Perl for us and to everyone's benefit.