Channel: Booking.com dev blog

Sereal - a binary data serialization format

We introduce Sereal, a new binary data serialization format that provides high-performance, schema-less serialization for Perl and other languages.

As with many things in computing, serialization of data structures is a game of trade-offs. There is an almost infinite number of desirable properties, but many of them are mutually exclusive. For this reason, at Booking.com, we have been using many different serialization technologies in different layers of our infrastructure. For improved interoperability and performance, there has been a push to migrate to one and only one technology that matches all our needs. After extensive review, we found that none of the existing serialization protocols and implementations struck the right trade-offs for our use cases, so we had to design our own. We pondered many questions to determine which characteristics are a necessity, which are important, and which are merely nice to have.

Decisions...

Do you want encoding performance? Decoding performance? Small encoding/decoding memory footprints? Compact output size? Take your pick, because a compact encoding or even compressed output is bound to impact at least your encoder performance negatively.

Proper streaming (e.g. being able to read and resume reading partial documents)? Say good-bye to your decoder being able to seek within the data of one full serialized document.

A powerful protocol that serializes all of Perl's obscure data structures (aliases, weak references, globs, you name it)? Perfect round-tripping of all data types? A protocol that lends itself well to cross-language encoding and decoding? The cozier you become with Perl internals, the more awkward it becomes to implement and use the format in other languages. The more abstract you keep it, the more twisted the logic to map to camel guts.

A decoder that is robust to random garbage or maliciously crafted documents? Solved the halting problem recently? A protocol and encoder/decoder implementations that are backward-compatible? Maybe even forward-compatible? Good luck evolving your protocol in a forward-compatible manner. It'd be a first.

A monolithic piece of software that implements both en- and decoding, or rather multiple separate pieces of software that can be distributed as stand-alone tools to avoid flag-day upgrades? A protocol validator that has constant memory overhead?

Many choices leave enough space for many different serialization protocols and implementations to coexist without too harmful a competition. We just really wanted less fragmentation in-house.

As stated above, we reviewed the subset of characteristics that we absolutely required from a serialization tool, disregarding most of the nice-to-haves, and found no format or implementation that came sufficiently close. Therefore, we reviewed prior art, such as Perl's Storable, MessagePack, JSON, BSON, Google's ProtocolBuffers, Apache Thrift, and even the output of Data::Dumper. Armed with that knowledge, we designed a serialization protocol that offers all the features we need and then some. During the design of the specification, and the implementation of the reference encoder and decoder, we aimed for high performance by keeping memory allocation counts and cache misses to a minimum. We are quite proud that the result is one of the most powerful, compact, and performant serialization methods available in Perl and, quite possibly, anywhere else.

The protocol was designed to encompass all of the most important concepts of Perl 5, but we tried to keep cross-language compatibility in mind. In fact, there is a prototype of a Java en- and decoder that we intend to beef up for use on our Hadoop/Hive cluster. We would be delighted to see third-party implementations and are most willing to cooperate on and support such efforts.

Here are some of the objectives in designing the Sereal protocol and reference implementation:

  • References

We wanted to be able to serialize shared references properly. Many serialization formats do not support this out of the box.

  • Weak References

Because Perl uses reference counting for garbage collection, it has a special type of reference called a "weakref", which is used to create cyclic reference structures that do not leak memory. Unlike most of the existing solutions, we need to handle these structures correctly, so that a perfectly valid data structure is not converted into one that will cause a memory leak on a remote system. For cross-language compatibility, weak references can very easily be ignored by other decoder implementations.

  • Aliases

Perl supports aliasing - reusing the exact same basic blocks of any data structure. In Perl, these aliases are a special kind of reference which is effectively a C level pointer instead of a Perl language-level reference. We needed to be able to represent these as well.

  • Objects

Promoting a plain data structure reference to an object, as is customary in dynamic languages, can be dangerous in some circumstances. We needed to be able to serialize objects safely and reliably, and we wanted a sane control mechanism for doing so.

  • Regular Expression Objects

In Perl, a regular expression is a native type. We wanted to be able to serialize these at a native level without losing data such as modifiers.

  • Space Efficiencies

We want to be able to represent common structures as compactly as is reasonable, although not to the extreme that this makes the protocol error-prone and ludicrously difficult to implement. The steps taken include automatically removing redundancy (such as repeated hash keys or class names) from the serialized structure. The protocol supports this kind of redundancy removal, but an encoder implementation can choose the extent to which it makes use of the technique.

  • Speed Efficiencies

We want to be able to serialize and deserialize quickly. Some of the design decisions and trade-offs were aimed squarely at performance.

  • Separate Decoder and Encoder

We wanted to separate the functions of serializing from deserializing so they could be upgraded independently.

  • Forward/Backward Compatibility

We wanted the protocol to be robust to forward and backward compatibility issues. It should be possible to partially read new formats with an old decoder, and output old formats with a new encoder.

  • Robustness

The Sereal decoder should never produce catastrophic failure (segmentation faults) when faced with invalid input data. Alas, that's a lot harder than it may sound.

  • Language Agnosticism

We want the format to be usable by other languages, especially dynamic languages. To make this easier, we have structured our repository so that implementations in other languages can be easily added, and we would welcome any contributions along these lines.

  • Openness

Less of a protocol-design concern, but we really wanted Sereal to be open for others to use and extend. It is an excellent example of mutual benefits. Despite the recency of the release, third parties have submitted requests for improvements and some CPAN authors have already begun to add Sereal support to their projects.
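To make the shared-reference objective from the list above more concrete, here is a minimal sketch. It uses Python's standard `json` and `pickle` modules purely as stand-ins (neither is Sereal, and the data is hypothetical) to show the difference between a format that flattens shared references and one that preserves them, including a self-referential structure.

```python
import json
import pickle

# Hypothetical data: two keys refer to the very same list object.
shared = ["breakfast", "wifi"]
document = {"room_a": shared, "room_b": shared}

via_json = json.loads(json.dumps(document))
via_pickle = pickle.loads(pickle.dumps(document))

# JSON writes the list out twice, so the decoded copies are independent.
print(via_json["room_a"] is via_json["room_b"])      # False
# pickle records a back-reference, so the sharing survives the round trip.
print(via_pickle["room_a"] is via_pickle["room_b"])  # True

# A cyclic structure: the list contains itself.
cycle = []
cycle.append(cycle)

try:
    json.dumps(cycle)
    json_rejects_cycle = False
except ValueError:
    # json refuses circular references outright.
    json_rejects_cycle = True

restored = pickle.loads(pickle.dumps(cycle))
print(json_rejects_cycle)       # True
print(restored[0] is restored)  # True
```

A format without shared-reference support either duplicates the data, as JSON does here, or cannot encode cyclic structures at all. That is the behaviour the References and Weak References objectives are meant to avoid.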

Sereal Performance

One of the design goals of Sereal is high run-time performance and small output size. Our reference implementation achieves both. Here is how it stacks up against some other popular serialization tools and formats in Perl:

Sereal::Encoder performance comparison

In this comparison, we have Perl's Data::Dumper (dd), Data::Dumper::Limited (an optimized JSON-like subset of Data::Dumper), the venerable Storable module, the fastest JSON library for Perl (JSON::XS), Data::MessagePack, Sereal in functional and OO interfaces, and finally Sereal with Snappy compression. The benchmark data is a rather large, nested data structure without any common sub-trees and without objects, but with repeated hash keys. Not an ideal case for Sereal. Nonetheless, the Sereal encoder beats all competitors by a sizable margin. The output size compares as follows:

Output size comparison

The size of the serialized output also compares favourably for Sereal. On this benchmark, it is marginally more compact than MessagePack - an achievement in itself. In data structures with common sub-trees, Sereal's ability to recognize and encode these would make it pull ahead even more.
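The effect of repeated hash keys on output size can be illustrated with a toy encoder. The sketch below is not Sereal's wire format: the one-byte tags, the fixed-width integers, and the `encode`/`decode` names are all assumptions made for illustration. It only shows the general technique of replacing a repeated string with a back-reference to its first occurrence.

```python
import struct

def encode(value, dedupe=True):
    """Toy encoder (NOT Sereal's format): tagged, length-prefixed values."""
    out = bytearray()
    seen = {}  # string -> offset of its first full occurrence in `out`

    def emit(v):
        if isinstance(v, dict):
            out.extend(b"H" + struct.pack("<I", len(v)))
            for key, item in v.items():
                emit(key)
                emit(item)
        elif isinstance(v, list):
            out.extend(b"A" + struct.pack("<I", len(v)))
            for item in v:
                emit(item)
        elif isinstance(v, str):
            if dedupe and v in seen:
                # Repeated string: emit a back-reference instead of the bytes.
                out.extend(b"C" + struct.pack("<I", seen[v]))
            else:
                seen.setdefault(v, len(out))
                raw = v.encode("utf-8")
                out.extend(b"S" + struct.pack("<I", len(raw)) + raw)
        elif isinstance(v, int):
            out.extend(b"I" + struct.pack("<q", v))
        else:
            raise TypeError(f"unsupported type: {type(v).__name__}")

    emit(value)
    return bytes(out)

def decode(buf):
    """Toy decoder for the format above; rejects truncated or bad input."""
    def take(fmt, pos):
        end = pos + struct.calcsize(fmt)
        if end > len(buf):
            raise ValueError("truncated input")
        return struct.unpack_from(fmt, buf, pos)[0], end

    def parse(pos):
        tag, pos = take("<c", pos)
        if tag == b"H":
            count, pos = take("<I", pos)
            result = {}
            for _ in range(count):
                key, pos = parse(pos)
                result[key], pos = parse(pos)
            return result, pos
        if tag == b"A":
            count, pos = take("<I", pos)
            items = []
            for _ in range(count):
                item, pos = parse(pos)
                items.append(item)
            return items, pos
        if tag == b"S":
            size, pos = take("<I", pos)
            if pos + size > len(buf):
                raise ValueError("truncated input")
            return buf[pos:pos + size].decode("utf-8"), pos + size
        if tag == b"C":
            target, pos = take("<I", pos)
            # A back-reference must point at a full string record.
            if buf[target:target + 1] != b"S":
                raise ValueError("bad back-reference")
            text, _ = parse(target)
            return text, pos
        if tag == b"I":
            return take("<q", pos)
        raise ValueError(f"unknown tag {tag!r}")

    value, _ = parse(0)
    return value

# Hypothetical records with the same hash keys repeated in every entry.
records = [{"hotel_id": i, "city": "Amsterdam", "country": "NL"} for i in range(100)]
plain = encode(records, dedupe=False)
compact = encode(records, dedupe=True)
print(len(plain), len(compact))
```

Both outputs decode back to the same records, but the deduplicated one stores each distinct key and string value only once. A real format would typically use variable-length integers rather than the fixed four-byte offsets assumed here, which shrinks the back-references further.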

Sereal::Decoder performance comparison

We see a similar picture in decoder performance: Sereal only edges out MessagePack, but beats all other contenders comfortably. There are some more benchmarks on the Sereal wiki.

What's next?

All code and the protocol specification are available from the official Sereal repository and are released under the same license as Perl itself (dual Artistic and GPL license).

At this point, we are using Sereal in production at Booking.com for our high-volume event/logging infrastructure on thousands of servers. Usage for all of our caching infrastructure is just around the corner.

Sereal is still under active development. We intend, but do not yet guarantee, full backward compatibility. We expect that some issues remain with respect to portability to systems of different endianness (we use little-endian only), possibly EBCDIC, or odd C type sizes. The reference encoder and decoder implementation is likely not as portable yet as it needs to be.

The text of the format specification is not as complete as we would like and could do with some review and extension to cover all corner cases that might currently be defined only by the reference implementation. The most recent version of the Sereal specification can always be found in the official repository on GitHub.

Sereal was designed and implemented at Booking.com by Yves Orton, Steffen Müller, Damian Gryski, Chris Veenboer, Rafaël Garcia-Suarez, Ævar Arnfjörð Bjarmason, and others.

