Channel: Booking.com dev blog

The Next Sereal is Coming!

Sereal is a fast, compact, binary, schema-less serialization format that aims at dynamic languages' data structures and specifically supports many of Perl's complex structures and concepts.

Over the past months, we've been quietly hacking away at a new and improved version of the Sereal protocol. This article is to draw attention to the new developments and to encourage third-party language-implementors of Sereal to consider upgrading their code to support the new protocol version.

Users of Sereal can now take a deep breath and relax. The Perl reference implementation of Sereal transparently supports reading Sereal documents old and new, and the encoder even sports an option to emit documents that follow the old "V1" protocol, so migration should be absolutely seamless, zero-downtime, and without any headache. The easiest way is to upgrade the decoder first, once there is an official release, and only then upgrade the encoder. Nothing else is required. Promised!

The changes in Sereal version 2 aren't radical. We're quite happy to say that the protocol turned out to be rather solid in the originally published version. The one modification that stands out is that document bodies are now relocatable within a Sereal document.

Let's take a step back to consider what that means and get some context. The bulk of a Sereal document consists of so-called tags, potentially followed by some data. For example, there is a tag for long byte-strings (called BINARY) which is followed by a length and the bytes that make up the binary blob. Each tag is exactly one byte in size. As an optimization, we can use some of the bits in the tag-byte to indicate lengths. For example, SHORT_BINARY can be used to represent short (byte-)strings without encoding a separate length. In order to support complex data structures with multiple references to sub-structures (or even aliases), Sereal uses offsets: The REFP tag is followed by a byte-offset that encodes a reference to a previously (thus the P in REFP) encoded data structure.
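To make the tag-plus-data idea concrete, here is a minimal sketch in Python (used purely for illustration; this is not the API of any Sereal implementation). The numeric tag values, the 31-byte limit for SHORT_BINARY and the base-128 varint layout are assumptions taken from the published Sereal tag table rather than from this article; the helper names are made up.

```python
# Tag values are assumptions based on the published Sereal tag table.
BINARY = 0x26          # followed by a varint length, then the bytes
SHORT_BINARY_0 = 0x60  # low 5 bits of the tag hold a length of 0..31


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, low bits first."""
    if n < 0:
        raise ValueError("varints are unsigned")
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
        n >>= 7
    out.append(n)
    return bytes(out)


def encode_bytes(data: bytes) -> bytes:
    """Pick the shorter of the two byte-string encodings."""
    if len(data) < 32:
        # The length fits into the tag byte itself: no separate length field.
        return bytes([SHORT_BINARY_0 | len(data)]) + data
    return bytes([BINARY]) + encode_varint(len(data)) + data


short = encode_bytes(b"fooooo")
long_ = encode_bytes(b"x" * 40)
print(short.hex())  # tag byte 0x66 followed by the six string bytes
print(len(long_))   # 1 tag byte + 1 length byte + 40 data bytes
```

For a six-byte string the length rides along in the tag, so the encoding costs exactly one byte of overhead; the 40-byte string needs the generic tag plus a separate length.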

Apparently, people were surprised that Sereal often beats MessagePack and other well-thought-out serialization formats in compactness. One reason is that it simply strikes different trade-offs that may be more suitable to the type of data being encoded. Another is that it allows the encoder to use the same trick that's used for real user-visible back-references (REFP) for internal space optimizations. Here, the encoder may choose to replace any given element with a COPY tag, which is followed by an offset that identifies the previously encoded element to be decoded in its place. COPYs are not a user-visible feature (apart from the space savings) and are optional to the encoder implementation. But they are extremely important if you consider, for example, Perl's use of hashes/dictionaries for representing objects. Lots of repeated attribute names! The Sereal implementation for Perl will only encode these once.
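The COPY trick can be sketched with a toy encoder, again in Python and again only as an illustration. It handles nothing but lists of small dictionaries with short byte-string keys and integer values from 0 to 15. The class and method names are hypothetical, the tag values are assumed from the published tag table, and offsets are simply counted from the start of the output buffer.

```python
# Tag values are assumptions based on the published Sereal tag table.
COPY = 0x2F            # followed by a varint offset
ARRAYREF_0 = 0x40      # low 4 bits: number of elements (0..15)
HASHREF_0 = 0x50       # low 4 bits: number of key/value pairs (0..15)
SHORT_BINARY_0 = 0x60  # low 5 bits: string length (0..31)


def encode_varint(n: int) -> bytes:
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


class ToyEncoder:
    """Hypothetical encoder: remembers where each string was first written."""

    def __init__(self) -> None:
        self.buf = bytearray()
        self.seen = {}  # string -> offset of its first encoding

    def write_string(self, s: bytes) -> None:
        if len(s) > 31:
            raise ValueError("toy encoder only handles short strings")
        if s in self.seen:
            copy = bytes([COPY]) + encode_varint(self.seen[s])
            if len(copy) < 1 + len(s):  # only copy when it saves space
                self.buf += copy
                return
        self.seen.setdefault(s, len(self.buf))
        self.buf.append(SHORT_BINARY_0 | len(s))
        self.buf += s

    def write_records(self, records) -> None:
        if len(records) > 15:
            raise ValueError("toy encoder only handles up to 15 records")
        self.buf.append(ARRAYREF_0 | len(records))
        for rec in records:
            if len(rec) > 15:
                raise ValueError("toy encoder only handles up to 15 keys")
            self.buf.append(HASHREF_0 | len(rec))
            for key, value in rec.items():
                self.write_string(key)
                if not 0 <= value <= 15:
                    raise ValueError("toy encoder only handles 0..15")
                self.buf.append(value)  # small positive ints are their own tag


enc = ToyEncoder()
enc.write_records([{b"fooooo": 1}, {b"fooooo": 1}])
print(enc.buf.hex(" "))
```

The second occurrence of the key is written as two bytes (COPY plus offset) instead of seven, which is where the savings for repeated attribute names come from.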

Now, the significant change in Sereal V2 is that the meaning of offsets changes. For both references and internal copies they used to be relative to the very first character in the document. In Sereal V2, they are relative to the first character of the document body. In other words, relative to the location of the very first tag. Let's look at how that works by inspecting actual Sereal documents. The Perl Sereal implementation comes with a handy debugging hack called hobodecoder. To wit:

$ perl -MSereal -e 'print encode_sereal("fooooo")' | \
  perl author_tools/hobodecoder.pl

0000000  61 115 114 108   1   0 102 102 111 111 111 111 111
          =   s   r   l 001  \0   f   f   o   o   o   o   o
0000015

Total length: 13

Sereal protocol version: 1
Empty Header.
000006: 06 102 SHORT_BINARY(6): 'fooooo'

So we just encoded a really simple data structure: The string "fooooo". Let's go through the hobodecoder output. First, we have a friendly byte dump of the document (decimal values, od -tu1c for the curious). You can see the magic identifier string =srl at the beginning, followed by the protocol-version and document-flags byte, and a NUL byte indicating that there's no variable-length header data. Then comes the actual document body. The document was produced with an old encoder that produces V1 Sereal by default. The last line is a dump of the document body. At absolute byte position six, we have a SHORT_BINARY tag, including the string length in bytes, followed by the actual six bytes of the binary string. That's all. Let's try something more interesting and dump a hash.
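Before moving on, the header layout just described can be checked by hand against the 13 bytes from the dump. The following Python sketch is an illustration only. It assumes that the version lives in the low four bits of the version/flags byte and that the header length is a base-128 varint; both assumptions come from the Sereal specification rather than from this article, and the function names are made up.

```python
MAGIC = b"=srl"


def decode_varint(buf: bytes, pos: int):
    """Return (value, position after the varint)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7


def parse_header(doc: bytes):
    """Return (version, flags, header_len, body_start)."""
    if doc[:4] != MAGIC:
        raise ValueError("not a Sereal document")
    version = doc[4] & 0x0F  # assumption: low nibble is the version
    flags = doc[4] >> 4      # assumption: high nibble holds the flags
    header_len, pos = decode_varint(doc, 5)
    return version, flags, header_len, pos + header_len


# The 13 bytes shown in the dump above.
doc = bytes([61, 115, 114, 108, 1, 0, 102, 102, 111, 111, 111, 111, 111])

version, flags, header_len, body_start = parse_header(doc)
tag = doc[body_start]
length = tag & 0x1F  # SHORT_BINARY keeps its length in the low 5 bits
string = doc[body_start + 1:body_start + 1 + length]
print(version, header_len, body_start, length, string)
```

The body starts at byte six, the tag byte carries the length six, and the six bytes after it are the string.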

Dumping {fooooo => 1}:

000006: 51 081 HASHREF(2)
                 KEY:
000007: 06 102     SHORT_BINARY(6): 'fooooo'
                 VALUE:
000014: 01 001     POS: 1

At byte six, we have a HASHREF tag, which includes the number of elements in the hash (here: one key, one value). At byte seven, we have the now well-known binary string, and at byte 14, there's a positive integer 1. Got the hang of it? Yes? So let's try something more interesting. An array of two simple hashes that are like the above.

Dumping [ {fooooo => 1}, {fooooo => 1} ]:

000006: 42 066 ARRAYREF(2)
000007: 51 081   HASHREF(2)
                   KEY:
000008: 06 102       SHORT_BINARY(6): 'fooooo'
                   VALUE:
000015: 01 001       POS: 1
000016: 51 081   HASHREF(2)
                   KEY:
000017: 2f 047       COPY(8)
                   VALUE:
000019: 01 001       POS: 1

Whoops, just one string in there. Indeed, Sereal is working its magic by emitting a COPY tag instead of the string. This saves five bytes (20% of the 25 bytes the document would otherwise take) in this particularly simple document. The COPY(8) simply says "when decoding this document, put here whatever the document holds at byte 8". At byte 8, measured from the start of the document (the = of the =srl magic string), we have the SHORT_BINARY from the first hashref. In Sereal V2, that 8 becomes a 3 because the offset now refers to the start of the document body, which happens to be the initial ARRAYREF byte. And since offsets in Sereal V2 are 1-based instead of 0-based, the value is 3 rather than 2.
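The difference between the two offset conventions fits into a few lines of a toy decoder. This Python sketch only understands the handful of tags that appear in the dump. The V2 document is built by hand under the assumption that the header is otherwise unchanged apart from the version number; tag values and the varint layout are likewise assumptions taken from the Sereal specification.

```python
# Tag values are assumptions based on the published Sereal tag table.
COPY = 0x2F
ARRAYREF_0 = 0x40
HASHREF_0 = 0x50
SHORT_BINARY_0 = 0x60


def decode_varint(buf, pos):
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7


def decode_item(doc, pos, base):
    """Decode one element at pos; return (value, position after it)."""
    tag = doc[pos]
    pos += 1
    if tag <= 0x0F:                       # small positive integer
        return tag, pos
    if tag == COPY:                       # decode the referenced element here
        offset, pos = decode_varint(doc, pos)
        value, _ = decode_item(doc, base + offset, base)
        return value, pos
    if (tag & 0xF0) == ARRAYREF_0:
        items = []
        for _ in range(tag & 0x0F):
            item, pos = decode_item(doc, pos, base)
            items.append(item)
        return items, pos
    if (tag & 0xF0) == HASHREF_0:
        result = {}
        for _ in range(tag & 0x0F):
            key, pos = decode_item(doc, pos, base)
            result[key], pos = decode_item(doc, pos, base)
        return result, pos
    if (tag & 0xE0) == SHORT_BINARY_0:
        n = tag & 0x1F
        return doc[pos:pos + n], pos + n
    raise ValueError("tag 0x%02x not handled by this sketch" % tag)


def decode_document(doc):
    version = doc[4] & 0x0F
    header_len, pos = decode_varint(doc, 5)
    body_start = pos + header_len
    # V1: offsets are 0-based and count from the first byte of the document.
    # V2: offsets are 1-based and count from the first byte of the body.
    base = 0 if version == 1 else body_start - 1
    value, _ = decode_item(doc, body_start, base)
    return value


def body(copy_offset):
    return (bytes([0x42, 0x51, 0x66]) + b"fooooo"
            + bytes([0x01, 0x51, COPY, copy_offset, 0x01]))


v1_doc = b"=srl\x01\x00" + body(8)  # the document from the dump above
v2_doc = b"=srl\x02\x00" + body(3)  # same data with a body-relative offset
print(decode_document(v1_doc))
print(decode_document(v2_doc))
```

Both documents decode to the same two hashes; only the interpretation of the number after the COPY tag differs.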

Why 1-based offsets, you ask, when the rest of the world has realized that FORTRAN got it all wrong and indexes start at 0? Well, there's a very practical reason: it simplifies upgrading implementations to V2. Previously, there were never 0 offsets (why point outside the document body?). With body-relative offsets there would be, and shifting the offset base by one preserves the property that all valid offsets are true.

Stepping away from the gory details again, what's the significance of this tiny change? When encoding a Sereal document from an in-memory data structure, we'd previously have to know exactly where the document body started in the final output in order to encode the body correctly. So in the above dump, if for some reason, we had to add in some more space for the Sereal header after the fact, and the body started at byte seven instead of byte six, we'd thoroughly break the COPY (and REFP) tags in the document body. What adds insult to injury is that we couldn't just scan the encoded document to update the offset numbers. Since they are encoded as varints, their length may change if we change the value. Alas, this is a clear design failure in protocol version 1.
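The varint problem is easy to demonstrate. The sketch below assumes the usual base-128 varint layout (seven payload bits per byte, high bit as continuation marker); the helper name is made up.

```python
def encode_varint(n: int) -> bytes:
    """Base-128 varint: 7 bits per byte, high bit set on all but the last."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


# An absolute offset of 127 fits into one byte. Move the body by a single
# byte and the same offset becomes 128, which needs two bytes.
before = encode_varint(127)
after = encode_varint(128)
print(len(before), len(after))
```

Patching such an offset in place makes the document one byte longer, which shifts everything behind it and can invalidate further offsets in turn. With body-relative offsets the body never has to be touched, no matter how much header is placed in front of it.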

Considered in isolation, being able to move the location of a document body in a Sereal document doesn't look like a significant win. The Perl/XS version of the Sereal V1 encoder is proof that one can efficiently implement encoding without worrying about this issue. Here is where things get really interesting, though. This new behavior allows us to embed more than one document body into the same Sereal document. Specifically, in Sereal V2, you will be able to embed arbitrary user-defined (Sereal-encoded) data into the variable-width document header. This means that you can embed meta-data such as routing information in the header. With that, processes that just relay documents don't have to uncompress and deserialize an entire document just to find out that it is destined for somebody else. This makes framing really easy while not putting arbitrary restrictions on the information that you put into the header.
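What a relaying process gains can be sketched as follows. This is a deliberately simplified illustration in Python: the header payload is treated as opaque bytes, whereas the real V2 header carries Sereal-encoded data whose exact layout is not described in this article. The function names and the routing string are hypothetical, and the varint header length is an assumption taken from the specification.

```python
MAGIC = b"=srl"


def encode_varint(n: int) -> bytes:
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def decode_varint(buf: bytes, pos: int):
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7


def build_document(version: int, header: bytes, body: bytes) -> bytes:
    """Prepend magic, version and a header of any size to a finished body."""
    return MAGIC + bytes([version]) + encode_varint(len(header)) + header + body


def split_document(doc: bytes):
    """Return (version, header bytes, body offset) without reading the body."""
    if doc[:4] != MAGIC:
        raise ValueError("not a Sereal document")
    version = doc[4] & 0x0F
    header_len, pos = decode_varint(doc, 5)
    return version, doc[pos:pos + header_len], pos + header_len


# A V2-style body with a body-relative COPY offset of 3.
body = bytes([0x42, 0x51, 0x66]) + b"fooooo" + bytes([0x01, 0x51, 0x2F, 0x03, 0x01])

plain = build_document(2, b"", body)
routed = build_document(2, b"route=eu-1", body)  # hypothetical routing hint

version, header, body_offset = split_document(routed)
print(version, header, body_offset)
print(routed[body_offset:] == plain[6:])  # the body bytes are identical
```

The body is encoded once and stays byte-for-byte identical whether or not a header is added afterwards, and the relay only ever looks at the first few bytes of the document.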

At Booking.com, the Sereal V2 changes will be a big hit. We use Sereal a lot to store arrays of event objects on disk. Meta operations such as finding the total number of events in a given directory previously required reading, decompressing, and deserializing a large number of large files, or using a separate meta-data storage. By embedding the event count into the header of each Sereal document, the work required to count event objects is trivially reduced to peeking at the header of each document and aggregating.

Have you tried Sereal for your serialization needs? Let us know what you think! Have you considered implementing Sereal for another programming language? Check the status of your favorite language's Sereal implementation in the Sereal GitHub repository and get involved!

Sereal was originally designed and implemented at Booking.com by Yves Orton, Steffen Müller, Damian Gryski, Chris Veenboer, Rafaël Garcia-Suarez, Ævar Arnfjörð Bjarmason, Borislav Nikolov, and others.

