PyData Amsterdam 2017 at Booking.com: Deep Learning, Statistical Models and NLP

In April, Booking.com hosted PyData Amsterdam 2017. The Booking.com headquarters was filled with 330 Python developers and data scientists from all over Europe, who gathered for a weekend full of talks and discussions all about using and evolving Python for Data Science applications. The atmosphere was wonderful, with interesting presentations, people meeting others from the PyData community, sharing experiences, problems and solutions, discussing future developments, and everything in between. As the Dutch would say: gezellig!

We had 32 talks at the conference covering a wide array of PyData-related subjects, from Deep Learning, to Data Visualization, to the Ethics of Machine Learning. Booking.com itself contributed three talks to the conference, which were similarly diverse: applying Deep Learning in production, diagnosing statistical models, and using NLP on song lyrics.

Deep Learning

Deep Learning is currently a hot topic, so it was no surprise that it was featured in almost a third of all the talks at this year’s PyData conference.

Representing Booking.com, Emrah Tasli and Stas Girkin dived into the complex problem of image understanding. Emrah showed how Booking.com’s unique corpus of millions of tagged photos enables us to train a deep convolutional neural network specialised to output image labels that are relevant to our exact problem. Stas took us through the technical details of scaling this to work for our millions of users daily, and how to test the direct benefits to our customers via A/B testing.

The range of Deep Learning topics covered at the PyData conference was very broad, and really gave us a sense of just how powerful this tool can be. For example, Mark-Jan Harte talked us through an application in the medical domain with his inspiring talk on “Training a TensorFlow model to detect lung nodules on CT scans”. Dafne van Kuppevelt covered a wide range of applications in her talk “Deep learning for time series made easy”: classifying bird activity in ecology, classifying human activity from movement sensors, and classifying epilepsy from EEG recordings. As the title of the talk suggests, it was certainly refreshing to see a more beginner-friendly talk on the subject.

Diagnosing statistical models

Ever wonder why all the coefficients in your linear model turn up insignificant? Wonder no more! Lucas Bernardi shared some of his pragmatic Data Science tricks to diagnose statistical models in a clear-cut way. He elaborated on one of the possible reasons for the insignificance of coefficients: features that are not independent of each other (multicollinearity). As Lucas stressed, this problem matters most when the main goal is understanding and interpreting the model, rather than making accurate predictions. He explained how to use a clustered correlation plot to find and deal with the multicollinearity of features in explanatory models.
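To make the multicollinearity check concrete, here is a minimal, text-only sketch of the idea behind a clustered correlation plot. It is not Lucas’s code: instead of drawing a reordered heatmap, it simply groups features whose absolute Pearson correlation exceeds a threshold. The feature names, the toy data, and the 0.9 threshold are all hypothetical choices made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlated_groups(columns, threshold=0.9):
    """Group features linked by |correlation| >= threshold.

    `columns` maps a feature name to its list of values. Two features end up
    in the same group if they are connected through a chain of strongly
    correlated pairs (connected components of the correlation graph).
    """
    names = list(columns)
    corr = [[pearson(columns[a], columns[b]) for b in names] for a in names]

    groups, seen = [], set()
    for start in range(len(names)):
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:
            j = stack.pop()
            group.append(names[j])
            for k in range(len(names)):
                if k not in seen and abs(corr[j][k]) >= threshold:
                    seen.add(k)
                    stack.append(k)
        groups.append(sorted(group))
    return groups


# Hypothetical toy data: the two price columns are near-multiples of `nights`,
# while `guests` is unrelated to the other features.
features = {
    "nights": [1, 2, 3, 4, 5, 6, 7, 8],
    "price_eur": [102, 198, 305, 401, 497, 603, 699, 805],
    "guests": [2, 1, 1, 2, 2, 1, 1, 2],
}
features["price_usd"] = [p * 1.1 for p in features["price_eur"]]

groups = correlated_groups(features, threshold=0.9)
print(groups)
```

Each group with more than one member is a candidate for multicollinearity. In an explanatory model, one plausible response is to keep a single representative feature per group, though whether that is appropriate depends on the model and the question being asked.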

The second topic Lucas covered was monitoring and diagnosing a classification model that is used in production. As an example, he chose the “Business vs Leisure” model used on the Booking.com website. In short: when a user does not indicate in the search box whether they are travelling for leisure or business, we still want to predict the probability that they are a business booker. In order to optimise the user experience, we might show different versions of our website depending on what this model predicts. The challenge in this live environment is that our data could be:

incomplete (not all the data ends up labelled, so there’s no way to evaluate all data against a ground truth);
delayed (the visitor might book only some time later);
dynamic (the label and feature space distributions change over time).

In this real-world scenario, how can we monitor model performance and diagnose any trouble? Lucas advocated the use of “Response Distribution Analysis”, which means looking at the distribution of the model’s output over all of the presented examples, that is, the distribution of the predicted probabilities of belonging to the positive class. Ideally, we want this to be a bimodal distribution, and use the “valley” between the peaks as the threshold value. To learn about the interpretation of more patterns in the response distribution, watch the recording of Lucas’s talk.
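As a rough illustration of picking the “valley” of a bimodal response distribution, the sketch below bins a list of predicted probabilities into a histogram and returns the centre of the emptiest bin between the two peaks. It is an assumption-laden simplification, not the method from the talk: the scores are made up, the function names are hypothetical, and it assumes one peak lies below 0.5 and the other above it.

```python
def histogram(scores, n_bins=10):
    """Count how many scores in [0, 1] fall into each of n_bins equal-width bins."""
    counts = [0] * n_bins
    for s in scores:
        # A score of exactly 1.0 is placed in the last bin.
        idx = min(int(s * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts


def valley_threshold(counts):
    """Return the centre of the least-populated bin between the two peaks.

    Simplifying assumption: the distribution is bimodal, with one peak in the
    lower half of the bins and one in the upper half.
    """
    n = len(counts)
    left_peak = max(range(n // 2), key=counts.__getitem__)
    right_peak = max(range(n // 2, n), key=counts.__getitem__)
    valley = min(range(left_peak, right_peak + 1), key=counts.__getitem__)
    return (valley + 0.5) / n


# Hypothetical model outputs: most visitors score clearly low or clearly high.
scores = [
    0.03, 0.07, 0.11, 0.13, 0.16, 0.19, 0.21, 0.24, 0.28, 0.33, 0.37, 0.45,
    0.52, 0.57, 0.63, 0.68, 0.72, 0.75, 0.79, 0.81, 0.84, 0.86, 0.89, 0.93, 0.97,
]

counts = histogram(scores)
threshold = valley_threshold(counts)
print(counts, threshold)
```

Because this needs only the model’s outputs and no labels, the same histogram can be recomputed on fresh traffic even when labels are incomplete or delayed. A change in its shape over time is then a signal worth investigating.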

NLP on heavy metal lyrics

In a full schedule of 2 days of talks from 9 to 6, we were happy to have a few lightweight, fun and far-out talks too! Jon Paton showed us how English looks to non-English people in his talk on character level Markov models, “Simulate your language. ish.” Another talk in this line was Rogier van der Geer’s “Risk Analysis”. Contrary to what any financial analysts in the audience might have hoped for, we learned from Rogier’s talk how to win the board game Risk using genetic algorithms. Even one of the keynotes had a fun edge; in his presentation “Python versus Orangutan”, Dirk Gorissen shared his experiences with using Python to train drones to find orangutans in the rainforests of Borneo.

For Booking.com, Iain Barr showed that our Data Scientists don’t just care about holiday travel. Iain explained how he applied NLP to the song lyrics of metal bands. We’ll never forget his definition of the “metal-ness” of a word.

It turns out that using this simple idea - of comparing each word’s frequency in metal lyrics to its frequency in normal English - gives a pretty good measure of what we’d intuitively mean by “metal-ness”. So: the most metal word in the English language is ‘burn’, closely followed by ‘cries’, ‘veins’, and ‘eternity’. Want to know the least metal words? If you’re particularly (hint) interested then you can relatively (hint) easily check out Iain’s talk.
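The frequency comparison can be sketched in a few lines. The exact formula Iain used is not reproduced in this post, so the version below is an assumption: it scores each word by the logarithm of the ratio between its relative frequency in metal lyrics and its relative frequency in a reference English corpus. The two tiny corpora and the function names are hypothetical.

```python
from collections import Counter
from math import log


def relative_frequencies(words):
    """Map each word to its share of all the words in the corpus."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def metalness(metal_words, english_words):
    """Score words by log(frequency in metal lyrics / frequency in English).

    Assumed definition for illustration only. Words that are missing from
    either corpus are skipped, since their ratio is undefined.
    """
    metal_freq = relative_frequencies(metal_words)
    english_freq = relative_frequencies(english_words)
    return {
        word: log(metal_freq[word] / english_freq[word])
        for word in metal_freq
        if word in english_freq
    }


# Hypothetical toy corpora, far too small for real conclusions.
metal_corpus = "burn the veins burn the night of eternity burn".split()
english_corpus = "the cat sat on the mat of the night and the veins burn".split()

scores = metalness(metal_corpus, english_corpus)
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

A positive score means a word is over-represented in the metal lyrics, and a negative score means it is more typical of everyday English. With real corpora it would be sensible to ignore very rare words, whose ratios are noisy.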

More PyData

You can find recordings of all talks from PyData Amsterdam 2017 here. Overall, the PyData Amsterdam 2017 conference was a great success and a learning experience for us. We learned a lot about Data Science and Python, hosting and organising a conference - and we had a lot of fun too. Here’s to PyData Amsterdam 2018!

