Named Entity Classification

abstract: The customers of Booking.com communicate with us through different mediums. They perform queries on our search engine, provide us with reviews about their stay and describe their opinion about different destinations. All this communication creates an abundance of textual information — and it's a key part of our job to understand it. The first step towards this goal is the recognition of the named entities (the sequences of words in the text which correspond to categories such as cities, accommodation, facilities, etc.). In this blog post, we display a comparison of different approaches that can be used in order to tackle such a Named Entity Classification task.

1. Introduction

Booking.com customers are in constant communication with our website and provide us with a plethora of textual information in the process. Our customers "talk" to us at every step of their journey: from the start of their experience, when they pose queries to our search engine, to long after they've returned from their trip, when they provide us with feedback about their stay and information about the place they visited.

All these interactions create a vast amount of structured and semi-structured text that contains valuable details about their experience on our website, at their accommodation and in the place they visited. It is of utmost importance for us to be able to understand this large set of data, which is an essential building block for everything we do at Booking.com. First and foremost, we should recognise the entities in the text. One can treat this problem as a Named Entity Classification task.

This blog post describes three prototype solutions for the task of Named Entity Classification in the context of Booking.com. The aim is to present different approaches to the classification task, analyse their implementation and compare them in a small scale prototype use case. Sample code in Python is also provided in the following sections for each model described.

Fig. 1: Searching in Booking.com

Fig. 2: Reviews in Booking.com

2. Models

Three approaches were followed in order to tackle the problem of Named Entity Classification. The first approach uses Structural SVM, the second uses Recurrent Neural Networks with Word Embeddings and the third uses Learning2Search.

For the SVM approach, MITIE is used. MITIE is an open-source natural language processing library from MIT that focuses on information extraction. The library uses state-of-the-art statistical machine learning.

The second approach follows a different path and utilizes RNNs with Word Embeddings. The approach was shown to be successful for the Slot Filling task by Mesnil et al. [5] in a project in which the University of Montreal and Microsoft Research collaborated. Slot Filling is a Spoken Language Understanding task whose aim is to assign a label to each word of a given sentence.

Last but definitely not least, Learning2Search (L2S) [1] was used. It was created by the team that built Vowpal Wabbit and was presented by Langford and Daumé at ICML 2015. L2S's strategy follows a sequential decision making process and its usage in Named Entity Classification was presented in the tutorial.

Before discussing each approach further, it is useful to describe the task and the data used by the models.

3. Problem and data

In this blog post, we analyze different state-of-the-art approaches and compare them in a small Named Entity Classification task.

The goal of this example task is to recognize the following labels in textual strings:

destinations (dest)
property types (prop_type)
facilities (fac)

The model should be able to process search queries and assign a label to each word they consist of. For instance, in the query “hotels amsterdam wifi” we would like to have the following labeling:

hotels: property type
amsterdam: destination
wifi: facility

We built our synthetic training data by creating combinations and permutations of the words from our corpora of destinations, facilities, property types etc. Since this work is a prototype to explore the potential models for performing Named Entity Classification, we focused only on the English language.
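The generation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny vocabularies and the function name `synthetic_queries` are hypothetical, every entity is a single word, and each query contains exactly one word per label.

```python
from itertools import permutations, product

# Tiny hypothetical vocabularies; the real corpora are much larger.
VOCAB = {
    "dest": ["amsterdam", "paris"],
    "prop_type": ["hotels", "apartments"],
    "fac": ["wifi", "parking"],
}

def synthetic_queries(vocab):
    """Yield (tokens, labels) pairs for every ordering of one word per label."""
    labels = list(vocab)
    # product picks one word from each corpus; permutations reorders the picks.
    for words in product(*(vocab[label] for label in labels)):
        for order in permutations(range(len(labels))):
            yield [words[i] for i in order], [labels[i] for i in order]

queries = list(synthetic_queries(VOCAB))
print(len(queries))  # 2 * 2 * 2 word choices times 3! orderings
print(queries[0])
```

Because the labels are produced together with the tokens, every generated query is already annotated and can be fed to any of the three models.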

4. Structural SVM approach

The first approach uses Structural SVM. MITIE (https://github.com/mit-nlp/MITIE) uses Structural SVM to perform named entity classification. It is a C++ library that provides APIs in C, C++, Java, R and Python (2.7).

It is open-source and has been shown to be on par with Stanford NLP on the Named Entity Recognition task using the CoNLL 2003 corpus (testb). MITIE achieved an F1 score of 88.10%, while Stanford NLP achieved 86.31% (https://github.com/mit-nlp/MITIE/wiki/Evaluation).

It is also fast in comparison to other models that attempt to solve the task of named entity recognition.

Fig. 3: Speed of MITIE VS other approaches

4.1. Under the hood

Before we move to the implementation details of the model, it is useful to describe how MITIE works. The library is a bit tricky to read and the examples (https://github.com/mit-nlp/MITIE/tree/master/examples) do not clearly display what is happening “under the hood”.

MITIE chunks each sentence into entities and each entity is labeled by a multi-class classifier. In order to classify each chunk, MITIE creates a 500K-dimensional vector which is the input to the multi-class classifier. The classifier learns one linear function for each class plus one for the “not an entity” class. The feature extraction source code can be found in the ner_feature_extraction.cpp file (https://github.com/mit-nlp/MITIE/blob/master/mitielib/src/ner_feature_extraction.cpp). It uses the Dlib toolkit (https://dlib.net/), a C++ toolkit for machine learning.

Some of the features are the following:

Does it contain numbers?
Does it contain letters?
Does it contain letters and numbers?
Does it contain hyphens?
Does it have alternating capital letters in the middle?
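To make these yes/no questions concrete, here is a minimal sketch of how such shape features could be computed in Python. It is not MITIE's feature extractor (which is written in C++ and produces far more features); the function name and the feature keys are hypothetical, and the last feature is our own reading of "capital letters in the middle".

```python
import re

def shape_features(chunk):
    """Return yes/no shape features for a text chunk."""
    has_digit = any(ch.isdigit() for ch in chunk)
    has_letter = any(ch.isalpha() for ch in chunk)
    return {
        "has_numbers": has_digit,
        "has_letters": has_letter,
        "has_letters_and_numbers": has_digit and has_letter,
        "has_hyphen": "-" in chunk,
        # Assumption: a lowercase letter directly followed by an uppercase
        # one inside the chunk, as in "WiFi".
        "has_inner_capital": re.search(r"[a-z][A-Z]", chunk) is not None,
    }

print(shape_features("4-star"))
print(shape_features("WiFi"))
```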

If one has defined N classes, the classifier has 500K*(N+1) values to learn.
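As a quick check of this count for the task in this post, which has N = 3 labels (dest, prop_type and fac), and taking 500K to mean 500,000:

$$500{,}000 \times (N + 1) = 500{,}000 \times (3 + 1) = 2{,}000{,}000$$

so the classifier would have two million values to learn.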

4.2. Code for MITIE

For the prototype, the Python binding was used, following the example code on how to perform named entity recognition that can be found on GitHub (https://github.com/mit-nlp/MITIE/blob/master/examples/python/ner.py).

The library is language dependent and needs to learn the characteristics of a language in order to operate. It comes with pre-built English and Spanish models. The English one was used for the prototype (ner/mitie/MITIE-models/english/total_word_feature_extractor.dat). Building the feature extractor from an internal Booking.com text dataset could be beneficial to the model.

If one would like to build such a feature extractor, the wordrep tool can be used with a simple command like:

wordrep -e a_folder_containing_only_text_files

Since all of the above is easier to follow with some actual code, the following script could help someone use the model. The construction of the input to the model is simple: we have to define the range of each token within the sentence and the label that is assigned to it.

For instance, let’s assume that we have the example “hotel amsterdam wifi”. The code below displays how to add it to the trainer.
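What follows is a minimal sketch rather than the original prototype script. The helpers `entity_ranges` and `add_to_trainer` and the variable `path_to_feature_extractor` are hypothetical names; the calls `ner_training_instance`, `add_entity`, `ner_trainer`, `add` and `train` are used as they appear in MITIE's Python training example. We assume here that consecutive tokens with the same label form a single entity.

```python
def entity_ranges(labels):
    """Group consecutive identical labels into (start, end, label); end is exclusive."""
    ranges, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            ranges.append((start, i, labels[start]))
            start = i
    return ranges

def add_to_trainer(trainer, tokens, labels):
    """Add one labelled query to a MITIE ner_trainer."""
    # Imported lazily so the helper above works without the MITIE binding.
    from mitie import ner_training_instance
    sample = ner_training_instance(tokens)
    for start, end, label in entity_ranges(labels):
        sample.add_entity(range(start, end), label)
    trainer.add(sample)

tokens = ["hotel", "amsterdam", "wifi"]
labels = ["prop_type", "dest", "fac"]
print(entity_ranges(labels))

# Training itself needs MITIE and a feature extractor file, so it is not run here:
# from mitie import ner_trainer
# trainer = ner_trainer(path_to_feature_extractor)
# add_to_trainer(trainer, tokens, labels)
# ner = trainer.train()
```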

5. RNNs with Word Embeddings

Mesnil et al. [5] have demonstrated the performance of Recurrent Neural Networks with Word Embeddings on one of the major spoken language understanding (SLU) problems: slot filling. They implemented the Elman-type RNN [2] and the Jordan-type RNN [3].

5.1. The slot filling task

A lot of research has been conducted on semantic parsing in SLU. It comprises three well-defined tasks: domain detection, intent determination and slot filling. The majority of the approaches to the slot filling task attempt to perform sequence classification. Approaches that are based on Conditional Random Fields [4] have proven to be successful in the task.

A classic benchmark for this task has been the ATIS (Airline Travel Information System) dataset, which was collected by DARPA. The dataset follows the Inside Outside Beginning (IOB) representation (https://en.wikipedia.org/wiki/Inside_Outside_Beginning).

An example from this dataset is the following.

Input: show flights from Boston to New York today
Output:
- show: null
- flights: null
- from: null
- Boston: B-dept
- to: null
- New: B-arr
- York: I-arr
- today: B-date
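The labeling above can be reproduced with a short Python sketch. The function name `to_iob` and the span format (start, end, slot) with an exclusive end are our own choices for illustration; the outside label is written as "null" to match the listing.

```python
def to_iob(tokens, spans, outside="null"):
    """Label tokens with B-/I- prefixes from (start, end, slot) spans; end is exclusive."""
    tags = [outside] * len(tokens)
    for start, end, slot in spans:
        tags[start] = "B-" + slot          # first token of the slot
        for i in range(start + 1, end):
            tags[i] = "I-" + slot          # remaining tokens of the slot
    return list(zip(tokens, tags))

tokens = "show flights from Boston to New York today".split()
spans = [(3, 4, "dept"), (5, 7, "arr"), (7, 8, "date")]
for token, tag in to_iob(tokens, spans):
    print(f"{token}: {tag}")
```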

5.2. Word Embeddings

Word Embeddings have recently been receiving more publicity since Google's Word2Vec (https://code.google.com/archive/p/word2vec/) uses them. Words are mapped to real-valued embedding vectors that are learned from text corpora in an unsupervised way.

5.3. Context window

A word-context window is used in order to capture short-term temporal dependencies. It is needed because, in the absence of temporal feedback, the input for a single word carries no information about its neighbours.

The context window consists of the ordered concatenation of word embedding vectors. For instance, the following is an example for a context window of size 3:

$$w(t) = [\text{hotel}, \textbf{amsterdam}, \text{wifi}]$$

$$\textbf{amsterdam} \rightarrow x_{\textbf{amsterdam}} \in \mathbb{R}^d$$

$$w(t) \rightarrow x(t) = [x_{\text{hotel}}, x_{\textbf{amsterdam}}, x_{\text{wifi}}] \in \mathbb{R}^{3d}$$

where $w(t)$ is the 3-word context window around the $t$-th word ‘amsterdam’, $x_{\textbf{amsterdam}}$ is the embedding vector of ‘amsterdam’, and $d$ is the dimension of the embedding vector.

$x(t)$ is the ordered concatenation of the word embedding vectors for the words in $w(t)$.
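The construction of the window and of the concatenated vector can be sketched in plain Python. The embeddings here are random stand-ins (real ones would be learned from a corpus), and the padding token used at the sentence borders is an assumption of this sketch.

```python
import random

def context_windows(tokens, size, pad="<pad>"):
    """Return one window of `size` tokens centred on each token (size must be odd)."""
    assert size % 2 == 1, "window size must be odd"
    half = size // 2
    padded = [pad] * half + list(tokens) + [pad] * half
    return [padded[i:i + size] for i in range(len(tokens))]

def embed_window(window, embeddings):
    """Concatenate the embedding vectors of the window's words, in order."""
    return [value for word in window for value in embeddings[word]]

d = 4  # embedding dimension, chosen arbitrarily for the example
rng = random.Random(0)
tokens = ["hotel", "amsterdam", "wifi"]
# Random stand-in embeddings, one d-dimensional vector per word.
embeddings = {w: [rng.uniform(-1, 1) for _ in range(d)] for w in tokens + ["<pad>"]}

windows = context_windows(tokens, size=3)
x_t = embed_window(windows[1], embeddings)  # window around "amsterdam"
print(windows[1], len(x_t))
```

The window around ‘amsterdam’ contains all three words, and the concatenated vector has 3d entries, as in the formula above.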

5.4. Two types of RNNs

Two variants of RNNs for modeling the slot sequences were used in the paper by Mesnil et al. [5]. In an Elman-type RNN, the output from the hidden layer at time t-1 is kept and fed back to the hidden layer at time t; this adds a kind of “virtual context nodes” to the process and enables the network to maintain and learn a summary of the past inputs. As a result, the network can perform sequence prediction, which a standard feed-forward neural network cannot do.

Jordan-type RNNs are similar to Elman-type RNNs; the difference lies in the use of the context nodes. In a Jordan-type RNN, the context nodes are fed from the output layer and not from the hidden layer as in an Elman-type RNN.
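The difference between the two variants comes down to a single line, as the following forward-pass sketch in plain Python shows. It is a simplified illustration, not the code from the paper: the weights are random and untrained, biases are omitted, and the choice of a sigmoid hidden layer with a softmax output is an assumption of this sketch.

```python
import math
import random

def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_forward(inputs, W_in, W_ctx, W_out, kind="elman"):
    """Run a simple RNN over `inputs`; return one label distribution per step.

    kind="elman":  the context is the previous hidden state.
    kind="jordan": the context is the previous output distribution.
    """
    n_hidden, n_out = len(W_in), len(W_out)
    context = [0.0] * (n_hidden if kind == "elman" else n_out)
    outputs = []
    for x in inputs:
        pre = [a + b for a, b in zip(matvec(W_in, x), matvec(W_ctx, context))]
        hidden = [1.0 / (1.0 + math.exp(-p)) for p in pre]  # sigmoid
        y = softmax(matvec(W_out, hidden))
        outputs.append(y)
        # The only difference between the two variants:
        context = hidden if kind == "elman" else y
    return outputs

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n_in, n_hidden, n_out = 6, 5, 4
inputs = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(3)]
W_in = random_matrix(n_hidden, n_in, rng)
W_out = random_matrix(n_out, n_hidden, rng)
elman = rnn_forward(inputs, W_in, random_matrix(n_hidden, n_hidden, rng), W_out, "elman")
jordan = rnn_forward(inputs, W_in, random_matrix(n_hidden, n_out, rng), W_out, "jordan")
```

Note that the context weight matrix has a different shape in each variant, because the hidden state and the output distribution have different sizes.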

5.5. Results on Slot Filling

Mesnil et al. compared their approaches against Logistic Regression models, CRF and Multilayer Perceptron. The major points of their results are the following:

Models that use the sequential dependency outperform the models that do not.
RNN models perform consistently better than the CRF model.
The Elman-type RNN that uses past information performs very well, but the Elman-type RNN that uses future information does not, despite the two being symmetric to each other. This can be explained by the format of the ATIS dataset, which has most of the information in the second half of the sentences.
For the Elman-type RNN, the best window size was 3 for the forward model and 13 for the backward model.
Jordan-type RNNs proved to be more robust, especially the bi-directional version.

Table 1: RNN results on Slot Filling task

5.6. Code for RNN

Microsoft Research released some code related to their work on GitHub (https://github.com/mesnilgr/is13). In order to run, the code needs Theano (https://deeplearning.net/software/theano/). For this prototype, the Elman-type RNN was used, based on the sample code from the repository (https://github.com/mesnilgr/is13/blob/master/rnn/elman.py).

A process similar to the one for MITIE has to be followed to build the training data. The following script provides sample code on how to train with the RNN.
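Before training, the labelled queries have to be turned into sequences of integer ids, one for the words and one for the labels. The sketch below shows only this preparation step in plain Python; the function names and the dictionary names `words2idx` and `labels2idx` are chosen for illustration, and the Theano training loop itself is not reproduced here.

```python
def build_index(items):
    """Map each distinct item to an integer id, in order of first appearance."""
    index = {}
    for item in items:
        index.setdefault(item, len(index))
    return index

def encode(dataset, words2idx, labels2idx):
    """Turn (tokens, labels) pairs into parallel lists of integer ids."""
    xs = [[words2idx[t] for t in tokens] for tokens, _ in dataset]
    ys = [[labels2idx[l] for l in labels] for _, labels in dataset]
    return xs, ys

dataset = [
    (["hotel", "amsterdam", "wifi"], ["prop_type", "dest", "fac"]),
    (["amsterdam", "hotel"], ["dest", "prop_type"]),
]
words2idx = build_index(t for tokens, _ in dataset for t in tokens)
labels2idx = build_index(l for _, labels in dataset for l in labels)
train_x, train_y = encode(dataset, words2idx, labels2idx)
print(train_x, train_y)
```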

6. VW Learning2Search

John Langford (Microsoft Research) and Hal Daumé III (University of Maryland) presented the Learning2Search [1] approach in their tutorial “Advances in Structured Prediction” at ICML 2015 (https://hunch.net/~l2s/merged.pdf).

Learning to search is a method for solving complex joint prediction problems by learning to search through a problem-defined search space.

The major difference between Learning2Search (L2S) and the rest of the state-of-the-art models lies in the way it approaches the task of structured prediction. The majority of the state-of-the-art approaches can be characterised as “global models”: they have the advantage of clean underlying semantics and the disadvantage that they are computationally costly and difficult to implement. L2S, on the other hand, treats the problem as a sequential decision making process.
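To give a feel for what "sequential decision making" means for labeling, here is a deliberately simple Python sketch: the query is labelled left to right, and each decision can see the decision made just before it. It is not the L2S training procedure and not Vowpal Wabbit code; the count-based policy only imitates the reference labels, and all names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

START = "<s>"

def train_policy(dataset):
    """Count which label follows each (word, previous label) in the reference labeling."""
    by_context, by_word = defaultdict(Counter), defaultdict(Counter)
    for tokens, labels in dataset:
        previous = START
        for token, label in zip(tokens, labels):
            by_context[(token, previous)][label] += 1
            by_word[token][label] += 1
            previous = label
    return by_context, by_word

def greedy_decode(tokens, policy, default=None):
    """Label tokens left to right; each decision sees the previous decision."""
    by_context, by_word = policy
    labels, previous = [], START
    for token in tokens:
        # Back off to word-only counts when the (word, previous label) pair is unseen.
        counts = by_context.get((token, previous)) or by_word.get(token)
        label = counts.most_common(1)[0][0] if counts else default
        labels.append(label)
        previous = label
    return labels

# Toy data: "spa" is a destination (the Belgian town) or a facility, depending on context.
dataset = [
    (["hotels", "spa", "wifi"], ["prop_type", "dest", "fac"]),
    (["amsterdam", "spa"], ["dest", "fac"]),
    (["amsterdam", "hotels"], ["dest", "prop_type"]),
]
policy = train_policy(dataset)
print(greedy_decode(["hotels", "spa"], policy))
print(greedy_decode(["amsterdam", "spa", "wifi"], policy))
```

The ambiguous word ‘spa’ receives a different label depending on the decision taken for the previous word, which is exactly the kind of dependency a per-word classifier cannot express.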

Sequential decision making approaches have been recently used in dependency parsing and a few toolkits for NLP have been published such as nlp4j (https://github.com/emorynlp/nlp4j) from Emory University, and MaltParser (https://www.maltparser.org/) from Växjö University and Uppsala University.

6.1. Learning2Search vs Other Approaches

The goal for Learning2Search was to create a model that has the following characteristics:

Lower programming complexity
Good prediction accuracy
Efficiency in terms of both train and test speed

The following graph displays a comparison in terms of lines of code between Conditional Random Field approaches (CRFSGD, CRF++), Structured SVM (S-SVM) and Learning2Search.

Fig. 4: Programming Complexity of L2S VS State-of-the-art

The following graph displays a comparison in terms of accuracy and training time between Conditional Random Field approaches (CRFSGD, CRF++), Structured Perceptron (https://en.wikipedia.org/wiki/Structured_prediction), Structured SVM (S-SVM) and Learning2Search.

Fig. 5: Training time and accuracy of L2S VS State-of-the-art

The following graph displays a comparison in terms of prediction time between Conditional Random Field approaches (CRFSGD, CRF++), Structured Perceptron (https://en.wikipedia.org/wiki/Structured_prediction), Structured SVM (S-SVM) and Learning2Search.

Fig. 6: Prediction time of L2S VS State-of-the-art

6.2. Code for VW L2S

Along with the ICML 2015 tutorial, an iPython Notebook for L2S was released. Be aware that one has to remove the --audit option in line 22 because it crashes the program.

Following is sample code that one can use for applying Learning2Search to our Named Entity Classification problem.
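A first step is to write the labelled queries in a text format that Vowpal Wabbit can read. The sketch below assumes VW's sequence layout as we understand it (one token per line in the form "label | features", integer labels starting at 1, a blank line between sequences); this should be checked against the VW documentation. The label ids, the namespace `w` and the function names are our own choices.

```python
LABEL_IDS = {"dest": 1, "prop_type": 2, "fac": 3}  # assumed: labels start at 1

def vw_token(token):
    """Replace characters that have a special meaning in VW's text format."""
    for ch in " :|":
        token = token.replace(ch, "_")
    return token

def to_vw_sequence(tokens, labels):
    """Format one labelled query, one token per line."""
    lines = [f"{LABEL_IDS[label]} |w {vw_token(token)}"
             for token, label in zip(tokens, labels)]
    return "\n".join(lines) + "\n"

def to_vw_file(dataset):
    """Join queries with a blank line, which marks the end of a sequence."""
    return "\n".join(to_vw_sequence(tokens, labels) for tokens, labels in dataset)

dataset = [
    (["hotels", "amsterdam", "wifi"], ["prop_type", "dest", "fac"]),
    (["new york", "hotels"], ["dest", "prop_type"]),
]
print(to_vw_file(dataset))

# A file written this way could then be passed to the vw command line with the
# sequence search task enabled (for example the --search and --search_task
# options); the exact invocation is not shown here.
```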

7. Results

As described earlier, a small use case was built for the comparison of the three approaches. The top 10% of clicked destinations were used to build a sample dataset for the prototype models presented. The task for the models was to recognize the following labels:

destination (dest)
facility (fac)
property type (prop_type)

The whole dataset, created from the different combinations of destinations, facilities and property types, had around 200,000 rows, and 20% of it was used as a test set to evaluate the models. Table 2 displays the results of all three approaches for Named Entity Classification in our case. The major points from the comparison are the following:

L2S is by far the best model.
Structural SVM (MITIE) performed better than the Elman RNN.

It is also worth mentioning that the resource demands of the three approaches were vastly different. L2S was by far the least demanding. It was also the fastest model in terms of training time. Both the Structural SVM and the RNN required close to 20 hours of training using almost 100GB of memory, while L2S ran on a MacBook Pro in around 15 minutes using 1GB of memory.

Table 2: Comparison of MITIE, Elman and L2S

8. Discussion and Conclusion

In this blog post, we presented three different approaches for Named Entity Classification and compared them on an example problem. There are different use cases inside Booking.com where such models can be applied, and the models can be easily adjusted to recognise different kinds of labels.

It is important to note that the models solve only the classification problem, and not the problem of mapping the recognised terms to the inventory of Booking.com.

Summing up, one thing is clear: Named Entity Classification could really help by providing a better understanding of the various textual inputs of our customers, and it fits well with the top priority of applied Data Science at Booking.com: to enhance the experience and satisfaction of Booking.com customers.

References

[1] Chang, K.-W., He, H., Daumé III, H., and Langford, J. Learning to search for dependencies. arXiv preprint arXiv:1503.05615 (2015).

[2] Elman, J. L. Finding structure in time. Cognitive science 14, 2 (1990), 179–211.

[3] Jordan, M. I. Serial order: A parallel distributed processing approach. Advances in psychology 121 (1997), 471–495.

[4] Lafferty, J., McCallum, A., and Pereira, F. C. Conditional random fields: Probabilistic models for segmenting and labeling sequence data.

[5] Mesnil, G., He, X., Deng, L., and Bengio, Y. Investigation of recurrent-neural-network architectures and learning methods for spoken language understanding. In INTERSPEECH (2013), pp. 3771–3775.