Abstract: Building the best travel experience for our customers at Booking.com often involves solving very challenging problems. One that appears very frequently is the $k$-Nearest Neighbours problem ($k$-NN). In simple words it can be stated as follows: given a thing, find the $k$ most similar things. Depending on how thing and similar are defined, the problem becomes more or less complex. In this post we’ll treat the case where the things are represented by vectors, and the similarity of two things is defined by their angle. We’ll discuss solutions and present a practical trick to make it fast, scalable, and simple. All of it, thanks to maths.
1. Things and Similarity
Suppose that we are given a database with pictures of handwritten digits, and the task of finding similar digit handwriting. Every picture contains exactly one digit, but we don't know which one. For every given picture we want to find other pictures that contain the same digit written with a similar style.
First, we need to find a representation of the pictures that we can operate on. In this case it will simply be a vector: the length d of the vector is the number of pixels in the picture, and its components are the RGB values of the pixels. This representation has the advantage of working both at the computational level and at the mathematical level.
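As a minimal sketch (assuming the pictures are already loaded as NumPy pixel arrays; the function name is ours), building the feature vector amounts to flattening the pixel grid:

```python
import numpy as np

def to_feature_vector(picture: np.ndarray) -> np.ndarray:
    """Flatten a picture (an array of pixel values) into a 1-D feature vector.

    For a 16x16 single-channel picture this gives a vector with d = 256 components.
    """
    return picture.astype(np.float64).ravel()
```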
Second, we need to define similarity. Since the pictures are represented by vectors, we can compute the angle between any two of them; if the angle is small, the vectors point in the same direction and the pictures are similar. On the other hand, if the angle is big, the vectors diverge from each other and the pictures are not similar. More formally, the similarity between two pictures represented by vectors X and Y is given by:
\begin{equation} \label{eq:sim} sim(X, Y) = cos(X, Y) = \dfrac{\langle X, Y \rangle}{||X|| ||Y||} = \dfrac{\sum_i^d{X_i Y_i}}{\sqrt{\sum_i^d{X_i^2}}\sqrt{\sum_i^d{Y_i^2}}} \text{ (1)} \end{equation}
This quantity has a very nice property: It will never be more than 1, or less than -1. Given two vectors, if their similarity is close to 1 then they are very similar; if it is close to -1 they are completely different. 0 is in the middle – not very similar, but not completely different either.
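As a concrete reference, here is a direct NumPy translation of equation 1 (a sketch; the function name is ours):

```python
import numpy as np

def cosine_similarity(x: np.ndarray, y: np.ndarray) -> float:
    """Similarity between two feature vectors, as defined in equation 1."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# Nearly parallel vectors score close to 1, orthogonal vectors score 0.
print(cosine_similarity(np.array([1.0, 1.0]), np.array([1.0, 0.9])))  # ~0.999
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0
```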
Let’s see this in action:
Fig. 1: Two very similar 4s, their similarity according to equation 1 is 0.85
Fig. 2: Two very different 4s, their similarity according to equation 1 is 0.11
Figure 1 shows a graphical representation of the vectors computed from two handwritten digit pictures. These two digits are very similar, and when their similarity is computed by their angle using equation 1 it gives 0.85, which is, accordingly, quite high. On the other hand, Figure 2 shows two quite different numbers; this time their similarity is 0.11, which is quite low but still positive – even though the writing style is very different, both pictures are still a 4.
These pictures were handpicked to illustrate the vector representation and the cosine similarity. We now move on to finding an algorithm that finds similarly written digits.
2. A simple solution
Now that we know how to represent things and compute their similarity, let’s solve the k-NN problem:
- For every picture in our database, compute its associated vector.
- When the $k$-Nearest Neighbours for a picture are requested, compute its similarity to every other picture in the database.
- Sort the pictures by ascending similarity.
- Return the last $k$ elements.
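Here is a minimal NumPy sketch of these four steps, assuming the feature vectors are stacked into an $n \times d$ matrix (the function name is ours):

```python
import numpy as np

def knn_full_scan(query: np.ndarray, database: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k rows of `database` most cosine-similar to `query`."""
    # Step 2: similarity of the query to every picture in the database.
    similarities = (database @ query) / (
        np.linalg.norm(database, axis=1) * np.linalg.norm(query)
    )
    # Step 3: sort by ascending similarity; step 4: return the last k elements.
    return np.argsort(similarities)[-k:][::-1]
```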
This is a very good solution (especially because it works). Figure 3 shows this algorithm in action. Every row shows the top 9 most similar pictures to the first picture in the row. The first row captures very rounded 3’s, the second inclined 3’s, the fourth row shows 2’s with a loop at the bottom, and the fifth row shows Z-like 2’s. Notice that this algorithm has no information about what digit is in the picture (nor, for that matter, about what kind of thing the picture contains), but it nevertheless succeeds in grouping by digit, and even by typographic style.
But let’s take a closer look by analysing its computational complexity. Consider n pictures with d pixels each:
- Computing a feature vector is $O(d)$. As this is done for every picture, the first step is $O(nd)$.
- The similarity function is $O(d)$, and again this happens $n$ times, so the second step is also $O(nd)$.
- The third step – sort $n$ elements – is $O(n \log n)$.
- Finally returning the last $k$ elements could be constant, but let’s consider it $O(k)$.
In total we get:
\begin{equation} O(nd) + O(nd) + O(n \log n) + O(k) \text{ (2)} \end{equation}
$k$ is very small compared to $n$ and $d$, so we can neglect the last term. We can also collapse the first two terms into a single $O(nd)$. Now, note that $\log n$ is much smaller than $d$: for example, if we have 10 million pictures with 256 ($d$) pixels each, $\log n$ is 7, much smaller than $d$. That means we can turn the $O(n \log n)$ into another $O(nd)$. Therefore the computational complexity of this algorithm is $O(nd)$, that is, linear in both the number of images in our database and the number of pixels per picture.
Fig 3: k nearest neighbours for some pictures. Every row depicts the top 9 most similar pictures to the first picture in the row
3. Can we do better?
Having done a computational complexity analysis, it is natural to ask ourselves whether we can do better. So let’s try.
One idea to consider is to use a heap to keep track of the $k$ most similar items. The heap never grows larger than $k$, so every insertion costs $O(\log k)$; but we still need to compute the similarity of each of the $n$ candidates, which is $O(d)$ per item. In total we get $O(nd + n \log k)$, which is not an improvement.
We could also try to exploit the fact that we do not need to sort the $n$ elements, just to get the top $k$. The algorithm would be exactly the same, but step 3 would apply quick-select instead of sorting. This would change the $O(n \log n)$ term to $O(n)$, which gives $O(nd) + O(n) = O(nd)$ – again, not an improvement.
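For reference, NumPy’s `argpartition` (an introselect-based partial sort) already implements this quick-select idea; a sketch replacing step 3 of the algorithm above could look like this:

```python
import numpy as np

def top_k_indices(similarities: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k largest similarities in O(n), without sorting all n values."""
    top = np.argpartition(similarities, -k)[-k:]
    # Ordering just the k winners costs only O(k log k), which is negligible.
    return top[np.argsort(similarities[top])[::-1]]
```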
The last idea we will consider is to use a Space Partitioning Tree (SPT). An SPT is a data structure that allows us to find the closest object to another object in logarithmic time. A priori this seems to be the right solution but there is a problem: SPTs can only operate under certain distance functions, specifically metric distances.
SPTs work with distances, not with similarities. But there is a very close relationship between similarity and distance: in the context of $k$-NN, for every similarity function there exists a distance function such that searching for the $k$ most similar items is equivalent to searching for the $k$ closest items using that distance function. Just multiplying the similarity by $-1$ gives such a distance. So now we have a cosine distance that we could use in an SPT, but unfortunately this cosine distance is not a metric distance.
A metric distance is a distance that satisfies the following conditions:
- $distance(x, y) \geq 0$
- $distance(x, y)=0 \iff x=y$
- $distance(x, y)=distance(y, x)$
- $distance(x, z)\leq distance(x, y)+ distance(y, z)$
Cosine distance clearly violates the first condition, but this is easy to fix by just adding 1. The second and third conditions are met. The fourth condition (the triangle inequality), however, is violated, and this time we cannot fix it. Here is an example of three vectors that violate it:
$x=(1,0), y=(1,1), z=(0,1)$. Then:
$distance(x,y)=1-\dfrac{\langle x, y \rangle}{||x||||y||} = 1-\dfrac{1}{\sqrt{2}}$
$distance(y, z)=1-\dfrac{\langle y, z \rangle}{||y||||z||} = 1-\dfrac{1}{\sqrt{2}}$
$distance(x, z)=1-\dfrac{\langle x, z \rangle}{||x||||z||} = 1$
$distance(x,y)+distance(y, z)= 1-\dfrac{1}{\sqrt{2}} + 1-\dfrac{1}{\sqrt{2}} = 2-\dfrac{2}{\sqrt{2}} \approx 0.59$
And then:
$distance(x, z) \not \leq distance(x,y) + distance(y,z)$
which proves that condition 4 is not met.
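The violation is also easy to check numerically (a quick sketch of the counterexample above):

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

x, y, z = np.array([1.0, 0.0]), np.array([1.0, 1.0]), np.array([0.0, 1.0])
print(cosine_distance(x, y) + cosine_distance(y, z))  # ~0.586
print(cosine_distance(x, z))                          # 1.0, which is larger
```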
In the following sections we are going to show a trick to overcome this limitation.
4. Maths
Let’s introduce some properties of vectors that we’ll exploit later.
Cosine distance is invariant under normalization
First, let’s make a few definitions:
- Norm of a vector $X$: Denoted by $\lVert X \rVert$ and computed as $\sqrt{\sum_i^d X_i^2}$. This is exactly the geometric length of the vector.
- Normalization of a vector $X$: Denoted by $\hat X$ and computed as $\dfrac{X}{\lVert X \rVert}$, which is just dividing every component of the vector $X$ by the norm of $X$.
A consequence of these two definitions is the following:
\begin{equation} \lVert \hat X \rVert = 1 \text{ (3)} \end{equation}
Which says that the norm of a normalized vector is always 1. This property is quite obvious, but here is a proof:
$\strut \lVert\hat X \rVert^2=\strut\sum_i^d \hat{X_i}^2 ={\sum_i^d {\dfrac{X_i^2}{\lVert X \rVert^2}}} = {\dfrac{1}{{\lVert X \rVert}^2}\sum_i^d X_i^2 } =\dfrac{1}{{\lVert X \rVert}^2}{\lVert X \rVert}^2 = 1 $
And since $\strut \lVert\hat X \rVert$ must be positive we conclude that $\lVert \hat X \rVert =1$.
Another consequence is the following:
\begin{equation} \label{eq:invariant} cos(\hat X, \hat Y) = cos(X, Y) \text{ (4)} \end{equation}
In words, this means that the angle between two vectors doesn’t change when the vectors are normalized. Normalization only changes the length (the norm) of a vector, not its direction, and therefore the angle is preserved. Again, here is the proof:
$cos(\hat X, \hat Y) = \dfrac{\langle \hat X, \hat Y \rangle}{\lVert \hat X \rVert \lVert \hat Y \rVert} = \langle \hat X, \hat Y \rangle = \langle \dfrac{X}{\lVert X \rVert}, \dfrac{Y}{\lVert Y \rVert} \rangle = \sum_i^d \dfrac{X_i}{\lVert X \rVert}\dfrac{Y_i}{\lVert Y \rVert} = \dfrac{\sum_i^d X_i Y_i}{\lVert X \rVert \lVert Y \rVert} = \dfrac{\langle X, Y \rangle}{\lVert X \rVert \lVert Y \rVert} = cos(X, Y)$
Now since $cos(\hat X, \hat Y) = cos(X, Y)$, we also have $1-cos(\hat X, \hat Y) = 1-cos(X, Y)$, which means that the cosine distance between $X$ and $Y$ is exactly the same as the cosine distance between their normalized versions $\hat X$ and $\hat Y$.
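Both facts are easy to sanity-check with a couple of random vectors (a sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=256), rng.normal(size=256)
x_hat, y_hat = x / np.linalg.norm(x), y / np.linalg.norm(y)

def cos(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(np.linalg.norm(x_hat))                     # ~1.0: a normalized vector is a unit vector
print(np.isclose(cos(x, y), cos(x_hat, y_hat)))  # True: cosine is invariant under normalization
```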
From Euclidean to Cosine
The second property we need for this trick is the following:
If $X$ and $Y$ are vectors with norm 1 (unit vectors) then: \begin{equation} \label{eq:euclideaniscosine} \lVert X-Y \rVert = \sqrt{2-2cos(X,Y)} \text{ (5)} \end{equation}
This states that if $X$ and $Y$ are unit vectors then there is an exact relationship between the Euclidean distance from $X$ to $Y$ and the angle between them.
The proof:
\begin{equation} {\lVert X-Y \rVert}^2 = X^2 - 2\langle X, Y \rangle + Y^2 \text{ (6)} \end{equation}
Since $X$ is a unit vector then $X^2 = \langle X,X \rangle = \sum_i^d X_i X_i = {\lVert X \rVert}^2 = 1$, likewise, $Y^2=1$ then we have: \begin{equation} {\lVert X-Y \rVert}^2 = X^2 - 2\langle X, Y \rangle + Y^2 = 2-2\langle X, Y \rangle \text{ (7)} \end{equation}
And since $X$ and $Y$ are unit vectors, dividing by their norm is dividing by one:
\begin{equation} {\lVert X-Y \rVert}^2 = X^2 - 2\langle X, Y \rangle + Y^2 = 2-2\langle X, Y \rangle = 2-2\dfrac{\langle X, Y \rangle}{\lVert X \rVert \lVert Y \rVert} = 2-2cos(X,Y) \text{ (8)} \end{equation}
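Again, a quick numerical check of equation 5 (a sketch with two random unit vectors):

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.normal(size=256), rng.normal(size=256)
x, y = x / np.linalg.norm(x), y / np.linalg.norm(y)   # make them unit vectors

euclidean = np.linalg.norm(x - y)
from_cosine = np.sqrt(2.0 - 2.0 * np.dot(x, y))       # cos(X, Y) = <X, Y> for unit vectors
print(np.isclose(euclidean, from_cosine))             # True
```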
Cosine ranking is equivalent to Euclidean ranking
By looking at equation 5 we can already see that if $1-cos(X,Y)$ gets bigger, then $\lVert X-Y \rVert$ must get bigger too. That means that if $Y$ moves away from $X$ in the Euclidean space, it also moves away in the cosine space, provided both $X$ and $Y$ are unit vectors. This allows us to establish the following:
Consider three arbitrary $d$-dimensional vectors $X$, $A$ and $B$ (they don't need to be unit vectors). Then the following holds: \begin{equation} \textit{if } 1-cos(X,A)<1-cos(X,B) \textit{ then } \lVert \hat X- \hat A \rVert < \lVert \hat X- \hat B \rVert \text{ (9)} \end{equation}
This equation says that if the cosine distance between $X$ and $A$ is less than the cosine distance between $X$ and $B$, then the Euclidean distance between $\hat X$ and $\hat A$ is also less than the Euclidean distance between $\hat X$ and $\hat B$. In other words, if $A$ is closer to $X$ than $B$ in the cosine space, it is also closer in the Euclidean space after normalization.
The proof: We start from the left hand side expression and apply operations to get to the right hand side expression.
\begin{equation} 1-cos(X,A)<1-cos(X,B) \implies 1-cos(\hat X,\hat A)<1-cos(\hat X,\hat B) \text{ (10)} \end{equation} cosine is invariant under normalization (see equation 4)
\begin{equation} 1-cos(\hat X,\hat A)<1-cos(\hat X,\hat B) \implies \sqrt{2-2cos(\hat X,\hat A)} < \sqrt{2-2cos(\hat X,\hat B)} \text{ (11)} \end{equation} doubling both sides and taking the square root preserves the inequality
\begin{equation} \sqrt{2-2cos(\hat X,\hat A)} < \sqrt{2-2cos(\hat X,\hat B)} \implies \lVert \hat X- \hat A \rVert < \lVert \hat X- \hat B \rVert \text{ (12)} \end{equation} normalized vectors are unit vectors, so equation 5 applies
This is all the maths we need to apply the trick. Let’s see what it is about.
5. The $k$-NN Trick
The goal of this trick is to find a way to use cosine similarity with a Space Partitioning Tree, which would give us $O(\log n)$ time complexity – a huge improvement.
The idea is actually very simple: since cosine similarity is invariant under normalization, we can just normalize all our feature vectors and the $k$ nearest neighbours to $X$ will be exactly the same. But now our vectors are all unit vectors, which means that sorting them by cosine distance to $X$ is exactly the same as sorting them by Euclidean distance to $X$. And since the Euclidean distance is a proper metric, we can use a Space Partitioning Tree and enjoy the logarithm of $n$. Here’s the recipe:
- Normalize all the feature vectors in the database
- Build a Space Partitioning Tree using the normalized vectors
- When the $k$ nearest neighbours to an input vector $X$ are requested:
- Normalize $X$
- Look up the $k$-NN in the Space Partitioning Tree
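A minimal sketch of this recipe with scikit-learn’s `BallTree` (the function names are ours; swapping in `KDTree` works the same way):

```python
import numpy as np
from sklearn.neighbors import BallTree

def build_index(database: np.ndarray) -> BallTree:
    """Steps 1 and 2: normalize the feature vectors and build the tree (Euclidean metric)."""
    normalized = database / np.linalg.norm(database, axis=1, keepdims=True)
    return BallTree(normalized)

def knn_query(tree: BallTree, x: np.ndarray, k: int) -> np.ndarray:
    """Step 3: normalize the input vector and look up its k nearest neighbours."""
    x_hat = x / np.linalg.norm(x)
    _, indices = tree.query(x_hat.reshape(1, -1), k=k)
    return indices[0]
```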
6. Experiments
Experimentation is at the core of Product Development at Booking.com. Every idea is welcomed, turned into a hypothesis and validated through experimentation. And Data Science doesn’t escape that process.
In this case, the idea has been thoroughly described and supported with practical examples and even maths. But let’s see if reality agrees with our understanding. Our hypothesis is the following: we can improve the response time of the algorithm described in section 2 by applying the trick described in section 5, while guaranteeing exactly the same results.
To test this hypothesis we designed an experiment that compares the time needed to solve the $k$-NN problem using the full scan solution with the time needed by the $k$-NN trick solution. The $k$-NN trick is implemented using two different Space Partitioning Trees: a Ball Tree and a KD-Tree.
The database consists of handwritten digit pictures from MNIST. For $n$ ranging from 5,000 to 40,000 we randomly sampled $n$ pictures from the original database, then applied the different solutions to the same sample, computing the 10 most similar pictures for 20 input pictures.
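A simplified sketch of the comparison we ran (loading MNIST and the loop over different values of $n$ are omitted; names are illustrative):

```python
import time
import numpy as np
from sklearn.neighbors import BallTree, KDTree

def time_query_phase(pictures: np.ndarray, queries: np.ndarray, k: int = 10) -> dict:
    """Time the full scan against the k-NN trick on the same normalized data."""
    data = pictures / np.linalg.norm(pictures, axis=1, keepdims=True)
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    timings = {}

    start = time.perf_counter()
    for query in q:
        np.argsort(data @ query)[-k:]                # full scan (brute force)
    timings["brute force"] = time.perf_counter() - start

    for name, tree_cls in [("ball tree", BallTree), ("kd tree", KDTree)]:
        tree = tree_cls(data)                        # build time measured separately (section 8)
        start = time.perf_counter()
        tree.query(q, k=k)
        timings[name] = time.perf_counter() - start
    return timings
```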
7. Results
The results of our experiment are summarized by Figure 4:
Fig 4: Comparison of the full scan solution (brute force) and the $k$-NN trick (norm euclidean ball tree, and kd tree) for different database sizes n
From the chart we can draw several conclusions. First, the time complexity of the full scan solution is indeed linear in $n$, as suggested by the blue dots; this confirms the theoretical analysis in section 2. Second, although it is hard to say whether the $k$-NN trick based solution is logarithmic, it is clearly much better than the full scan, as suggested by the green and red dots. Third, the Ball Tree based solution is better than the KD-Tree based solution, though the reason for this is not clear and requires further analysis and experimentation. Overall, the experiment strongly supports the hypothesis.
8. The Trap
Every trick sets up a trap, and every gain in one aspect hides a loss in another. Being aware of these traps is key to successfully applying such tricks. Let’s see what trap the $k$-NN trick sets – or, in more technical words, what kind of trade-off we are dealing with.
In the simple solution, before being able to answer a $k$-NN query all we need to do is compute the feature vector of every object in the database. When using the trick, on the other hand, we not only need to compute the feature vectors but also build the Space Partitioning Tree before we can answer a single query. In the experiment we ran, we also recorded the time it takes before queries can be answered. The results are displayed in Figure 5 and show that the trick based solutions scale much worse than the simple solution in this respect. This means that when using the trick we are trading off query response time against start-up time.
This trade-off must be considered carefully, and for big databases it can have very negative consequences. Consider an e-commerce website that goes down for whatever reason, and imagine that this website uses $k$-NN to serve some recommendations (a very important yet not critical part of the system). Once we fix the problem, we want the system to reboot as quickly as possible, but if the booting process depends on the $k$-NN system we fall into the trap – users won’t be able to purchase anything until our Space Partitioning Tree is built. Not good.
This can easily be solved by breaking the dependence, using parallel or asynchronous processes to boot different parts of the system, but the simple solution is clearly more robust in this respect, to the point where we don’t even need to care. The $k$-NN trick forces us to consider this aspect carefully and act accordingly. For many applications, this isn’t a bad price to pay for the speed and scalability we get at query time.
9. Conclusion
In this post we described a trick to speed up the standard solution to the $k$-NN problem with cosine similarity. We presented the mathematical rationale for the trick, as well as experiments that support its validity. We consider this a good example of a scalability problem overcome by applying elementary maths. It is also a good example of Reductionism: the trick is a reduction from the cosine-similarity $k$-NN problem to the Euclidean-distance $k$-NN problem, which is a much better studied problem. Maths and Reductionism are two concepts sitting at the core of applied Data Science at Booking.com, always at the service of the best travelling experience.
Fig 5: Ready-time comparison of the full scan solution and the $k$-NN trick for different database sizes n