[Using LSH Algorithm for Document Search]

In this article, I will propose and analyze an improved LSH algorithm for the text retrieval problem. First of all, I recommend studying several articles about LSH itself (this one, for example), because I will not delve into the mathematics of this algorithm.

If you are already familiar with the algorithm, here is a short reminder of its features:

Before we get into LSH itself, I'd like to directly measure the key metrics to understand what we're getting into. I use the Google trends corpus because it contains short queries that are easy to work with. Below I will do some pre-processing.

I will solve the problem of finding the top 3 most similar queries. Important note - I will be running all processes in one thread. With the classic approach, I need to calculate the Jaccard metrics between the example and each query in the corpus. To speed up this process, I will save the corpus n-grams into a simple dictionary in advance. To measure the speed of the direct approach, I scan the entire corpus and calculate Jaccard metrics for each example. And to get the maximum Jaccard score we can achieve, I calculate the Jaccard score between the most similar pairs and average it.

The algorithm training time took approximately 2-3 seconds (Macbook Pro 2,6 GHz 6-Core Intel Core i7). Let's make sure we've counted the nearest pairs correctly.

Let's calculate the maximum average score we can get close to.

And let's calculate the time spent searching for the nearest strings in the entire corpus.

The direct approach has obvious problems: storing n-grams requires a lot of memory (storing one-hot encoded vectors is no longer efficient), it does not scale well, and it is a slow algorithm. Of course, we wouldn't do this in real life, but these metrics clearly show the starting position.

But how does classic LSH work? The first step is shingling. As I mentioned, this is the same process as converting tokens to n-grams. The next step is to get a one-hot vector of shingles. But before this, it is necessary to calculate all possible shingles of the corpus.

Link nội dung: https://itt.edu.vn/delve-into-la-gi-a11396.html