In this article, I will propose and analyze an improved LSH algorithm for the text retrieval problem. First of all, I recommend studying several articles about LSH itself (this one, for example), because I will not delve into the mathematics of this algorithm.
If you are already familiar with the algorithm, here is a short reminder of its features:
- The algorithm approximates the Jaccard metric, but does so much faster than an exact pairwise comparison, with minimal loss of accuracy.
- It is a powerful algorithm that can improve search performance. It can be used directly as a document search tool or as part of a recommendation system for finding relevant documents (due to the nature of the algorithm, relevance here coincides with syntactic similarity).
- Articles about the algorithm use specific terms (a short sketch of the last step follows after this list):
  - A shingle is a fragment of the original document. For textual similarity, a shingle is in most cases a classic character n-gram, although I have also used LSH with token n-grams for specific tasks. Shingling is then the tokenization process.
  - Min hashing is a vectorization process that converts the original set of shingles into a signature.
  - LSH converts signatures into buckets by splitting them into bands and grouping strings that share a band. This allows us to find similar strings faster.
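To make the banding step concrete, here is a minimal toy sketch; the signature values and the number of bands are made up purely for illustration:

```python
# Toy illustration of the banding step: a MinHash signature is split into
# bands, and each band is hashed into a bucket. Documents that share at
# least one bucket become candidate pairs. All values here are made up.
signature = [3, 7, 1, 9, 4, 2, 8, 5]   # hypothetical MinHash signature
bands = 4                              # number of bands
rows = len(signature) // bands         # rows per band

buckets = []
for b in range(bands):
    band = tuple(signature[b * rows:(b + 1) * rows])
    buckets.append(hash((b, band)))    # band index in the key keeps bands separate

print(buckets)  # equal bucket ids across documents -> candidate pairs
```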
Before we get into LSH itself, I'd like to measure the key baseline metrics to understand what we are starting from. I use the Google Trends corpus because it contains short queries that are easy to work with. Below I do some pre-processing.
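The exact pre-processing is not essential for the comparison; a minimal sketch, assuming the queries live in a CSV file with a `query` column (both the file name and the column name are my assumptions), could look like this:

```python
import re
import pandas as pd

# Hypothetical loading of the Google Trends queries.
df = pd.read_csv("google_trends.csv")

def preprocess(text: str) -> str:
    """Lowercase and replace anything that is not a word character or space."""
    text = text.lower()
    return re.sub(r"[^\w\s]", " ", text).strip()

corpus = [preprocess(q) for q in df["query"].astype(str)]
```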
I will solve the problem of finding the top 3 most similar queries. An important note: I will be running all processes in a single thread. With the classic approach, I need to calculate the Jaccard similarity between the example and every query in the corpus. To speed up this process, I save the corpus n-grams in a simple dictionary in advance. To measure the speed of the direct approach, I scan the entire corpus and calculate the Jaccard similarity for each example. And to get the maximum Jaccard score we can achieve, I calculate the Jaccard score between the most similar pairs and average it.
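A minimal sketch of this brute-force baseline could look like the following; the helpers `ngrams`, `jaccard`, and `top_k_similar` are hypothetical names, and `corpus` is the list from the pre-processing sketch above:

```python
def ngrams(text: str, n: int = 3) -> set:
    """Character n-grams of a string as a set of shingles."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Pre-compute the shingle sets for the whole corpus once.
corpus_ngrams = {query: ngrams(query) for query in corpus}

def top_k_similar(example: str, k: int = 3):
    """Scan the entire corpus and return the k most similar queries."""
    example_ngrams = ngrams(example)
    scored = ((jaccard(example_ngrams, shingles), query)
              for query, shingles in corpus_ngrams.items())
    return sorted(scored, reverse=True)[:k]
```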
Training the algorithm took approximately 2-3 seconds (MacBook Pro, 2.6 GHz 6-core Intel Core i7). Let's make sure we've counted the nearest pairs correctly.
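As a quick sanity check, assuming the hypothetical `top_k_similar` helper sketched above, one could spot-check a few queries; a query that is present in the corpus should return itself as its nearest neighbour with a score of 1.0:

```python
# Spot-check a few corpus queries and eyeball their nearest neighbours.
for example in corpus[:3]:
    print(example, "->", top_k_similar(example, k=3))
```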
Let's calculate the maximum average score we can get close to.
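One way to estimate this ceiling, again assuming the hypothetical helpers above and excluding each query's trivial self-match (my assumption), is to average the best remaining Jaccard score per query:

```python
# Upper bound for any approximate method: the average Jaccard score of the
# best non-trivial match for each query in the corpus.
best_scores = []
for example in corpus:
    # skip the query itself, which always scores 1.0
    candidates = [score for score, query in top_k_similar(example, k=2)
                  if query != example]
    if candidates:
        best_scores.append(max(candidates))

max_avg_score = sum(best_scores) / len(best_scores)
print(f"Average best Jaccard score: {max_avg_score:.3f}")
```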
And let's calculate the time spent searching for the nearest strings in the entire corpus.
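A simple timing sketch with `time.perf_counter`; the sample size here is arbitrary:

```python
import time

# Time the brute-force search over the whole corpus for a sample of queries.
sample = corpus[:100]          # arbitrary sample size
start = time.perf_counter()
for example in sample:
    top_k_similar(example, k=3)
elapsed = time.perf_counter() - start
print(f"{elapsed / len(sample):.4f} s per query on average")
```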
The direct approach has obvious problems: storing the n-grams requires a lot of memory (storing one-hot encoded vectors is no longer efficient), it does not scale well, and it is slow. Of course, we wouldn't do this in real life, but these metrics clearly show our starting position.
But how does classic LSH work? The first step is shingling. As I mentioned, this is essentially the process of converting the text into n-grams. The next step is to build a one-hot vector of shingles, but before that, we need to collect all possible shingles of the corpus.
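A minimal sketch of collecting the corpus-wide vocabulary of shingles and one-hot encoding a single document, reusing the hypothetical `ngrams` helper and `corpus` list from above:

```python
# Vocabulary of all shingles observed in the corpus; the one-hot vector of a
# document marks which vocabulary shingles it contains.
vocabulary = sorted(set().union(*(ngrams(query) for query in corpus)))
shingle_index = {shingle: i for i, shingle in enumerate(vocabulary)}

def one_hot(text: str) -> list:
    """One-hot encode a document over the shingle vocabulary."""
    vec = [0] * len(vocabulary)
    for shingle in ngrams(text):
        if shingle in shingle_index:
            vec[shingle_index[shingle]] = 1
    return vec
```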