Trang thông tin tổng hợp
Trang thông tin tổng hợp
  • người nổi tiếng
  • chính tả
  • Giáo Dục
  • Meme
  • Avatar
  • Tranh Tô Màu
  • Ảnh Nail
người nổi tiếng chính tả Giáo Dục Meme Avatar Tranh Tô Màu Ảnh Nail
  1. Trang chủ
  2. Giáo Dục
Mục Lục

[Using LSH Algorithm for Document Search]

avatar
mana
03:10 10/10/2025

Mục Lục

In this article, I will propose and analyze an improved LSH algorithm for the text retrieval problem. First of all, I recommend studying several articles about LSH itself (this one, for example), because I will not delve into the mathematics of this algorithm.

If you are already familiar with the algorithm, here is a short reminder of its features:

  • The algorithm approximates the Jaccard metric, but does so faster with minimal loss of accuracy.
  • This is a powerful algorithm that can improve search performance. It can be used directly as a document search tool or as part of a recommendation system that can find relevant documents (due to the nature of the algorithm, relevance here coincides with syntactic similarity).
  • Articles about the algorithm use specific terms: - The shingle is a fragment of the original document. When we talk about textual similarity, a shingle in most cases is a classic character n- gram. But I have also used LSH with n-gram tokens for specific tasks. Shingling is then a tokenization process. - Min Hashing is a vectorization process that converts the original set of shingles into a signature. - LSH converts signatures into buckets by splitting them into bands and adding strings with the same bands together. This allows us to aggregate similar strings faster.

Before we get into LSH itself, I'd like to directly measure the key metrics to understand what we're getting into. I use the Google trends corpus because it contains short queries that are easy to work with. Below I will do some pre-processing.

I will solve the problem of finding the top 3 most similar queries. Important note - I will be running all processes in one thread. With the classic approach, I need to calculate the Jaccard metrics between the example and each query in the corpus. To speed up this process, I will save the corpus n-grams into a simple dictionary in advance. To measure the speed of the direct approach, I scan the entire corpus and calculate Jaccard metrics for each example. And to get the maximum Jaccard score we can achieve, I calculate the Jaccard score between the most similar pairs and average it.

The algorithm training time took approximately 2-3 seconds (Macbook Pro 2,6 GHz 6-Core Intel Core i7). Let's make sure we've counted the nearest pairs correctly.

Let's calculate the maximum average score we can get close to.

And let's calculate the time spent searching for the nearest strings in the entire corpus.

The direct approach has obvious problems: storing n-grams requires a lot of memory (storing one-hot encoded vectors is no longer efficient), it does not scale well, and it is a slow algorithm. Of course, we wouldn't do this in real life, but these metrics clearly show the starting position.

But how does classic LSH work? The first step is shingling. As I mentioned, this is the same process as converting tokens to n-grams. The next step is to get a one-hot vector of shingles. But before this, it is necessary to calculate all possible shingles of the corpus.

0 Thích
Chia sẻ
  • Chia sẻ Facebook
  • Chia sẻ Twitter
  • Chia sẻ Zalo
  • Chia sẻ Pinterest
In
  • Điều khoản sử dụng
  • Chính sách bảo mật
  • Cookies
  • RSS
  • Điều khoản sử dụng
  • Chính sách bảo mật
  • Cookies
  • RSS

Trang thông tin tổng hợp itt

Website itt là blog chia sẻ vui về đời sống ở nhiều chủ đề khác nhau giúp cho mọi người dễ dàng cập nhật kiến thức. Đặc biệt có tiêu điểm quan trọng cho các bạn trẻ hiện nay.

© 2025 - itt

Kết nối với itt

https://nghengu.vn/ lịch âm hôm nay Xem thời tiết tại thoitiet24.edu.vn vivu88 fo88
Trang thông tin tổng hợp
  • Trang chủ
  • người nổi tiếng
  • chính tả
  • Giáo Dục
  • Meme
  • Avatar
  • Tranh Tô Màu
  • Ảnh Nail
Đăng ký / Đăng nhập
Quên mật khẩu?
Chưa có tài khoản? Đăng ký