SimHash-like algorithm to compare two text documents

Developer https://www.devze.com 2023-03-12 07:31 Source: Internet
The problem: I have a collection of text documents, and I want to pick the one most similar to an input document. The input document could be an exact match or a partly modified version of one in the collection. The algorithm must be very fast.

So far I have found SimHash, which computes a fingerprint for each document in the collection. Are there other algorithms that do the same thing?


LSH (Locality Sensitive Hashing) techniques are general indexing methods. They are very efficient at finding approximate nearest neighbors.

SimHash is one hashing algorithm for LSH. It uses cosine similarity over real-valued data.
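To make the fingerprint idea concrete, here is a minimal SimHash sketch. It is an illustration under assumptions, not a reference implementation: the word tokenizer, the unit weight per token occurrence, the 64-bit width, and the helper names (`tokens`, `simhash`, `hamming`, `most_similar`) are all choices made for this example. Real systems usually weight tokens (for example by TF-IDF) and index the fingerprints rather than scanning them linearly.

```python
import hashlib
import re


def tokens(text):
    """Lowercase word tokens; a deliberately simple tokenizer for illustration."""
    return re.findall(r"\w+", text.lower())


def token_hash(token):
    """Stable 64-bit hash (the built-in hash() is salted per process)."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def simhash(text, bits=64):
    """Return a fingerprint of `bits` bits (at most 64 with this token hash)."""
    assert 1 <= bits <= 64
    counts = [0] * bits
    for tok in tokens(text):
        h = token_hash(tok)
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def most_similar(query, documents):
    """Linear scan: the document whose fingerprint is closest to the query's."""
    q = simhash(query)
    return min(documents, key=lambda d: hamming(q, simhash(d)))
```

An exact copy of a document always has Hamming distance 0 from the original. A lightly edited copy tends to differ in only a few bits, although on very short texts the fingerprint is noisy. For speed, the fingerprints of the collection would be computed once and stored, not recomputed per query as `most_similar` does here.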

MinHash is another hashing algorithm for LSH. It estimates resemblance (Jaccard) similarity over binary vectors, that is, over sets.
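A rough MinHash sketch for comparison. The assumptions are loud here too: documents are turned into sets of 3-word shingles, the "hash functions" are simulated by salting one hash with a seed, and every name (`shingles`, `minhash_signature`, `estimated_jaccard`, `jaccard`) is made up for this example. The fraction of positions where two signatures agree estimates the Jaccard similarity of the underlying sets.

```python
import hashlib
import re


def shingles(text, k=3):
    """Set of k-word shingles; short texts become a single shingle."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def seeded_hash(seed, item):
    """One member of a family of hash functions, selected by `seed`."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=128):
    """For each hash function, keep the minimum hash value over the set."""
    return [min(seeded_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def jaccard(a, b):
    """Exact Jaccard similarity, for checking the estimate on small inputs."""
    return len(a & b) / len(a | b)
```

Signatures are computed once per document and stored. At query time only the short signatures are compared, and an LSH index over bands of the signature can avoid comparing against every document.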

Mining of Massive Datasets, Chapter 3, by Anand Rajaraman and Jeff Ullman, is a good introduction to the problem space and to MinHash in particular.


Have you tried LSH (locality-sensitive hashing) techniques?
