Is there an implementation of the idea described in "Detecting Near-Duplicates for Web Crawling"?

Developer https://www.devze.com 2023-01-22 17:37 Source: Internet

The paper: http://www2007.org/papers/paper215.pdf

I am just wondering whether there are any implementations of chapter 3 of that paper. I mean querying among large datasets, NOT only the simhash (it's easy to find simhash implementations).

Thanks~

Here is one, though I haven't tested whether it works. The good thing is that it's open source.

This is a problem in data mining and similarity search. There are numerous articles describing how it can be done and how to scale it up to massive amounts of data.

I have an implementation (github: mksteve, clustering, with some comments about it in my blog) of a metric tree (wikipedia: Metric tree). This requires that the measure you are using satisfies the triangle inequality (wikipedia: Metric space). That is, the distance from item A to item C is less than or equal to the distance from A to B plus the distance from B to C.

Given that inequality, it is possible to trim the search space, so that only sub-trees which may overlap with your target area are searched. Without that property (a metric space), sub-trees cannot be safely skipped this way.

Possibly the number of differing bits between two simhash values (their Hamming distance) would form such a metric space.
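To make the pruning idea concrete, here is a minimal sketch in Python. It is not the code from the github project mentioned above, and it is not the table-based scheme from chapter 3 of the paper. It uses a BK-tree, which is one simple kind of metric tree for integer-valued distances, and it assumes fingerprints are plain Python integers. The names `hamming`, `BKNode` and `BKTree` are made up for this illustration. Hamming distance does satisfy the triangle inequality, which is what lets `query` skip whole sub-trees.

```python
from typing import Dict, List, Optional, Tuple


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


class BKNode:
    def __init__(self, value: int) -> None:
        self.value = value
        # Children are keyed by their distance to this node's value.
        self.children: Dict[int, "BKNode"] = {}


class BKTree:
    """Hypothetical minimal BK-tree over integer fingerprints."""

    def __init__(self) -> None:
        self.root: Optional[BKNode] = None

    def add(self, value: int) -> None:
        if self.root is None:
            self.root = BKNode(value)
            return
        node = self.root
        while True:
            d = hamming(value, node.value)
            if d == 0:
                return  # fingerprint already stored
            child = node.children.get(d)
            if child is None:
                node.children[d] = BKNode(value)
                return
            node = child

    def query(self, target: int, k: int) -> List[Tuple[int, int]]:
        """Return sorted (distance, value) pairs with distance <= k."""
        results: List[Tuple[int, int]] = []
        stack = [self.root] if self.root is not None else []
        while stack:
            node = stack.pop()
            d = hamming(target, node.value)
            if d <= k:
                results.append((d, node.value))
            # Every item m under edge e is at distance e from node.value.
            # The triangle inequality gives |d - e| <= hamming(target, m),
            # so sub-trees with |d - e| > k cannot contain a match.
            for e, child in node.children.items():
                if d - k <= e <= d + k:
                    stack.append(child)
        return sorted(results)


tree = BKTree()
for fp in [0b0000, 0b0001, 0b0011, 0b0111, 0b1111]:
    tree.add(fp)
print(tree.query(0b0000, 1))
```

With small k (the paper works with k = 3 on 64-bit fingerprints) most edges fall outside the `d - k .. d + k` window, so only a fraction of the tree is visited. How well this scales to billions of fingerprints is a separate question that this sketch does not answer.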

The general usage of these datasets is alluded to in the paper when it mentions MapReduce, which is generally run on a Hadoop cluster. The processing nodes are each given a subset of the data and find a set of target matches from their local datasets. These are then combined to give a fully ordered list of similar items.
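The split, search locally, then combine pattern described above can be sketched in a few lines of plain Python. This is only an illustration running in a single process, not Hadoop code. The function names (`partition`, `local_matches`, `merge_matches`, `find_similar`) and the round-robin sharding are assumptions made for the example.

```python
import heapq
from typing import Iterable, List, Sequence, Tuple


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def partition(items: Sequence[int], n_parts: int) -> List[List[int]]:
    """Give each (simulated) processing node a subset of the data."""
    return [list(items[i::n_parts]) for i in range(n_parts)]


def local_matches(shard: Iterable[int], target: int, k: int) -> List[Tuple[int, int]]:
    """'Map' step: one node scans its own shard and returns sorted matches."""
    return sorted(
        (hamming(target, fp), fp) for fp in shard if hamming(target, fp) <= k
    )


def merge_matches(per_shard: Iterable[List[Tuple[int, int]]]) -> List[Tuple[int, int]]:
    """'Reduce' step: merge the sorted per-node lists into one ordered list."""
    return list(heapq.merge(*per_shard))


def find_similar(items: Sequence[int], target: int, k: int, n_parts: int = 4) -> List[Tuple[int, int]]:
    shards = partition(items, n_parts)
    return merge_matches(local_matches(shard, target, k) for shard in shards)


print(find_similar([0b0000, 0b0001, 0b0011, 0b0111, 0b1111, 0b1000], 0b0000, 1, n_parts=3))
```

In a real cluster each call to `local_matches` would run on a different machine, and the local scan could be replaced by a per-node index such as a metric tree.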

There are some papers (unsure of references) which allude to using M-trees in a cluster, where different parts of the search space are given to different clusters, but I am not sure whether the Hadoop infrastructure would support such a high-level abstraction.