The paper: http://www2007.org/papers/paper215.pdf
I am just wondering whether there are any implementations of chapter 3 of that paper. I mean querying among large datasets, NOT only the simhash itself (it's easy to find simhash implementations).
Thanks~
Here is one, though I haven't tested whether it works. The good thing is that it's open source.
This is a problem in data mining and similarity search. There are numerous articles describing how that can be done, and how to scale it up to massive amounts of data.
I have an implementation (github : mksteve, clustering, with some comments about it in my blog) of a metric tree (wikipedia : Metric tree). This requires that the measure you are using satisfies the triangle inequality (wikipedia : Metric space). That is, the distance from item A to item C is less than or equal to the distance from A to B plus the distance from B to C.
Given that inequality, it is possible to trim the search space, so that only sub-trees which may overlap with your target area are searched. Without that property (a metric space), this pruning cannot be relied on.
The number of differing bits between two simhashes (the Hamming distance) should qualify as such a metric.
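To make the pruning idea concrete, here is a minimal sketch in Python. It uses a BK-tree, which is one simple kind of metric tree for integer-valued distances; it is not the implementation mentioned above and not the table-based method from the paper. The names `hamming` and `BKTree` and the 8-bit sample fingerprints are made up for illustration.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")


class BKTree:
    """Illustrative BK-tree; each node is [value, {edge_distance: child_node}]."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None

    def add(self, value):
        if self.root is None:
            self.root = [value, {}]
            return
        node = self.root
        while True:
            d = self.distance(value, node[0])
            if d == 0:
                return  # already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = [value, {}]
                return
            node = child

    def search(self, query, radius):
        """Return sorted (distance, value) pairs within `radius` of `query`."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            value, children = stack.pop()
            d = self.distance(query, value)
            if d <= radius:
                results.append((d, value))
            # Triangle inequality: a match can only sit in a sub-tree whose
            # edge distance lies in [d - radius, d + radius]; skip the rest.
            for edge, child in children.items():
                if d - radius <= edge <= d + radius:
                    stack.append(child)
        return sorted(results)


# Made-up 8-bit "fingerprints"; real simhashes would typically be 64 bits.
fingerprints = [0b10110010, 0b10110011, 0b00110010, 0b01001101, 0b11111111]
tree = BKTree(hamming)
for fp in fingerprints:
    tree.add(fp)

near = tree.search(0b10110010, 1)  # everything within 1 bit of the query
```

The pruning step in `search` is only valid because the Hamming distance obeys the triangle inequality; with a measure that does not, sub-trees containing matches could be skipped.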
The general usage of these datasets is alluded to in the paper when it mentions MapReduce, which is generally run on a Hadoop cluster. The processing nodes are each given a subset of the data and find a set of candidate matches from their local datasets. These are then combined to give a fully ordered list of similar items.
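The split-then-combine pattern can be sketched in plain Python, without Hadoop or any real MapReduce API. This is only an illustration under assumptions: the function names `map_partition` and `reduce_matches`, the in-memory partitions, and the 4-bit sample fingerprints are all hypothetical.

```python
import heapq
import itertools


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")


def map_partition(partition, query, k):
    """'Map' step: one node ranks its local subset and keeps its k best matches."""
    scored = ((hamming(query, fp), fp) for fp in partition)
    return heapq.nsmallest(k, scored)  # sorted ascending by (distance, value)


def reduce_matches(per_node_matches, k):
    """'Reduce' step: merge the sorted per-node lists into one ordered top-k list."""
    merged = heapq.merge(*per_node_matches)
    return list(itertools.islice(merged, k))


# Three made-up partitions standing in for the data held by three nodes.
partitions = [[0b1011, 0b0000], [0b1010, 0b0111], [0b1111]]
query = 0b1011

local_results = [map_partition(p, query, k=3) for p in partitions]
top = reduce_matches(local_results, k=3)
```

Each node only needs to return its own k best candidates, because the global top k is always contained in the union of the local top-k lists.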
There are some papers (I am unsure of the references) which allude to using m-trees in a cluster, where different parts of the search space are given to different clusters, but I am not sure whether the Hadoop infrastructure would support using such a high-level abstraction.