I need to calculate the similarity of a query and a document in Lucene using Jaccard similarity over n-grams. As Jaccard similarity is a very common measure in IR, I expected to find a Lucene implementation of it, but I couldn't.
Is anyone aware of such an implementation?
The only implementation I'm aware of that can be easily integrated with Lucene is the one from LingPipe (please note that it's free only for non-commercial/research use). Here is a blog post showing how to use it in LingPipe. A detailed explanation of how to connect the two libraries is available on the LingPipe website and in this book.
I haven't evaluated, however, whether it would be easier (also from a licensing point of view) to integrate some other implementation on your own; it's just a solution that worked for me.
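If you do go the roll-your-own route, the core computation is small. Here is a minimal sketch in plain Java, assuming character n-grams and set (not multiset) semantics. The class and method names are made up for illustration, and it deliberately doesn't touch any Lucene or LingPipe API, so you would still have to wire it into your own tokenization and scoring.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not part of Lucene or LingPipe.
final class NGramJaccard {

    // Collect the distinct character n-grams of a string.
    // Strings shorter than n yield an empty set.
    static Set<String> ngrams(String text, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    // Jaccard similarity: |A ∩ B| / |A ∪ B|.
    // Assumption: two empty sets score 0.0 (a convention chosen here).
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    // Convenience wrapper: Jaccard over the n-gram sets of two strings.
    static double similarity(String query, String document, int n) {
        return jaccard(ngrams(query, n), ngrams(document, n));
    }

    public static void main(String[] args) {
        System.out.println(similarity("night", "nacht", 2));
    }
}
```

For "night" vs. "nacht" with bigrams, the only shared bigram is "ht" out of 7 distinct ones, so the score is 1/7. If you want word n-grams instead, only the ngrams method would need to change.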
Try this library: http://sourceforge.net/projects/simmetrics/. You'll find many more similarity functions there. However, I would recommend using SoftTFIDF from http://secondstring.sourceforge.net/; it has the best precision/recall according to "A Comparison of String Distance Metrics for Name-Matching Tasks" by William W. Cohen and others.