How do I detect if there is already a similar document stored in Lucene index

Developer https://www.devze.com 2022-12-19 11:51 Source: Internet

I need to exclude duplicates in my database. The problem is that duplicates are not exact matches but rather similar documents. For this purpose I decided to use FuzzyQuery as follows:

var fuzzyQuery = new global::Lucene.Net.Search.FuzzyQuery(
                     new Term("text", queryText),
                     0.8f,   // minimum similarity
                     0);     // prefix length
hits = _searcher.Search(fuzzyQuery);

The idea was to set the minimum similarity to 0.8 (which I think is high enough) so that only similar documents are found, excluding those that are not sufficiently similar.

To test this code I checked whether it finds a document that already exists. I assigned the variable queryText a value that is stored in the index. The code above found nothing; in other words, it does not detect even an exact match.

The index was built with this code:

 doc.Add(new global::Lucene.Net.Documents.Field(
            "text",
            text,
            global::Lucene.Net.Documents.Field.Store.YES,
            global::Lucene.Net.Documents.Field.Index.TOKENIZED,
            global::Lucene.Net.Documents.Field.TermVector.WITH_POSITIONS_OFFSETS));

I followed the recommendations below, and the results are: TermQuery doesn't return any results. The query constructed with

 var _analyzer = new RussianAnalyzer();
 var parser = new global::Lucene.Net.QueryParsers
                .QueryParser("text", _analyzer);
 var query = parser.Parse(queryText);
 var _searcher = new IndexSearcher
       (Settings.General.Default.LuceneIndexDirectoryPath);
 var hits = _searcher.Search(query);

returns several results: the document that is an exact match has the maximum score, followed by several other documents with similar content.

It might help to look inside the index: this will clearly show what data you're querying against and how Lucene 'sees' your data. You can use Luke for this. It has some known compatibility issues with Lucene.NET but is much better than nothing.

I second the recommendation for Luke. A few other things to try:

Try an exact query first, say a TermQuery for the term "text". If this doesn't work, no fuzzy query will.
Use Explain() to see how the scoring went (that is, provided you get other hits).
Follow the suggestions from Debugging Relevance Issues in Search.
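
The first two suggestions above can be sketched roughly as follows. This is only an illustration, assuming the Lucene.Net 2.x API that the question's code appears to use (Hits, Field.Index.TOKENIZED). The class name ExactMatchCheck, the sample text, and the use of StandardAnalyzer instead of RussianAnalyzer are assumptions made to keep the sketch self-contained. The point it tries to show is that TermQuery and FuzzyQuery are not analyzed: each one takes a single term and compares it against the individual tokens stored in the index, not against the whole field text.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;

public static class ExactMatchCheck
{
    // Builds a one-document in-memory index (hypothetical sample text).
    public static RAMDirectory BuildIndex()
    {
        var directory = new RAMDirectory();
        var writer = new IndexWriter(directory, new StandardAnalyzer(), true);
        var doc = new Document();
        doc.Add(new Field("text", "Quick brown fox",
                          Field.Store.YES, Field.Index.TOKENIZED));
        writer.AddDocument(doc);
        writer.Close();
        return directory;
    }

    public static void Main()
    {
        var searcher = new IndexSearcher(BuildIndex());

        // One token, written as the analyzer stored it (lower-cased):
        // this finds the document.
        var singleToken = new TermQuery(new Term("text", "quick"));

        // The whole field text as one term: no such token was indexed,
        // so this finds nothing.
        var wholeText = new TermQuery(new Term("text", "Quick brown fox"));

        // A fuzzy query also works on a single term ("quik" vs. "quick").
        var fuzzy = new FuzzyQuery(new Term("text", "quik"), 0.7f, 0);

        Console.WriteLine("single token: " + searcher.Search(singleToken).Length());
        Console.WriteLine("whole text:   " + searcher.Search(wholeText).Length());
        Console.WriteLine("fuzzy term:   " + searcher.Search(fuzzy).Length());

        // Explain() shows how the score of each hit was computed.
        Hits hits = searcher.Search(singleToken);
        for (int i = 0; i < hits.Length(); i++)
        {
            Console.WriteLine(searcher.Explain(singleToken, hits.Id(i)).ToString());
        }

        searcher.Close();
    }
}
```

If the single-token query matches while the whole-text query does not, the index itself is probably fine and the query text needs to go through the same analyzer that was used at indexing time, which is what QueryParser does.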

Try the MoreLikeThis class in Lucene. It has some great heuristics encoded that would help you identify "similar" documents.