Exact phrase with special characters in Lucene.net

I have a problem doing a full-text search in Lucene.net where the search result contains special Lucene characters.

I have a field named "content" in my Lucene documents. This field is created as follows and contains the content of the indexed documents:

document.Add(new Field("content", fulltext, Field.Store.YES, Field.Index.ANALYZED));

For creating the index I'm using the StandardAnalyzer.
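The index itself is built roughly like this (a simplified sketch; the directory handling and variable names are placeholders, not my exact production code):

var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
var directory = FSDirectory.Open(new System.IO.DirectoryInfo("index"));

// Create the index and add one document per source text.
var writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
var document = new Document();
document.Add(new Field("content", fulltext, Field.Store.YES, Field.Index.ANALYZED));
writer.AddDocument(document);
writer.Close();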

For querying the index I'm using the following code:

var queryParser = new QueryParser(Lucene.Net.Util.Version.LUCENE_29, "content", analyzer);
queryParser.SetAllowLeadingWildcard(true);
queryParser.SetMultiTermRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);
Query fullTextQuery = queryParser.Parse(queryString);

The query is then added to a BooleanQuery, which is used to get the results from an IndexSearcher. I think the rest of the code is not that important, because the code works like it should for 99% of the queries. I'm also using the StandardAnalyzer for querying the index.
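Simplified, that search part looks roughly like this (again only a sketch with placeholder names):

var booleanQuery = new BooleanQuery();
booleanQuery.Add(fullTextQuery, BooleanClause.Occur.MUST);

var searcher = new IndexSearcher(directory, true);
// The matching documents are then read from the returned TopDocs.
TopDocs hits = searcher.Search(booleanQuery, 10);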

Now here is the problem. Sometimes the "content" field of a document contains text that is separated by a "-":

some text some text selector-lever some text some text

Now when I'm doing a full-text search (exact phrase) using "selector lever", the query looks like this:

content:"selector lever"

The problem here is that the document containing the text above is also found, although it shouldn't be, because the two words are separated by a "-" and not by a blank.

I think it has something to do with the analyzer and the fact that "-" is a special character in Lucene.

Maybe someone can help me solve this problem.

Thanks in advance, Martin


You are right in thinking that the problem is the analyzer that you are using at index time.

From the Lucene javadocs for the StandardTokenizer (the tokenizer used by the StandardAnalyzer):

A grammar-based tokenizer constructed with JFlex

This should be a good tokenizer for most European-language documents:

  • Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
  • Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
  • Recognizes email addresses and internet hostnames as one token.

Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.
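Because of the hyphen rule above, the StandardAnalyzer indexes "selector-lever" as the two adjacent tokens "selector" and "lever", which is exactly what the phrase query "selector lever" looks for. You can verify that by feeding the text through the analyzer yourself; a rough sketch (the TokenStream/attribute API differs slightly between Lucene.Net 2.9 builds, so treat the exact calls as illustrative):

using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Tokenattributes;

var standard = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
TokenStream stream = standard.TokenStream("content", new StringReader("selector-lever"));
// Some Lucene.Net 2.9 builds expose a generic AddAttribute<TermAttribute>() instead of the typeof overload.
var termAttr = (TermAttribute)stream.AddAttribute(typeof(TermAttribute));
while (stream.IncrementToken())
{
    Console.WriteLine(termAttr.Term()); // prints "selector", then "lever" - two separate tokens
}
// A WhitespaceAnalyzer would emit the single token "selector-lever" here instead.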

Therefore, in your case you would need to index your documents with a stricter analyzer such as the WhitespaceAnalyzer, which only splits on whitespace; a sketch of that follows below.
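A minimal sketch of the idea (untested, with placeholder names; RAMDirectory is only used to keep the example self-contained). The important part is to use the same analyzer at index time and at query time:

using Lucene.Net.Analysis;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;

var analyzer = new WhitespaceAnalyzer();
var directory = new RAMDirectory();

// Index time: "selector-lever" stays a single token because the analyzer only splits on whitespace.
var writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
var document = new Document();
document.Add(new Field("content", "some text some text selector-lever some text some text", Field.Store.YES, Field.Index.ANALYZED));
writer.AddDocument(document);
writer.Close();

// Query time: use the same WhitespaceAnalyzer, otherwise the query tokens will not line up with the index.
var queryParser = new QueryParser(Lucene.Net.Util.Version.LUCENE_29, "content", analyzer);
Query fullTextQuery = queryParser.Parse("\"selector lever\"");

var searcher = new IndexSearcher(directory, true);
TopDocs hits = searcher.Search(fullTextQuery, 10); // the hyphenated document no longer matches the exact phrase

Keep in mind that the WhitespaceAnalyzer also does no lowercasing and keeps punctuation attached to the tokens, so case and trailing punctuation behave differently than with the StandardAnalyzer. If that is a problem for your other queries, the javadoc's suggestion of maintaining your own grammar-based tokenizer (keeping the StandardTokenizer behaviour except for the hyphen handling) is the alternative.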
