
Lucene bigrams tokenizer to include punctuation signs

开发者 (https://www.devze.com), 2023-03-11 23:33. Source: the web
Is there any chance that I could use Lucene's ShingleAnalyzerWrapper to generate bigrams that take punctuation signs (e.g. . , ;) into account? Quick example: the field "one two; three four" should yield only two bigrams, (one two) and (three four).


You could create a ShingleAnalyzerWrapper that wraps an analyzer based on LetterTokenizer. LetterTokenizer splits the input text at every non-letter character, so punctuation never ends up inside a token. Something like:

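import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LetterTokenizer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;

public class MyCharAnalyzer extends Analyzer {

  // LetterTokenizer emits maximal runs of letters and drops everything else.
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new LetterTokenizer(reader);
  }
}

// The wrapper's default maximum shingle size is 2, i.e. bigrams.
ShingleAnalyzerWrapper myBigramWrapper = new ShingleAnalyzerWrapper(new MyCharAnalyzer());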

If you wanted better control over what you consider punctuation, you could subclass CharTokenizer and override the isTokenChar() method.
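
To make the target behaviour concrete, here is a minimal plain-Java sketch that uses no Lucene classes at all. It assumes that a "word" is a run of letters and that only . , ; act as boundaries; the class PunctuationAwareBigrams and the helpers isTokenChar, isBoundary and bigrams are hypothetical names invented for this illustration (isTokenChar merely mirrors the idea of the CharTokenizer method of the same name). It may also help when checking what an analyzer chain actually emits, since a tokenizer that simply drops punctuation leaves the surrounding words adjacent, and a shingle filter could then still pair them.

class PunctuationAwareBigrams {

    // Hypothetical predicate in the spirit of CharTokenizer.isTokenChar():
    // true for characters that belong to a word.
    static boolean isTokenChar(char c) {
        return Character.isLetter(c);
    }

    // Hypothetical helper: punctuation that no bigram may span.
    static boolean isBoundary(char c) {
        return c == '.' || c == ',' || c == ';';
    }

    static java.util.List<String> bigrams(String text) {
        java.util.List<String> result = new java.util.ArrayList<>();
        StringBuilder word = new StringBuilder();
        String previous = null; // last complete word in the current segment

        // A trailing sentinel boundary flushes the final word.
        String input = text + ".";
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (isTokenChar(c)) {
                word.append(c);
                continue;
            }
            // Any non-token character ends the current word.
            if (word.length() > 0) {
                String current = word.toString();
                if (previous != null) {
                    result.add(previous + " " + current);
                }
                previous = current;
                word.setLength(0);
            }
            // A punctuation boundary also resets the pairing.
            if (isBoundary(c)) {
                previous = null;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [one two, three four]
        System.out.println(bigrams("one two; three four"));
    }
}

Changing isBoundary is the only step needed to treat other characters as separators; whitespace ends a word but does not reset the pairing.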
