Tokenizing hashtags in Lucene.Net

Developer, https://www.devze.com, 2023-02-25 06:51. Source: web
I am using Lucene.Net (version 2.9). When indexing tweets, I would like to preserve tokens such as '@name' or '#Note'.

I used the Lucene AnalyzerViewer tool (http://www.codeproject.com/KB/cs/lucene_analysis.aspx?msg=3326095#xx3326095xx) to review the tokens produced by different analyzers.

For example, these are the tokens produced from the text "#Note: Excercise, to live longer.":

  • Whitespace Analyzer: [#Note:] [Excercise,] [to] [live] [longer.]
  • Standard Analyzer: [note] [excercise] [live] [longer]
  • Simple Analyzer: [note] [excercise] [to] [live] [longer]

Only the Whitespace Analyzer preserves the hashtag, so I created a custom analyzer that uses WhitespaceTokenizer and then lowercases each token.

Custom analyzer code:

using Lucene.Net.Analysis;

public class CustomAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
    {
        // Split on whitespace only, so '#' and '@' stay attached to their tokens.
        TokenStream result = new WhitespaceTokenizer(reader);

        // Make sure everything is lower case.
        result = new LowerCaseFilter(result);

        // Return the built token stream.
        return result;
    }
}

However, the custom analyzer leaves punctuation attached to the tokens. Tokens produced by the custom analyzer: [#note:] [excercise,] [to] [live] [longer.]

Is there a filter I can use that preserves the '#' and '@' prefixes but removes the remaining punctuation?

Thanks in advance.


In the Java version of Lucene there is a PatternAnalyzer, which lets you specify a pattern that is used to split the text into tokens.

Documentation: http://lucene.apache.org/java/2_9_4/api/contrib-memory/org/apache/lucene/index/memory/PatternAnalyzer.html

You could look for a .NET version of this analyzer or port it yourself.
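
To make the splitting idea concrete, here is a minimal sketch that uses only System.Text.RegularExpressions and does not touch Lucene.Net at all. The class name HashtagSplitSketch and the separator pattern are assumptions chosen for illustration, not part of PatternAnalyzer or Lucene.Net. A ported analyzer or custom filter could apply similar logic, but that wiring is not shown here.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of Lucene.Net.
public static class HashtagSplitSketch
{
    // Separator pattern (an assumption): any run of characters that are
    // not word characters, '#', or '@'.
    private static readonly Regex Separators = new Regex(@"[^\w#@]+");

    public static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();
        foreach (string piece in Separators.Split(text))
        {
            // Split yields empty strings at the edges when the text
            // starts or ends with a separator, so skip those.
            if (piece.Length > 0)
            {
                tokens.Add(piece.ToLowerInvariant());
            }
        }
        return tokens;
    }

    public static void Main()
    {
        List<string> tokens = Tokenize("#Note: Excercise, to live longer.");
        Console.WriteLine("[" + string.Join("] [", tokens.ToArray()) + "]");
        // prints: [#note] [excercise] [to] [live] [longer]
    }
}
```

Note that this pattern keeps '#' and '@' wherever they occur inside a token, so an input like "user@example.com" would come out as [user@example] [com]. If that matters for your data, the pattern would need to be tightened.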
