开发者

Lucene.NET: Camel case tokenizer?

开发者 https://www.devze.com 2023-01-15 10:22 出处:网络
I\'ve started playing with Lucene.NET today and I wrote a simple test method to do indexing and searching on source code files. The problem is that the standard analyze开发者_如何学Gors/tokenizers tre

I've started playing with Lucene.NET today and I wrote a simple test method to do indexing and searching on source code files. The problem is that the standard analyze开发者_如何学Gors/tokenizers treat the whole camel case source code identifier name as a single token.

I'm looking for a way to treat camel case identifiers like MaxWidth into three tokens: maxwidth, max and width. I've looked for such a tokenizer, but I couldn't find it. Before writing my own: is there something in this direction? Or is there a better approach than writing a tokenizer from scratch?

UPDATE: in the end I decided to get my hands dirty and I wrote a CamelCaseTokenFilter myself. I'll write a post about it on my blog and I'll update the question.


Solr has a WordDelimiterFactory which generates a tokenizer similar to what you need. Maybe you can translate the source code into C#.


Below link might be helpful to write custom tokenizer...

http://karticles.com/NoSql/lucene_custom_tokenizer.html


Here is my implementation :

package corp.sap.research.indexing;

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CamelCaseFilter extends TokenFilter {

    private final CharTermAttribute _termAtt;

    protected CamelCaseScoreFilter(TokenStream input) {
        super(input);
        this._termAtt = addAttribute(CharTermAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken())
            return false;
        CharTermAttribute a = this.getAttribute(CharTermAttribute.class);
        String spliettedString = splitCamelCase(a.toString());
        _termAtt.setEmpty();
        _termAtt.append(spliettedString);
        return true;

    }


    static String splitCamelCase(String s) {
           return s.replaceAll(
              String.format("%s|%s|%s",
                 "(?<=[A-Z])(?=[A-Z][a-z])",
                 "(?<=[^A-Z])(?=[A-Z])",
                 "(?<=[A-Za-z])(?=[^A-Za-z])"
              ),
              " "
           );
        }
}
0

精彩评论

暂无评论...
验证码 换一张
取 消