
Lucene 3.1 payload

https://www.devze.com 2023-03-18 21:34 (source: web)
I'm trying to figure out the way that payloads work in Lucene and I can't seem to grasp it. My situation is as follows:

I need to index a document that has a single content field and attach to each token from the text within that field a payload (some 10 bytes). The analyzer I need to use is a basic whitespace analyzer.

From the various articles I've been reading on the internet, the way to do work with payloads would be to create my own Analyzer and attach the payload during the tokenizing step. I've come up with the following code for my new custom analyzer:

public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream tokenStream = new WhitespaceTokenizer(Version.LUCENE_31,
            reader);

    OffsetAttribute offsetAttribute = tokenStream
            .getAttribute(OffsetAttribute.class);
    CharTermAttribute termAttribute = tokenStream
            .getAttribute(CharTermAttribute.class);
    if (!tokenStream.hasAttribute(PayloadAttribute.class)) {
        tokenStream.addAttribute(PayloadAttribute.class);
    }
    PayloadAttribute payloadAttribute = tokenStream
            .getAttribute(PayloadAttribute.class);

    try {
        while (tokenStream.incrementToken()) {
            int startOffset = offsetAttribute.startOffset();
            int endOffset = offsetAttribute.endOffset();

            String token;

            try{
                token = (termAttribute.subSequence(startOffset, endOffset)).toString();
            }
            catch(IndexOutOfBoundsException ex){
                token = new String(termAttribute.buffer());
            }

            byte[] payloadBytes = payloadGenerator.generatePayload(token,
                    frequencyClassDigest);
            payloadAttribute.setPayload(new Payload(payloadBytes));
        }
        tokenStream.reset();

        return tokenStream;
    } catch (IOException e) {
        e.printStackTrace();
        return null;
    }
}

The problems that I am having are the following:

  1. I can't correctly read the individual tokens. I'm not sure that using CharTermAttribute is the correct way to do it, but I know that it just doesn't work. I need to get at the individual token in order to calculate the payload correctly, but somehow the WhitespaceTokenizer returns the individual words glued together (3 words at a time).
  2. I don't know if using PayloadAttribute is the correct way to attach a payload to a token. Maybe you know of another way?

Where can I find a good tutorial on how to actually use payloads in Lucene? I've tried searching the web, and the only good article I could find was this: Lucene Payload tutorial. However, it doesn't exactly suit my needs.

Thank you



You could encapsulate your payload generation logic inside a filter that would generate a payload for each token that comes through the filter. I've modeled this off Lucene's DelimitedPayloadTokenFilter.

public final class PayloadGeneratorFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PayloadAttribute payAtt = addAttribute(PayloadAttribute.class);
  private final PayloadGenerator payloadGenerator;
  private final FrequencyClassDigest frequencyClassDigest;


  public PayloadGeneratorFilter(TokenStream input, PayloadGenerator payloadGenerator,
                                FrequencyClassDigest frequencyClassDigest) {
    super(input);
    this.payloadGenerator = payloadGenerator;
    this.frequencyClassDigest = frequencyClassDigest;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (input.incrementToken()) {
      final char[] buffer = termAtt.buffer();
      final int length = termAtt.length();
      // Note: buffer.toString() on a char[] would return the array's identity
      // string (e.g. "[C@1a2b3c"), not the term text; copy the valid region instead.
      String token = new String(buffer, 0, length);
      byte[] payloadBytes = payloadGenerator.generatePayload(token, frequencyClassDigest);
      payAtt.setPayload(new Payload(payloadBytes));
      return true;
    }

    return false;
  }
}
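A side note on why the filter reads the term through buffer() and length(): OffsetAttribute's startOffset()/endOffset() index into the original input text, not into CharTermAttribute's reusable term buffer, which is why subSequence(startOffset, endOffset) in the question either throws or returns glued-together garbage. A minimal stdlib-only sketch of the correct extraction (the sample values are made up for illustration):

```java
public class TokenExtraction {
  // Mirrors the fix: copy only the first `length` chars of the term buffer.
  // The buffer is reused across tokens and may contain stale characters
  // past length(), so indexing it with text offsets reads the wrong data.
  public static String termToString(char[] buffer, int length) {
    return new String(buffer, 0, length);
  }

  public static void main(String[] args) {
    // Term buffer for "bar": only the first 3 chars are valid, the rest
    // are leftovers from a previous, longer token.
    char[] buffer = {'b', 'a', 'r', 'x', 'x'};
    System.out.println(termToString(buffer, 3)); // bar
  }
}
```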

This would make your analyzer code very simple:

public class NLPPayloadAnalyzer extends Analyzer {
  private PayloadGenerator payloadGenerator;
  private FrequencyClassDigest frequencyClassDigest;

  public NLPPayloadAnalyzer(PayloadGenerator payloadGenerator,
                            FrequencyClassDigest frequencyClassDigest) {
    this.payloadGenerator = payloadGenerator;
    this.frequencyClassDigest = frequencyClassDigest;
  }

  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream tokenStream = new WhitespaceTokenizer(Version.LUCENE_31, reader);
    tokenStream = new PayloadGeneratorFilter(tokenStream, payloadGenerator, frequencyClassDigest);
    return tokenStream;
  }
}
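The question mentions payloads of roughly 10 bytes; whatever byte[] your generatePayload returns is stored verbatim in the index. As a hedged, stdlib-only sketch of one possible fixed-width layout (the 4-byte score + 6-byte tag split is purely an assumption for illustration, not anything Lucene requires):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class FixedWidthPayload {
  // Pack a float score and a short UTF-8 tag into a fixed 10-byte payload:
  // bytes 0-3 hold the big-endian float, bytes 4-9 hold the (truncated) tag.
  public static byte[] encode(float score, String tag) {
    ByteBuffer buf = ByteBuffer.allocate(10);
    buf.putFloat(score);
    byte[] tagBytes = tag.getBytes(StandardCharsets.UTF_8);
    buf.put(tagBytes, 0, Math.min(tagBytes.length, 6));
    return buf.array();
  }

  // At query time (e.g. in a custom Similarity), read the score back out.
  public static float decodeScore(byte[] payload) {
    return ByteBuffer.wrap(payload).getFloat();
  }
}
```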

The alternative would be to pre-process your payloads and append them to the text that you send to Lucene, then use the DelimitedPayloadTokenFilter. So "text text text text" would become "text|1.0 text|2.2 text|0.5 text|10.5".
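A minimal sketch of that pre-processing step, assuming DelimitedPayloadTokenFilter's default '|' delimiter (scoreFor is a hypothetical stand-in for your own payload computation):

```java
import java.util.StringJoiner;

public class PayloadPreprocessor {
  // Hypothetical payload function; here it just scores a token by its length.
  static float scoreFor(String token) {
    return token.length();
  }

  // Append "|score" to every whitespace-separated token, producing the
  // delimited form that DelimitedPayloadTokenFilter parses at analysis time.
  public static String annotate(String text) {
    StringJoiner out = new StringJoiner(" ");
    for (String token : text.split("\\s+")) {
      out.add(token + "|" + scoreFor(token));
    }
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(annotate("lucene payload demo"));
    // lucene|6.0 payload|7.0 demo|4.0
  }
}
```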

http://sujitpal.blogspot.com/2010/10/denormalizing-maps-with-lucene-payloads.html is also a good resource.

