
Solr: combining EdgeNGramFilterFactory and NGramFilterFactory

https://www.devze.com, 2023-03-31 20:44, source: web
I have a situation where I need to use both EdgeNGramFilterFactory and NGramFilterFactory.

I am using NGramFilterFactory to perform a "contains"-style search with a minimum gram size of 2. I also want to match on the first letter alone, a "starts with" search, using a front-anchored EdgeNGramFilterFactory.

I don't want to lower the NGramFilterFactory minimum gram size to 1, because I don't want to index every single character.

Some help would be greatly appreciated.

Cheers


You don't necessarily have to do all this in the same field. I would create separate fields, each with its own custom field type, so that you can apply each kind of analysis independently.

In the field types below:

  • text contains the original tokens, minimally processed;
  • text_ngram uses the NGramFilter for your two-character-minimum tokens;
  • text_first_letter uses the EdgeNGramFilter for your one-character initial-letter tokens.

If you're processing all text fields in this way, then you might be able to get away with using a copyField to populate the fields. Otherwise, you can instruct your Solr client to send in the same field values for the three separate field types.

When searching, list all three fields in the qf parameter (used by the dismax and edismax query parsers).

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
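    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- StandardFilterFactory was removed in Solr 8; omit it (here and in the types below) on newer versions -->
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>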
  </analyzer>
</fieldType>

<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- NGramFilterFactory has no "side" attribute; it emits grams from every position in the token -->
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
</fieldType>

<fieldType name="text_first_letter" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="1" side="front"/>
  </analyzer>
</fieldType>

Setting up the field and dynamicField definitions is left up to you. Let me know if you have more questions and I can edit with clarifications.
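
For illustration only, the field and copyField wiring might look like the sketch below. The field names title, title_ngram, and title_first_letter are hypothetical; only the three field types come from the answer above, and the indexed/stored choices are assumptions.

```xml
<!-- Hypothetical field names; the types are the three fieldTypes defined above -->
<field name="title" type="text" indexed="true" stored="true"/>
<field name="title_ngram" type="text_ngram" indexed="true" stored="false"/>
<field name="title_first_letter" type="text_first_letter" indexed="true" stored="false"/>

<!-- Copy the raw source value into the derived fields at index time -->
<copyField source="title" dest="title_ngram"/>
<copyField source="title" dest="title_first_letter"/>
```

With fields like these, a request could pass something like defType=edismax&q=h&qf=title title_ngram title_first_letter (URL-encoded as needed) so that all three fields are searched together.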


Start by applying the EdgeNGramFilter with min = 1 and max = 1000 (large enough that the entire original token is included). Example:

hello => 'h', 'he', 'hel', 'hell', 'hello'

Next, apply the NGramFilter with min = 2 (the example also uses 2 as the max, for simplicity):

'h', 'he', 'hel', 'hell', 'hello' => 'h', 'he', 'he', 'el', 'he', 'el', 'll', 'he', 'el', 'll', 'lo'

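You will now have several identical tokens, because the NGramFilter has been applied to every "partial" token from the EdgeNGramFilter. Add the RemoveDuplicatesTokenFilterFactory to remove them; it only drops tokens that share both text and position. Also note that, by default, the NGramFilter emits nothing for a token shorter than its minimum gram size, so verify on your Solr version that the single-character token 'h' actually survives this step.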

'h', 'he', 'he', 'el', 'he', 'el', 'll', 'he', 'el', 'll', 'lo' => 'h', 'he', 'el', 'll', 'lo'

Now your field will support a single-character "startsWith" query and a multi-character "contains" query.
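
As a rough sketch, the three steps above could be written as one field type. The type name text_prefix_ngram and the maxGramSize of 15 are assumptions, and preserveOriginal="true" is included on the assumption that your Solr version supports that attribute on NGramFilterFactory (it exists in newer releases); it is what keeps tokens shorter than minGramSize, such as 'h', from being dropped.

```xml
<!-- Hypothetical type name; sizes other than the two minimums are illustrative -->
<fieldType name="text_prefix_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Step 1: hello => h, he, hel, hell, hello -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="1000"/>
    <!-- Step 2: grams of two or more characters; preserveOriginal keeps originals outside the min/max range -->
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15" preserveOriginal="true"/>
    <!-- Step 3: drop repeated tokens that share text and position -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

The Analysis screen in the Solr admin UI is a quick way to check the tokens this chain really produces. Since the grams are only needed in the index, you may prefer to restrict the chain to an analyzer with type="index" and pair it with a plain query-time analyzer.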
