
How to sort solr without stopwords

开发者 (Developer) https://www.devze.com 2023-01-10 11:22 Source: Internet
I'm trying to sort a Solr query by a field while ignoring stopwords, but I can't find a way to do that. For example, I want the results to be sorted like this:

  • Charlie
  • A Fox
  • Helicopter

Is this possible? Right now the field type is defined like:

    <fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.TrimFilterFactory" />
        <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
      </analyzer>
    </fieldType>

And the field is added like:

    <field name="title" type="alphaOnlySort" indexed="true" stored="false"/>

It seems like someone else would've had to do this too? Or is sorting without stopwords a no-no?


KeywordTokenizerFactory does not break the content into individual tokens, so StopFilterFactory tries to match the single token (the entire content) against the stop word list and finds no match. To get the stop words out of the index you would need a tokeniser such as WhitespaceTokenizerFactory, but you cannot sort on a tokenised field. So the only way I can think of to do this is to:

  1. keep KeywordTokenizerFactory,
  2. remove StopFilterFactory, and
  3. strip the stop words from the content with a regular expression in PatternReplaceFilterFactory (which the current config already uses to strip every character other than a-z, including digits and spaces).

Generally the only stop words you want to ignore for sorting (as opposed to searching) are "A", "AN" and "THE". I'm not very good at regular expressions, but I'm sure this is trivial for many.
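
As a rough, untested sketch of the three steps above, the field type could look something like the following. The regular expression `^(a|an|the)\s+` and the choice of `replace="first"` are my own assumptions, not something from the original config; they only cover a single leading article. The filter has to sit before the existing `([^a-z])` filter, because that one removes the spaces the article pattern relies on.

```xml
<fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- Accent mapping from the original config, applied to the raw text -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <!-- Keep the whole value as a single token so the field stays sortable -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <!-- Assumed pattern: drop one leading "a", "an" or "the" followed by whitespace -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="^(a|an|the)\s+" replacement="" replace="first"/>
    <!-- Original filter: strip everything that is not a-z; must run after the article filter -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^a-z])" replacement="" replace="all"/>
  </analyzer>
</fieldType>
```

With this chain, "A Fox" should end up as the sort key "fox", which would place it between "Charlie" and "Helicopter". A title that is only an article, such as "The", is left alone because nothing follows it. It is worth pasting a few titles into the analysis screen in Solr Admin to confirm the output before reindexing.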


You need to actually add the Stopwords Filter to the chain of parsers. Paste your text to be indexed into the field analyser in Solr Admin and you'll see that the A in A Fox is not being dropped!


Using the analyser mentioned by Eric, I've determined that the stop word filter only grabs exact words that matched, not pieces of a sentence. So, if there's a token of "THE" it will remove it. But, if there's a token of "THE FISH", it won't touch it.

So, is there a way to make this work? I just want to sort on a field, ignoring any stopwords. But the result is a bunch of sentences (or book names).
