
Solr filter or tokenizer to make combinations of words

Developer https://www.devze.com 2023-04-08 01:02 Source: Internet
I'm trying to implement a reasonable name-suggest feature using a series of filters. At the moment I have:

    <fieldType name="suggester" class="solr.TextField" positionIncrementGap="1" autoGeneratePhraseQueries="true">
        <analyzer type="index">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
                    catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
            <filter class="solr.ASCIIFoldingFilterFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.ShingleFilterFactory" outputUnigramsIfNoShingles="true" maxShingleSize="2"
                    outputUnigrams="true"/>
            <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
                    catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
            <filter class="solr.ASCIIFoldingFilterFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.ShingleFilterFactory" outputUnigramsIfNoShingles="true" maxShingleSize="2"
                    outputUnigrams="true"/>
            <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15"/>
        </analyzer>
    </fieldType>

This certainly needs more tuning, but this question is about one particular aspect. For the input string mark daniel sievers, the above matches queries on mark and on sievers. What I really want is to reduce the verbosity of the EdgeNGramFilter, because it causes overmatching, and to use a filter or tokenizer that can combine words in some configurable manner. For example, for the input mark daniel rex sievers it would create the tokens mark sievers, mark daniel sievers, mark rex sievers and so on. I haven't applied any particular algorithm to that, but I'm wondering whether such a beast already exists (it almost certainly does), or whether it's best to write my own as a filter plugin?

I'm using Solr 3.3.0.
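
To make the kind of token combination asked about above concrete, here is a minimal Python sketch. It is not a Solr filter and not an algorithm the question specifies: the function name `ordered_combinations` and the rule "every order-preserving selection of two or more words" are assumptions, chosen only because they reproduce the examples mark sievers, mark daniel sievers and mark rex sievers.

```python
from itertools import combinations


def ordered_combinations(text, min_words=2):
    """Return every selection of at least min_words words, keeping input order.

    Hypothetical helper for illustration; the combination rule is an assumption.
    """
    words = text.lower().split()
    result = []
    for size in range(min_words, len(words) + 1):
        # combinations() yields tuples in the original word order
        for combo in combinations(words, size):
            result.append(" ".join(combo))
    return result


tokens = ordered_combinations("mark daniel rex sievers")
print(tokens)
```

The number of tokens grows quickly with the number of words, so a real filter along these lines would probably need a cap on the input length or on the combination size.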


I'd use a ShingleFilter: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ShingleFilterFactory

For example:

<filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>

Input: mark daniel sievers

Tokens produced: mark, mark daniel, mark daniel sievers, daniel, daniel sievers, sievers
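
The token list above can be reproduced with a small Python sketch that mimics shingling over whitespace-separated words. This is only an approximation for illustration, not Lucene's implementation: the function `shingles` and its parameters are hypothetical names modelled on the `maxShingleSize` and `outputUnigrams` attributes, and the order of the returned tokens is a choice made here.

```python
def shingles(text, max_shingle_size=3, output_unigrams=True):
    """Emit runs of adjacent words up to max_shingle_size words long.

    Hypothetical helper that only approximates ShingleFilter's behaviour.
    """
    words = text.split()
    tokens = []
    for start in range(len(words)):
        for size in range(1, max_shingle_size + 1):
            if size == 1 and not output_unigrams:
                continue  # skip single words when unigrams are disabled
            if start + size > len(words):
                break  # not enough words left for a shingle of this size
            tokens.append(" ".join(words[start:start + size]))
    return tokens


print(shingles("mark daniel sievers"))
```

Shingles are built from adjacent words only, so this sketch never yields a token that skips a word, such as mark sievers.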
