
how to configure solr / lucene to perform levenshtein edit distance searching?

https://www.devze.com 2023-01-13 20:39 Source: web

i have a long list of words that i put into a very simple SOLR / Lucene database. my goal is to find 'similar' words from the list for single-term queries, where 'similarity' is specifically understood as (damerau) levenshtein edit distance. i understand SOLR provides such a distance for spelling suggestions.

in my SOLR schema.xml, i have configured a field type string:

<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

which i use to define a field

<field name='term' type='string' indexed='true' stored='true' required='true'/>

i want to search this field and have results returned according to their levenshtein edit distance. however, when i run a query like webspace~0.1 against SOLR with debugging and explanations on, the report shows that a whole bunch of considerations went into calculating the scores, e.g.:

"1582":"
1.1353534 = (MATCH) sum of:
  1.1353534 = (MATCH) weight(term:webpage^0.8148148 in 1581), product of:
    0.08618848 = queryWeight(term:webpage^0.8148148), product of:
      0.8148148 = boost
      13.172914 = idf(docFreq=1, maxDocs=386954)
      0.008029869 = queryNorm
    13.172914 = (MATCH) fieldWeight(term:webpage in 1581), product of:
      1.0 = tf(termFreq(term:webpage)=1)
      13.172914 = idf(docFreq=1, maxDocs=386954)
      1.0 = fieldNorm(field=term, doc=1581)

clearly, for my application, term frequencies, idfs and so on are meaningless, as each document only contains a single term. i tried to use the spelling suggestions component, but didn't manage to make it return the actual similarity scores.
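
For reference, the numbers in the explain output above appear to combine as plain products: queryWeight is boost × idf × queryNorm, fieldWeight is tf × idf × fieldNorm, and the final score is their product. The short Python sketch below only redoes that arithmetic with the values copied from the report; the variable names are illustrative and not part of any Solr API.

```python
# Values copied from the explain output for document "1582".
boost = 0.8148148
idf = 13.172914
query_norm = 0.008029869
tf = 1.0
field_norm = 1.0

# queryWeight and fieldWeight are each listed as a "product of" their children.
query_weight = boost * idf * query_norm
field_weight = tf * idf * field_norm

# The top-level weight is the product of the two.
score = query_weight * field_weight

print(f"queryWeight = {query_weight:.8f}")
print(f"fieldWeight = {field_weight:.6f}")
print(f"score       = {score:.7f}")
```

None of these factors is the edit distance itself, which is why the score is not usable as a similarity value here.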

can anybody provide hints how to configure SOLR to perform levenshtein / jaro-winkler / n-gram searches with scores returned and without additional stuff like tf, idf, boost and so on included? is there a bare-bones configuration sample for SOLR somewhere? i find the number of options truly daunting.


If you're using a nightly build, then you can sort results based on levenshtein distance using the strdist function:

q=term:webspace~0.1&sort=strdist("webspace", term, edit) desc

More details here and here


Solr/Lucene doesn't appear to be a good fit for this application. You are likely better off with the SimMetrics library. It offers a comprehensive set of string-distance calculators, including Jaro-Winkler, Levenshtein etc.
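
To illustrate what computing the distance outside Solr could look like, here is a minimal Python sketch. It does not use SimMetrics (a Java library); it is a hand-written dynamic-programming Levenshtein distance, without the Damerau transposition step, applied to a small made-up word list. The function names and the sample words are assumptions for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute; no transpositions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def rank_by_distance(query, words, limit=10):
    """Return (distance, word) pairs, closest first."""
    scored = [(levenshtein(query, word), word) for word in words]
    scored.sort()
    return scored[:limit]


# Hypothetical word list standing in for the indexed terms.
words = ["webpage", "website", "workspace", "aerospace", "webspace"]
for distance, word in rank_by_distance("webspace", words):
    print(distance, word)
```

This scans every word for every query, which is fine for experiments but may be too slow for a list of several hundred thousand terms without some form of candidate pre-filtering.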


how to configure SOLR to perform levenshtein / jaro-winkler / n-gram searches with scores returned and without additional stuff like tf, idf, boost and so on included?

You've got some solutions for how to obtain the desired results, but none actually answers your question.

q={!func}strdist("webspace",term,edit) will overwrite the default document scoring with the Levenshtein distance and q={!func}strdist("webspace",term,jw) does the same for Jaro-Winkler.
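
Since the {!func} syntax contains characters that need URL encoding, a small helper for building the request may be useful. The Python sketch below only constructs the URL and sends nothing; the host, port and handler path are assumptions about a local default install, and the helper name is hypothetical. The fl and rows parameters are added so that the score comes back alongside the term.

```python
from urllib.parse import urlencode

# Assumed location of the select handler; adjust to the actual install.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def strdist_query_url(word, field="term", measure="edit", rows=10):
    """Build a select URL whose score is strdist(word, field, measure)."""
    if '"' in word or "\\" in word:
        # Quoting rules inside function queries are not handled in this sketch.
        raise ValueError("word must not contain quotes or backslashes")
    params = {
        "q": f'{{!func}}strdist("{word}",{field},{measure})',
        "fl": f"{field},score",  # return the term together with its score
        "rows": rows,
    }
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


print(strdist_query_url("webspace"))                # Levenshtein-based score
print(strdist_query_url("webspace", measure="jw"))  # Jaro-Winkler-based score
```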

The sorting suggested above will work fine in most cases but it doesn't change the scoring function, it just sorts the results obtained with the scoring method you want to avoid. This might lead to different results and the order of the groups might not be the same.

To see which approach fits best, adding &debugQuery=true to the request might be enough.
