How to normalize Lucene scores?

Developer https://www.devze.com 2023-02-18 14:31 Source: web

I need to normalize the Lucene scores between 0 and 1.

For example, a random query returns the following scores...

8.864665
2.792687
2.792687
2.792687
2.792687
0.49009037
0.33730242 
0.33730242 
0.33730242 
0.33730242

What is the highest possible score? 10.0?

Thanks.

You can divide all scores by the maximum score to get scores between 0 and 1.
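As a minimal sketch of that division, assuming the scores have already been read out of the search results into a plain array (the class and method names here are hypothetical and not part of Lucene):

```java
import java.util.Arrays;

// Hypothetical helper, not a Lucene class.
final class ScoreNormalizer {

    // Divides every score by the largest score in the array, so the top hit maps to 1.0.
    // Assumes the scores are non-negative.
    static float[] normalizeByMax(float[] scores) {
        float max = 0f;
        for (float s : scores) {
            max = Math.max(max, s);
        }
        float[] normalized = new float[scores.length];
        if (max <= 0f) {
            return normalized; // empty input or all zeros: nothing to scale
        }
        for (int i = 0; i < scores.length; i++) {
            normalized[i] = scores[i] / max;
        }
        return normalized;
    }

    public static void main(String[] args) {
        // A few of the scores from the question above.
        float[] scores = {8.864665f, 2.792687f, 0.49009037f, 0.33730242f};
        System.out.println(Arrays.toString(normalizeByMax(scores)));
    }
}
```

The first element of the result is always 1.0 for a non-empty list with a positive top score, whatever the raw score was, which is why the normalized values only make sense within a single query.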

However, please note that the normalised scores should be used to compare the results of a single query only. It is not correct to compare the scores (normalised or not) of results from 2 different queries.

There is no good standard way to normalize scores with Lucene. Read ScoresAsPercentages and this explanation.

In your case the highest score is the score of the first result, if the results are sorted by score. But this score will be different for every other query.

See also how-do-i-normalise-a-solr-lucene-score

There is no maximum score in Solr; it depends on too many variables, so it can't be predicted.

You can implement a so-called normalized score (Scores As Percentages), but this is not recommended.

See related links for more details:

Is it possible to set a Solr Score threshold 'reasonably', independent of results returned? (i.e. Is Solr Scoring standardized in any way)

how do I normalise a solr/lucene score?

Remove results below a certain score threshold in Solr/Lucene?

Ordinary normalization only lets you compare the score distributions of different queries (and their retrieved lists). You cannot simply normalize the scores to compare performance between queries. Consider one query whose retrieved documents are all highly relevant and all received the same high score, and another query whose retrieved list consists of barely relevant documents, again all with the same score. No matter which per-query normalization you apply, the normalized scores will be the same for both queries.
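The effect can be illustrated with a small sketch. The two score lists are invented for illustration, and the class name is hypothetical; divide-by-maximum stands in for "any per-query normalization":

```java
import java.util.Arrays;

// Hypothetical demo class; the score values are made up for illustration.
final class CrossQueryDemo {

    // Per-query normalization: divide each score by the maximum of its own list.
    static float[] normalizeByMax(float[] scores) {
        float max = 0f;
        for (float s : scores) {
            max = Math.max(max, s);
        }
        float[] normalized = new float[scores.length];
        for (int i = 0; i < scores.length; i++) {
            normalized[i] = max > 0f ? scores[i] / max : 0f;
        }
        return normalized;
    }

    public static void main(String[] args) {
        float[] highlyRelevant = {9.0f, 9.0f, 9.0f}; // query A: strong matches
        float[] barelyRelevant = {0.2f, 0.2f, 0.2f}; // query B: weak matches

        float[] a = normalizeByMax(highlyRelevant);
        float[] b = normalizeByMax(barelyRelevant);

        // After normalization the two lists can no longer be told apart.
        System.out.println(Arrays.toString(a));
        System.out.println(Arrays.toString(b));
        System.out.println(Arrays.equals(a, b));
    }
}
```

Once each list is scaled by its own maximum, the information about how strong the matches were in absolute terms is gone.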

You need to think of a cross-query factor that brings all the scores to the same level.

For example, you could compute the similarity between the query and the whole index, and combine that score with the document score.

If you want to compare two or more queries, I found a workaround. You can compare your highest-scored document with your query term using the LevenshteinDistance or LuceneLevenshteinDistance (Damerau) class. The result is the similarity between the query term and the result. Do this for each query you want to compare. You now have a way to compare queries: the similarity between each query term and its highest-scored result. You can then choose the query with the highest similarity and use it for further processing.

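    import org.apache.lucene.search.spell.LuceneLevenshteinDistance;

    // Damerau-Levenshtein distance
    LuceneLevenshteinDistance d = new LuceneLevenshteinDistance();

    // getDistance returns a similarity between 0 and 1; 1 means the strings are identical
    float similarity = d.getDistance(queryterm, yourResult);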

I applied a non-linear function to compress the scores of every query.
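The answer does not say which function was used. As one possible sketch, and purely as an assumption, the mapping s / (1 + s) squeezes any non-negative score into the range from 0 (inclusive) to 1 (exclusive); the class and method names are hypothetical:

```java
// Hypothetical helper; s / (1 + s) is an assumed choice, not the function from the answer.
final class ScoreCompressor {

    // Maps a non-negative raw score into [0, 1).
    // The function is monotonic, so the ranking order within a query is preserved.
    static double compress(double score) {
        if (score < 0) {
            throw new IllegalArgumentException("score must be non-negative");
        }
        return score / (1.0 + score);
    }

    public static void main(String[] args) {
        double[] scores = {8.864665, 2.792687, 0.49009037, 0.33730242};
        for (double s : scores) {
            System.out.println(s + " -> " + compress(s));
        }
    }
}
```

Unlike dividing by the maximum, this mapping does not depend on the other results of the query, but the compressed values still inherit whatever the raw scores mean.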