For a school project, we'll have to implement a ranking system. However, we figured that a dumb rank average would suck: something that one user ranked 5 stars would have a better average than something 188 users ranked 4 stars, and that's just stupid.
So I'm wondering if any of you have an example algorithm of "smart" ranking. It only needs to take into account the rankings given and the number of rankings.
Thanks!
You can use a method inspired by Bayesian probability. The gist of the approach is to have an initial belief about the true rating of an item, and use users' ratings to update your belief.
This approach requires two parameters:
- What do you think is the true "default" rating of an item, if you have no ratings at all for the item? Call this number R, the "initial belief".
- How much weight do you give to the initial belief, compared to the user ratings? Call this W, where the initial belief is "worth" W user ratings of that value.
With the parameters R and W, computing the new rating is simple: assume you have W ratings of value R along with any user ratings, and compute the average. For example, if R = 2 and W = 3, we compute the final score for various scenarios below:
- 100 (user) ratings of 4: (3*2 + 100*4) / (3 + 100) = 3.94
- 3 ratings of 5 and 1 rating of 4: (3*2 + 3*5 + 1*4) / (3 + 3 + 1) = 3.57
- 10 ratings of 4: (3*2 + 10*4) / (3 + 10) = 3.54
- 1 rating of 5: (3*2 + 1*5) / (3 + 1) = 2.75
- No user ratings: (3*2 + 0) / (3 + 0) = 2
- 1 rating of 1: (3*2 + 1*1) / (3 + 1) = 1.75
This computation takes into consideration the number of user ratings, and the values of those ratings. As a result, the final score roughly corresponds to how happy one can expect to be about a particular item, given the data.
Choosing R
When you choose R, think about what value you would be comfortable assuming for an item with no ratings. Is the typical no-rating item actually 2.4 out of 5, if you were to instantly have everyone rate it? If so, R = 2.4 would be a reasonable choice.
You should not use the minimum value on the rating scale for this parameter, since an item rated extremely poorly by users should end up "worse" than a default item with no ratings.
If you want to pick R using data rather than just intuition, you can use the following method:
- Consider all items with at least some threshold of user ratings (so you can be confident that the average user rating is reasonably accurate).
- For each item, assume its "true score" is the average user rating.
- Choose R to be the median of those scores.
If you want to be slightly more optimistic or pessimistic about a no-rating item, you can choose R to be a different percentile of the scores, for instance the 60th percentile (optimistic) or 40th percentile (pessimistic).
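Here is a minimal sketch of that data-driven choice in JavaScript; the chooseR name, the minRatings threshold of 30, and the fallback value of 3 are illustrative assumptions, not part of the method above:
function chooseR(items, minRatings = 30, percentile = 0.5) {
  // average user rating for every item that has enough ratings to be trusted
  const averages = items
    .filter(item => item.ratings.length >= minRatings)
    .map(item => item.ratings.reduce((sum, r) => sum + r, 0) / item.ratings.length)
    .sort((a, b) => a - b);
  if (averages.length === 0) return 3; // fallback: middle of a 1-5 scale
  // percentile = 0.5 is the median; use 0.6 to be optimistic or 0.4 to be pessimistic
  const index = Math.min(averages.length - 1, Math.floor(percentile * averages.length));
  return averages[index];
}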
Choosing W
The choice of W should depend on how many ratings a typical item has, and how consistent ratings are. W can be higher if items naturally obtain many ratings, and W should be higher if you have less confidence in user ratings (e.g., if you have high spammer activity). Note that W does not have to be an integer, and can be less than 1.
Choosing W is a more subjective matter than choosing R. However, here are some guidelines:
- If a typical item obtains C ratings, then W should not exceed C, or else the final score will be more dependent on R than on the actual user ratings. Instead, W should be close to a fraction of C, perhaps between C/20 and C/5 (depending on how noisy or "spammy" ratings are).
- If historical ratings are usually consistent (for an individual item), then W should be relatively small. On the other hand, if ratings for an item vary wildly, then W should be relatively large. You can think of this algorithm as "absorbing" W ratings that are abnormally high or low, turning those ratings into more moderate ones.
- In the extreme, setting W = 0 is equivalent to using only the average of user ratings. Setting W = infinity is equivalent to proclaiming that every item has a true rating of R, regardless of the user ratings. Clearly, neither of these extremes is appropriate.
- Setting W too large can have the effect of favoring an item with many moderately-high ratings over an item with slightly fewer exceptionally-high ratings.
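As a rough, illustrative starting point in JavaScript (the chooseW name and the C/10 ratio are my own assumptions, sitting between the C/20 and C/5 range mentioned above):
function chooseW(typicalRatingsPerItem) {
  // roughly C/10; W may be fractional and should stay well below C
  return typicalRatingsPerItem / 10;
}
For example, chooseW(30) gives 3, which happens to match the defaultW used in the code answer below.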
I appreciated the top answer at the time of posting, so here it is codified as JavaScript:
const defaultR = 2;
const defaultW = 3; // should not exceed the typical number of ratings per item; 0 is equivalent to using only the average of ratings
function getSortAlgoValue(ratings) {
  // pretend there are defaultW extra ratings of value defaultR, then take the average
  const sumOfRatings = ratings.reduce((sum, r) => sum + r, 0);
  return (defaultR * defaultW + sumOfRatings) / (defaultW + ratings.length);
}
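For example, plugging in the scenarios from the answer above (ratings on a 1-5 scale) reproduces the same scores:
getSortAlgoValue(Array(100).fill(4)); // ≈ 3.94
getSortAlgoValue([5, 5, 5, 4]);       // ≈ 3.57
getSortAlgoValue([5]);                // 2.75
getSortAlgoValue([]);                 // 2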
Only listed as a separate answer because the formatting of the code block as a reply wasn't very readable.
Since you've stated that the machine would only be given the rankings and the number of rankings, I would argue that it may be negligent to attempt a calculated weighting method.
First, there are too many unknowns to confirm the proposition that, in enough circumstances, a larger quantity of ratings is a better indication of quality than a smaller number of ratings. One example: how long have rankings been collected? Has there been equal collection duration (equal attention) given to different items ranked with this same method? Others: which markets have had access to this item, and, of course, who specifically ranked it?
Secondly, you've stated in a comment below the question that this is not for front-end use but rather "the ratings are generated by machines, for machines," as a response to my comment that "it's not necessarily only statistical. One person might consider 50 ratings enough, where that might not be enough for another. And some raters' profiles might look more reliable to one person than to another. When that's transparent, it lets the user make a more informed assessment."
Why would that be any different for machines? :)
In any case, if this is about machine-to-machine rankings, the question needs greater detail in order for us to understand how different machines might generate and use the rankings.
Can a ranking generated by a machine be flawed (so as to suggest that more rankings may somehow compensate for those "flawed" rankings)? What does that even mean - is it a machine error? Or is it because the item has no use to this particular machine, for example? There are many issues here we might first want to unpack. For instance, if we have access to how the machines are generating the rankings, then on some level we may already know what meaning this item has for a given machine, making the aggregated ranking superfluous.
What you can find on different platforms is the hiding of ratings for items without enough votes: "This item does not have enough ratings."
The problem is that you can't express this in a single easy formula for calculating a ranking.
I would suggest hiding the ranking for items with fewer than a minimum number of votes, but internally calculating a moving average. I always prefer a moving average over a total average, because it favors recent votes over very old votes, which might have been given under totally different circumstances.
Additionally, you do not need to keep a list of all votes: you just store the calculated average, and the next vote simply updates this value.
newAverage = weight * newVote + (1 - weight) * oldAverage
with a weight of about 0.05 to emphasize roughly the last 20 votes (just experiment with this weight).
Additionally I would start with these conditions:
no votes = mid-range value (1-5 stars => start with 3 stars)
the average is not shown if fewer than 10 votes were given.
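A minimal sketch of this approach in JavaScript; the item shape and the names createItem, addVote, and displayedRating are illustrative, while the constants follow the suggestions above:
const WEIGHT = 0.05;      // moving-average weight; ~0.05 emphasizes roughly the last 20 votes
const MIN_VOTES = 10;     // hide the average below this many votes
const DEFAULT_RATING = 3; // mid-range value on a 1-5 star scale
function createItem() {
  // items with no votes start at the mid-range value
  return { average: DEFAULT_RATING, voteCount: 0 };
}
function addVote(item, newVote) {
  // newAverage = weight * newVote + (1 - weight) * oldAverage
  item.average = WEIGHT * newVote + (1 - WEIGHT) * item.average;
  item.voteCount += 1;
}
function displayedRating(item) {
  // blank out the rating until the item has enough votes
  return item.voteCount < MIN_VOTES ? null : item.average;
}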
A simple solution might be a straight average:
sum(votes) / number_of_votes
That way, 3 people voting 1 star and one person voting 5 would give an average of (1+1+1+5)/4 = 2 stars.
Simple, effective, and probably sufficient for your purposes.