Maths! Approximating the mean, without storing the whole data set_问答_开发者

Obvious (but expensive) solution:

I would like to store rating of a track (1-10) in a table like this:

TrackID
Vote

And then a simple

开发者_StackOverflow中文版SELECT AVERAGE(Vote) FROM `table` where `TrackID` = some_val

to calculate the average.

However, I am worried about scalability on this, especially as it needs to be recalculated each time.

Proposed, but possibly stupid, solution:

TrackID
Rating
NumberOfVotes

Every time someone votes, the Rating is updated with

new_rating = ((old_rating * NumberOfVotes) + vote) / (NumberOfVotes + 1)

and stored as the TrackID's new Rating value. Now whenever the Rating is wanted, it's a simple lookup, not a calculation.

Clearly, this does not calculate the mean. I've tried a few small data sets, and it approximates the mean. I believe it might converge as the data set increases? But I'm worried that it might diverge!

What do you guys think? Thanks!

Assuming you had infinite numeric precision, that calculation does update the mean correctly. In practice, you're probably using integer types, so it won't be exact.

How about storing the cumulative vote count, and the number of votes? (i.e. total=total+vote, numVotes=numVotes+1). That way, you can get the exact mean by dividing one by the other.

This approach will only break if you get so many votes that you overflow the range of the data type you're using. So use a big data type (32-bit ought to be enough, unless you're expecting ~4 billion votes)!

Store TrackId, RatingSum, NumberOfVotes in your table.

Every time someone votes,

NumberOfVotes = NumberOfVotes + 1
RatingsSum = RatingsSum + [rating supplied by user]

Then when selecting

SELECT TrackId, RatingsSum / NumberOfVotes FROM ...

Your solution is completely legit. and differes only by roughly a few times the floating point precision from a value calculated from the full source set.

You can certainly calculate a running mean and standard deviation without having all the points in hand. You merely need to accumulate the sum, sum of squares, and number of points.

It's not an approximation; the mean and standard deviation are exact.

Here's a Java class that demonstrates. You can adapt to your SQL solution as needed:

package statistics;

public class StatsUtils
{
    private double sum;
    private double sumOfSquares;
    private long numPoints;

    public StatsUtils()
    {
        this.init();
    }

    private void init()
    {
        this.sum = 0.0;
        this.sumOfSquares = 0.0;
        this.numPoints = 0L;
    }

    public void addValue(double value)
    {
        // Check for overflow in either number of points or sum of squares; reset if overflow is detected
        if ((this.numPoints == Long.MAX_VALUE) || (this.sumOfSquares > (Double.MAX_VALUE-value*value)))
        {
            this.init();
        }

        this.sum += value;
        this.sumOfSquares += value*value;
        ++this.numPoints;
    }

    public double getMean()
    {
        double mean = 0.0;

        if (this.numPoints > 0)
        {
            mean = this.sum/this.numPoints;
        }

        return mean;
    }

    public double getStandardDeviation()
    {
        double standardDeviation = 0.0;

        if (this.numPoints > 1)
        {
            standardDeviation = Math.sqrt((this.sumOfSquares - this.sum*this.sum/this.numPoints)/(this.numPoints-1L));
        }

        return standardDeviation;
    }

    public long getNumPoints() { return this.numPoints; }
}

Small improvement on your solution. You have the table:

TrackID
SumOfVotes
NumberOfVotes

When someone votes,

NumberOfVotes = NumberOfVotes + 1
SumOfVotes = SumOfVotes + ThisVote

and to see the average you only then do a division:

SELECT TrackID, (SumOfVotes/NumberOfVotes) AS Rating FROM `table`

I would add that the original (obvious and expensive) solution is only expensive compared to the provied solution when calculating the average. It is cheaper when a vote is added, deleted or changed. I guess that the original table

TrackID
Vote
VoterID

would still need to be used in the provided solution to keep track of the vote (rating) of every voter. SO, two tables have to be updated for every change in this table (insert, delete or Vote update).

In other words, the original solution may be the best way to go.