Similarity Between Users Based On Votes_问答_开发者

lets say i have a set of users, a set of songs, and a set of votes on each song:

=========== =========== =======
User        Song        Vote
=========== =========== =======
user1       song1       [score]
user1       song2       [score]
user1       song3       [score]
user2       song1       [score]
user2       song2       [score]
user2       song3       [score]
user3       song1       [score]
user3       song2       [score]
user3       song3     开发者_Python百科  [score]
user-n      song-n      [score]
=========== =========== =======

whats the most efficient way to calculate user similarity based on song-votes? is there a better way than iterating over every user and every vote for every song?

There are two common metrics that can be used to find similarities between users:

Euclidean Distance, that is exactly what you are thinking: imagine a n-dimensional graph that has for each axis a song that is reviewed by two involved users (u1 and *u2) and the value on its axis is the score. You can easily calculate similarity using the formula:

for every song reviewed by u1 and u2, calculate pow(u1.song.score - u2.song.score, 2) and add all together into sum_of_powers. Similarity coefficient is then given by 1 / 1 + (sqrt(sum_of_powers)).
Pearson Correlation (or correlation coefficient): it's a better approach that finds how much two data sets are related one with another. This approach uses more complex formulas and a little of statistics background, check it here: wiki. You will have a graph for every couple of users, then you plot points according to scores.. for example if aSong has been voted 2 from u1 and 4 from u2 it will plot the point (2,4) (assuming that user1 is x-axis and u2 is y-axis).

Just to clarify, you use linear regression to find two coefficients A and B, that describe the line that minimizes the distance from all the points of the graph. This line has this formula:y = Ax + B. If two sets are similar points should be near to the main diagonal so A should tend to 1 while B to 0. Don't assume this explaination as complete or as a reference because it lacks soundness and typical math formalism, it just to give you an idea.

EDIT: like written by others, more complex algorithms to cluster data exist, like k-means but I suggest you to start from easy ones (actually you should need something more difficult just when you realize that results are not enough).

I recommend the book Programming Collective Intelligence from Toby Segaran. Chapter 3 describes different clustering methods like Hierarchical Clustering and K-means Clustering.

The source code for the examples is available here

If you want the most accurate results, then no, you'd have to iterate over everything.

If your database is large enough, you could just take a statistical sampling, say taking between 1,000 -10,000 users and matching against that.

You would also be better off to add some more tables to the database, store the results, and only update it every so often, instead of calculating this on the fly.

Ilya Grigorik did a series on recommendation algorithms, though he was focusing on Ruby. It appears to be under the machine learning section in his archives, but there isn't a direct section link.

If you want to do it in a approximate way without visiting all the records, you can use the Jaccard Coefficient. Probably needs some adaptation if you want to consider the scores. But I guess that's the best solutions if your system is too big and you don't have the time to check all the records.

I think a lot of people on here are missing the simplicity of the question. He didn't say anything about creating a rating prediction system. He just wants to compute the similarity between each user's song rating behavior and each other user's song rating behavior. The Pearson correlation coefficient gives exactly that. Yes, you must iterate over every user/user pair.

EDIT:

After thinking about this a little more:

Pearson is great if you want the similarity between two users' tastes, but not their level of "opinionatedness"... one user who rates a series of songs 4, 5, and 6 will correlate perfectly with another user who rates the same songs 3, 6, and 9. In other words, they have the same "taste" (they would rank the songs in the same order), but the second user is much more opinionated. In other other words, the correlation coefficient treats any two rating vectors with a linear relationship as equal.

However, if you want the similarity between the actual ratings the users gave each song, you should use the root mean squared error between the two rating vectors. This is a purely distance based metric (linear relationships do not play into the similarity score), so the 4,5,6 and 3,6,9 users would not have a perfect similarity score.

The decision comes down to what you mean by "similar"...

That is all.

You should be able to find a good algorithm in this book: The Algorithm Design Manual by Steven Skiena.

The book has a whole bunch of algorithms for various purposes. You want a graph clustering algorithm, I think. I don't have my copy of the book handy, so I can't look it up for you.

A quick Google search found a Wikipedia page: http://en.wikipedia.org/wiki/Cluster_analysis Perhaps that will help, but I think the book explains algorithms more clearly.