I have an application where users can select a variety of interests from around 300 possible interests. Each selected interest is stored in a join table containing the columns user_id and interest_id.
Typical users select around 50 interests out of the 300.
I would like to build a system where users can find the top 20 users that have the most interests in common with them.
Right now I am able to accomplish this using the following query:
SELECT i2.user_id, count(i2.interest_id) AS count
FROM interests_users as i1, interests_users as i2
WHERE i1.interest_id = i2.interest_id AND i1.user_id = 35
GROUP BY i2.user_id
ORDER BY count DESC LIMIT 20;
However, this query takes approximately 500 milliseconds to execute with 10,000 users and 500,000 rows in the join table. All indexes and database configuration settings have been tuned to the best of my ability.
I have also tried avoiding the use of joins altogether using the following query:
select user_id,count(interest_id) count
from interests_users
where interest_id in (13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,508)
group by user_id
order by count desc
limit 20;
But this one is even slower (~800 milliseconds).
How could I best reduce the time needed to gather this kind of data to below 100 milliseconds?
I have considered putting this data into a graph database like Neo4j, but I am not sure if that is the easiest solution or if it would even be faster than what I am currently doing.
The code you posted as your answer is incorrect. By storing the counts in a hash, you will be forgetting many users, since you only keep one user per total. If two users have the same interests (or at least the same number of matching interests with the current user), for instance, then your t variable will be the same and the first user looked at will be overwritten by the second.
Here is a correct version of the code you posted as an answer. It is shorter and more idiomatic, and should be faster too. Note that I've used true and false instead of 1 and 0.
USERS_COUNT = 10_000
INTERESTS_COUNT = 500

users = Array.new(USERS_COUNT) { rand(100000)+100000 }
table = Array.new(INTERESTS_COUNT) do
  Array.new(USERS_COUNT) { rand(10) == 0 }
end

s = Time.now

cur_user = 0
cur_interests = table.each_index.select{|i| table[i][cur_user]}
scores = Array.new(USERS_COUNT) do |user|
  nb_match = cur_interests.count{|i| table[i][user] }
  [nb_match, users[user]]
end
scores.sort!

puts Time.now.to_f - s.to_f
BTW, you could squeeze a bit more performance by transposing the table, which would avoid half of the lookups.
What you are talking about is called clustering.
Clustering is a hard problem, and computing it on the fly requires more resources than we can afford to spare, I am afraid, because a full computation is O(N²).
I think looking for ideas down this road is unlikely to yield any result (I may be wrong) because of the inherent complexity of the issue.
However, we don't have to recompute everything from scratch each time. I haven't been able to figure out a reasonable way to maintain an evolving picture and update it incrementally.
I can however figure out how to cache the result!
UserId* | LinkedUserId* | Count
35 | 135 | 47
35 | 192 | 26
(One index on UserId and another on LinkedUserId; the uniqueness constraint is that there should never be two rows with the same UserId/LinkedUserId pair.)
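As a minimal sketch, assuming MySQL and purely illustrative names, that cache table could look like:

CREATE TABLE user_similarity_cache (
  user_id        INT NOT NULL,
  linked_user_id INT NOT NULL,
  match_count    INT NOT NULL,
  PRIMARY KEY (user_id, linked_user_id),   -- enforces the UserId/LinkedUserId uniqueness
  KEY idx_linked_user (linked_user_id)     -- second index, used when invalidating by linked user
);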
Whenever you need to get the group for this user, first consult the cache table.
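With the illustrative table above, fetching the group is then a single indexed read, something like:

SELECT linked_user_id, match_count
FROM user_similarity_cache
WHERE user_id = 35
ORDER BY match_count DESC
LIMIT 20;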
Now, we also need to invalidate some cache entries from time to time: each time a user adds or removes an interest, it potentially affects all the users linked to her.
When a user adds an interest, invalidate all the cache lines of users who have this interest.
When a user removes an interest, invalidate all the cache lines of users linked to her (both cases are sketched below).
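Roughly, and still using the illustrative names above (:u and :i stand for the user and the interest involved), the two invalidation cases could be written as:

-- user :u adds interest :i: drop the cache lines of every user who has that interest
DELETE FROM user_similarity_cache
WHERE user_id IN (SELECT user_id FROM interests_users WHERE interest_id = :i);

-- user :u removes an interest: drop the cache lines of users linked to her
-- (and her own lines, since her counts change as well)
DELETE FROM user_similarity_cache
WHERE linked_user_id = :u OR user_id = :u;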
Honestly, I am not sure it would perform better.
SELECT DISTINCT TOP 20 b.user_id, SUM(1) OVER (PARTITION BY b.user_id) AS match
FROM interests_users a
LEFT OUTER JOIN interests_users b ON a.interest_id = b.interest_id AND b.user_id <> 35
WHERE a.user_id = 35 AND b.user_id IS NOT NULL
ORDER BY 2 DESC
If you build good indexes, you should be fine.
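For example, composite indexes covering both column orders are typically what this self-join relies on (index names are just placeholders):

CREATE INDEX idx_iu_user_interest ON interests_users (user_id, interest_id);
CREATE INDEX idx_iu_interest_user ON interests_users (interest_id, user_id);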
I was actually able to basically get what I was looking for by doing this in pure Ruby.
First I create a two dimensional array where each column is a user and each row is an interest. Each value in the array is a 0 or a 1 depending on whether the user in that column has that interest. This array is stored in memory with functions to add or modify rows and columns.
Then, when I want to find the users with interests similar to the current user's, I sum each user's column over just the rows where the current user's value is 1. This means I need to iterate through 10,000 columns and run an average of 50 addition operations per column, followed by a single sorting operation at the very end.
You might guess that this takes a very long time, but it's actually about 50-70 milliseconds on my machine (Core 2 Duo, 3 GHz, Ruby 1.9.1), and about 110 milliseconds on our production servers. The nice thing is that I don't even need to limit the result set.
Here is the Ruby code I used to test my algorithm.
USERS_COUNT = 10_000
INTERESTS_COUNT = 500

users = []
0.upto(USERS_COUNT) { |u| users[u] = rand(100000)+100000 }

a = []
0.upto(INTERESTS_COUNT) do |r|
  a[r] = []
  0.upto(USERS_COUNT) do |c|
    if rand(10) == 0 # 10% chance of picking an interest
      a[r][c] = 1
    else
      a[r][c] = 0
    end
  end
end

s = Time.now

countable_rows = []
a.each_index { |i| countable_rows << i unless a[i][0].zero? }

b = {}
0.upto(USERS_COUNT) do |c|
  t = 0
  countable_rows.each { |r| t += a[r][c] }
  b[t] = users[c]
end
b = b.sort {|x,y| y[0] <=> x[0] }

puts Time.now.to_f - s.to_f
The first few lines are used to create a simulated two dimensional array. The rest of the program runs the algorithm as I described above.
The algorithm above scales reasonably well for a while. Obviously it would not be suitable for 50,000+ users, but since our product segments communities into smaller groups this method works quite well (And much faster than SQL).
Any suggestions on how it could be tuned for even better performance are welcome.