Quickest algorithm for finding sets with high intersection_问答_开发者

I have a large number of user IDs (integers), potentially millions. These users all belong to various groups (sets of integers), such that there are on the order of 10 million groups.

To simplify my example and get to the essence of it, let's assume that all groups contain 20 user IDs.

I want to find all pairs of integer sets that have an intersection of 15 or greater.

Should I compare every pair of sets? (If I keep a data structure that maps userIDs to set membership, this would not be necessary.) What is the quickest way to do this? That is, what should my underlying data structure be for representing the integer sets? Sorted sets, unsorted---can hashing somehow help? And what algori开发者_C百科thm should I use to compute set intersection)? I prefer answers that relate C/C++ (especially STL), but also any more general, algorithmic insights are welcome.

Update Also, note that I will be running this in parallel in a shared memory environment, so ideas that cleanly extend to a parallel solution are preferred.

Also, note that the vast majority of set-pairs will have an intersection size of 0---meaning that it might be advantageous to use a data-structure that mapped user IDs to sets to avoid calculating the intersection of every pair of sets.

I would do exactly what you propose: map users to their group. That is, I would keep a list of group ids for every user. Then I would use the following algorithm:

foreach group:
  map = new Map<Group, int>  // maps groups to count
  foreach user in group:
    foreach userGroup in user.groups:
      map[userGroup]++
      if( map[userGroup] == 15 && userGroup.id > group.id )
        largeIntersection( group, userGroup )

Given you have G groups each containing U users on average, and given that these users belong to g groups on average, then this will run in O( G*U*g ). Which, given your problem, is probably much faster than the naive pairwise comparison of groups which runs in O(G*G*U).

If the vast majority of intersections are 0, that means the number of non-empty intersections is relatively small. Give this a try:

Throw away all sets of size <15 before you start
Calculate your lookup from userid -> list of sets to which it belongs
Create a map<pair<userset, userset>, int>
For each user, increment (after creating if necessary), n*(n-1)/2 entries of that map, where n is the number of sets to which the user belongs.
When that's finished, scan the map for entries where the value is greater than 15.

It will use more memory than the simple approach of computing every intersection. In fact it will run up against what's feasible: if each set on average intersects with just 10 others, perhaps in very small intersections, then the map needs 50M entries, which is starting to be a lot of RAM. It's also woefully cache-unfriendly.

It might be faster than doing all the set-intersections, because the O(n^2) terms relate to the number of non-empty intersections and the number of groups to which each user belongs, rather than to the number of sets.

Parallelizing isn't trivial, because of the contention on the giant map. However, you can shard that into a map for each thread, and periodically give one thread a new, empty, map and add the results-so-far into the total results. The different threads then run completely independently most of the time, each given a list of users to process.

Should I compare every pair of sets? (If I keep a data structure that maps userIDs to set membership, this would not be necessary.)

In order to count the degree of intersection, you still need to visit the other groups the user has, which is still cubic. You could have a hash table or other sparse array to count, but it would still at best require an increment for each user for each pair of groups each user is in. If you have N users in G groups with an average of S users per group and T number of groups each user is in, you've got GGS/2 for comparing each pair of groups and NTT if you have an index of user to group. T = GS/N, so NTT=GGSS/N; for S=20 and N in the millions there should be an advantage. Unfortunately, you also need at least G*G storage for the intersection count (25 TB or so for a 4 bit non-sparse counter), and have to ensure the structure can be incremented in parallel.

For a million users in 10 million groups of 20, very approximately the probability that a user is in a given group is 2e-6, and the probability that two groups will share users will be 40e-6, so the 25TB comes down to 1GB data, so not impossible for a sparse array on an ordinary sized computer.

However, comparing set of 20 elements for 15 elements in common has more obvious optimisations

If the groups are sorted, you require no working storage, just output the degree of difference between the input groups directly.
Most of the memory accesses will be linear in contiguous memory areas, and the results are dependent only on the two sets being compared, rather than relying on summing across the whole data set. Accessing main memory randomly is significantly slower than accessing it linearly. Altering main memory randomly with bus locks is orders of magnitude slower than accessing cache without the need to lock the bus (though if you have a couple of GB per core you can use the user->group approach without having to do any synchronisation).
Only need to count 5 elements which are distinct between the sets; if the data is random then most sets will be disjoint, so the average number of elements visited is smaller.
You can quickly discount certain groups by treating the difference as a distance (if A is 11 different from B, and C is 5 different from B, then C is between 6 and 16 different from A, so can be discounted without comparing A and C directly). Since most the sets are completely disjoint, this won't gain you much though.

There is also the option of a hybrid approach, using the user->group map to prune the set of group to group comparisons to be made. This has the advantage of not requiring incrementing of a shared data structure:

for each pair of groups that a user is in, add that pair to the list to investigate.
sort the list of pairs of groups with at least one user in common.
the number of times each pair occurs in the list is the number of users they have in common.

Using merge sort, this is very each to parallelise into pure streaming units. You would be sorting around 20*200*10 million/2 = 20 billion group ID pairs (each group of 20 users times the number of groups each user is in/2).

One way is to see your problem as a metric space radius search problem where the distance function is the number of not matching entries and the radius is r = max(number of elements in sets) - number of equal. Filtering of the found elements is needed to see that enough values are in the Set. So unless someone comes up with a metric function that can be used directly this solutions has to much constraints.

One of the data structures for metric search is the BK-Tree that can be used for string similarity searches.

Candidates for your problem are VP-tree and M-Trees.

The worst case for metric tree is O(n^2) when you're searching for a distance > m (maximum number of elements in the sets) as you build up the tree in O(log n * n) and search in O(n^2).

Apart for that the actual runtime complexity depends on the abbility to prune subtrees of the metric tree while executing the search. In a metric tree a subtree can be skipped if the distance of the pivot element to the search element is larger than the radius of the pivot element (which is at least the maximum distance of a ancestores to the pivot element). If your entry sets are rather disjunct the overall runtime will be dominated by the build-up time of the metric tree O(log n * n).