I'm building a site where I want to match people by common interests. I do this by calculating a weight between each pair of users and determining who the best matches are - those with a high weight:
Example:
user 1 with user 2 = weight of 1
user 1 with user 3 = weight of 10
user 1 with user 4 = weight of 20
I want to put the weights in a DB. The problem is that with 500,000 users there are roughly 500,000 x 500,000 / 2 possible pairings, or about 125,000,000,000 entries, in a MySQL DB. It's not realistic to insert that much data, whether into one table or spread across many.
My question is: is there a way to handle so many pairings with weights using another type of DB? I have read about vector-based approaches and similar things, but I don't know enough to evaluate them.
I have checked documentation about:
- NoSQL databases: MongoDB
- Object databases: db4o, Versant
- Graph databases: Neo4j, sones
- Wide column: Hadoop, HBase
- Document store: CouchDB
- Key-value store: Redis, Voldemort
- Grid databases: GigaSpaces
- XML databases
But I don't see a solution in any of these. Has anyone experienced this problem and can give me a hint?
I'm going to go out on a limb and say there is no good solution for the question as posed: there seems to be no way to avoid storing the 125B user/weight values.
Looking at another DB type won't help. You simply can't get around the fact that you have 125B values that need to be stored.
There are a couple of ways around this:
- Find a relationship between users and weights. E.g. if the weight is always equal to the sum of the two user IDs (assuming a user has an ID), then you don't have to store the weights at all.
- Calculate the weights on the fly and do not store them; a sketch of this follows below.
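A minimal sketch of the compute-on-the-fly approach, assuming weights come from counting shared interests; the `interests` mapping and the overlap-based scoring are illustrative stand-ins, not your actual formula:

```python
# Sketch: compute a pair's weight on demand instead of persisting ~125B pairs.
# The interest data and scoring rule below are assumptions for illustration.

interests = {
    1: {"hiking", "jazz", "python"},
    2: {"jazz", "cooking"},
    3: {"hiking", "jazz", "python", "travel"},
}

def weight(user_a: int, user_b: int) -> int:
    """Score a pair by the number of interests they share."""
    return len(interests[user_a] & interests[user_b])

# Best matches for user 1, computed on the fly against candidate users only.
candidates = [u for u in interests if u != 1]
best = sorted(candidates, key=lambda u: weight(1, u), reverse=True)
print(best[:10])
```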
From your explanation I don't think those weights should be stored at all. They are essentially a cache of calculations you have already done. You don't need to store the results, because you can repeat the calculation whenever you need it. You can still store your weights, but bear in mind that it's a cache, and that its entries are eligible for deletion when the cache becomes full. A sketch of this follows below.
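To make the "weights are just a cache" idea concrete, here is a minimal sketch using Python's `functools.lru_cache` as the bounded cache (a shared store such as Redis with expiry would play the same role across servers); `compute_weight` is a hypothetical placeholder for your real calculation:

```python
from functools import lru_cache

def compute_weight(user_a: int, user_b: int) -> float:
    # Placeholder for the real (possibly expensive) matching calculation.
    return 0.0

@lru_cache(maxsize=1_000_000)   # bounded: old entries get evicted, nothing is kept forever
def _cached(a: int, b: int) -> float:
    return compute_weight(a, b)

def weight(user_a: int, user_b: int) -> float:
    # Normalise the pair order so (1, 2) and (2, 1) share one cache entry.
    a, b = sorted((user_a, user_b))
    return _cached(a, b)
```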
BTW, users usually have filters. These filters may automatically ignore 95% of your user base. You can use that to your advantage.
From the question it seems that the structure represents a full mesh, where every user is connected to every other user (500K x (500K - 1) ordered pairs). That sounds very complex, but by making some heuristic assumptions, optimizations may be possible.
Assumption Case 1: Not every user pair will have a weight, which results in a sparse matrix. So why not store only the non-zero weights? A sketch of this idea follows.
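A minimal sketch of the sparse idea, shown here with an in-memory mapping purely for illustration; the same canonical-pair key would work as a composite primary key in a table:

```python
# Sparse weight store: only non-zero weights are kept, keyed by the canonical pair.
sparse_weights: dict[tuple[int, int], float] = {}

def pair_key(user_a: int, user_b: int) -> tuple[int, int]:
    # One canonical key per pair, so (3, 1) and (1, 3) map to the same entry.
    return (user_a, user_b) if user_a < user_b else (user_b, user_a)

def set_weight(user_a: int, user_b: int, w: float) -> None:
    if w != 0:                      # zero weights are simply never stored
        sparse_weights[pair_key(user_a, user_b)] = w

def get_weight(user_a: int, user_b: int) -> float:
    return sparse_weights.get(pair_key(user_a, user_b), 0.0)
```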
Assumption Case 2: I have a strong feeling that the range of weights may be constrained. I don't think there would be 500K different weights; probably more like 500 distinct values. If that is the case, create 500 groups under which the user pairs are stored. Not much of a space saving in itself, but a method of partitioning.
To achieve a real space saving with case 2, eliminate the need to store individual users under these groups. Instead, aggregate the characteristics of interest (a lower bound and an upper bound). To fetch a match for a given user, do the following (a sketch follows these steps):
- Traverse the 500-odd weight groups and fetch the best-fitting lower and upper bounds. You won't know the exact users, but you now know how they map.
- Search the user table for the users who fall within these bounds.
- Carry out your more in-depth analysis on the actual user group returned by step 2.
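A rough sketch of case 2 under these assumptions: each of the ~500 weight buckets stores only aggregated bounds on some numeric interest score rather than the user pairs themselves. The bucket structure and matching rule here are illustrative guesses, not a prescribed design:

```python
from dataclasses import dataclass

@dataclass
class Bucket:
    weight_lo: float   # weight range covered by this bucket
    weight_hi: float
    score_lo: float    # aggregated lower bound of the users' interest score
    score_hi: float    # aggregated upper bound

def best_bucket(buckets: list[Bucket], target_weight: float) -> Bucket:
    # Step 1: pick the bucket whose weight range best fits the match quality you want.
    return min(buckets, key=lambda b: abs((b.weight_lo + b.weight_hi) / 2 - target_weight))

def candidate_users(users: dict[int, float], bucket: Bucket) -> list[int]:
    # Step 2: scan the user table for users whose score falls inside the bucket's bounds.
    return [uid for uid, score in users.items()
            if bucket.score_lo <= score <= bucket.score_hi]

# Step 3 (the deeper analysis) then runs only on this much smaller candidate list.
```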
My assumptions may be wrong, in which case I've just given it a shot.
As long as your design involves storing all the weights for all the combinations, there is no way you can avoid the storage problem. A reasonable space optimization can be achieved only by optimizing the design itself. questzen below suggests some good approaches. The sparse matrix approach might work initially but becomes useless as more and more users get connected. It would be better, for instance, to identify fixed buckets (ranges) of weights instead of absolute weight values.
Alternately, see if you can discard the fully-connected-mesh topology and adopt something like sparsely connected clusters or a hierarchy. If so, each cluster can be given an ID, and you can store a weight for each user within his/her own cluster (a 'degree of belonging') plus a weight for each cluster-to-cluster connection. The weight for the connection from user-1 in cluster-1 to user-2 in cluster-2 could then be derived as a function of the inter-cluster weight and the users' degrees of belonging to their own clusters, as in the sketch below.
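A minimal sketch of that derivation, with an assumed combining function (the product below is just one plausible choice):

```python
# Sketch: derive a user-to-user weight from cluster-level data instead of storing it.
# Storage drops from O(users^2) to O(users + clusters^2). All values are illustrative.

user_cluster = {1: "A", 2: "B"}                  # which cluster each user belongs to
belonging = {1: 0.9, 2: 0.6}                     # each user's "degree of belonging"
cluster_weight = {("A", "B"): 12.0, ("A", "A"): 20.0, ("B", "B"): 15.0}

def derived_weight(u1: int, u2: int) -> float:
    c1, c2 = user_cluster[u1], user_cluster[u2]
    key = (c1, c2) if (c1, c2) in cluster_weight else (c2, c1)
    # Assumed combining function: inter-cluster weight scaled by both memberships.
    return cluster_weight[key] * belonging[u1] * belonging[u2]

print(derived_weight(1, 2))   # 12.0 * 0.9 * 0.6 = 6.48
```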
I think this is a very simple yet interesting question, especially if you can't use any tricks to reduce the number of stored weights. Ultimately, you have key-value pairs where the keys are made out of pairs of users. As long as you only want to retrieve individual weights when given pairs of users, you can use sharding.
If your data isn't changing often and you have multiple computers to work with, then you should be able to implement your own simple sharding strategy or use Gizzard to manage a simple cluster with a compatible key-value datastore on each computer. (Gizzard requires all operations to be commutative and idempotent.)
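A sketch of the routing idea itself, independent of Gizzard: hash the canonical user pair to decide which node's key-value store holds that weight. The shard count and key format are assumptions:

```python
import hashlib

NUM_SHARDS = 64   # e.g. 64 key-value stores spread over your machines (assumed)

def shard_for(user_a: int, user_b: int) -> int:
    a, b = sorted((user_a, user_b))       # canonical pair order
    key = f"{a}:{b}".encode()
    # Stable hash (don't use Python's built-in hash(): it varies between processes).
    return int.from_bytes(hashlib.sha1(key).digest()[:4], "big") % NUM_SHARDS

# Every get/put for the pair (123, 456) is routed to the same shard:
print(shard_for(123, 456) == shard_for(456, 123))   # True
```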
Are you willing to build a solution from scratch?
If you're up to it, maybe you should create 500,000 files, one per user, and store 500,000 fixed-length weights in each file, ordered by user ID. You could then seek to the exact position you need in the relevant file and read out the value, without using delimiters or actually storing the user IDs. (If your user IDs are not the numbers 1-500,000, you will also need a mapping from each user ID to a new ID in that range, and you would order the records by this new ID.) A sketch follows below.
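A sketch of that layout, using byte-aligned 2-byte records for simplicity (the tighter 10-bit packing discussed next applies the same idea); the file naming and the 1-to-500,000 ID assumption are illustrative:

```python
import struct

RECORD_SIZE = 2   # bytes per weight (uint16 here for simplicity)

def read_weight(user_a: int, user_b: int) -> int:
    """Read the weight of (user_a, user_b) from user_a's file without any index:
    the record position is implied by user_b's ID (assumed to run from 1 to 500,000)."""
    with open(f"weights_{user_a}.bin", "rb") as f:     # one file per user (assumed name)
        f.seek((user_b - 1) * RECORD_SIZE)             # jump straight to the record
        return struct.unpack("<H", f.read(RECORD_SIZE))[0]
```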
What kind of granularity do you need on your weights?
You can round each weight to the nearest multiple of n/(2^k) that fits your needs. For 3 decimal places of precision you could store each number in 10 bits (k = 10). This way each file would only be 500,000 * 10 bits = 625 KB, and the whole dataset would be 312.5 GB. You could even compress the files and only unzip them when needed, depending of course on the trade-offs you're willing to make between speed and space. This solution also assumes that changes are rarely made and you're only retrieving one value at a time (or some sort of range of values).
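The quantization step as a sketch: map each weight onto one of 2^10 levels so it fits in 10 bits; the actual bit-packing of the records into a file is left out here:

```python
K = 10
LEVELS = 2 ** K - 1        # 1023 distinct values, roughly 3 decimal places of precision

def quantize(w: float, w_max: float) -> int:
    """Map a weight in [0, w_max] onto a 10-bit integer."""
    return round(w / w_max * LEVELS)

def dequantize(q: int, w_max: float) -> float:
    return q / LEVELS * w_max

# Size check: 500,000 weights * 10 bits = 625,000 bytes per file (~625 KB),
# and 500,000 such files come to ~312.5 GB in total.
```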
The problem does not exist, in my opinion, since it is not realistic that a single person knows 500K people. Maybe one person is known by 500,000 people (Lady Gaga, for example), but that person probably knows only a tiny fraction of them personally.
A realistic lifetime average for social networks is probably around 300 connections per person, so you have "only" 150-200 million relations.
I would go with a graph DB, since graph databases make it quite easy to model these relationships.