Another interesting interview question that I stumbled upon -
Design the data structures for a very large social network (Facebook, LinkedIn, etc)?
Also, design an algorithm to show the connection, or path, between two people (e.g. me->foo->bar->rob->ron)
I would probably consider an undirected graph of some variety, probably stored as a sparse adjacency matrix. As far as finding the shortest path between two people, since the cost of edges is uniform, I would consider going with a bidirectional search.
Basically, go out in concentric circles centered around each person, where each circle is the person himself, then his friends, then his friends-of-friends, etc., at each step testing if there is any person in both circles. Follow the path from the first person you find inward toward the center of each person, and you've found the shortest path.
You could try other shortest-path algorithms, but in general, most shortest-path algorithms only give you the distance and not the actual path.
You could use a graph database like neo4j
Regarding the algorithm:
I like @sxeraverx answer except the sparse matrix part. An adjency list or simple object graph would be a better choice here. The matrix must allocate memory for each possible connection which is O(n^2) where n is the number of users. A list or object graph will only allocate memory on O(e) where e is the number of connections, which is sparse.
I would use a depth-first search with marking to find the friend. Marking Nodes which you've already traversed is essential since cycles of friends will exist. With a DFS the finding of the path is almost trivial because the stack you're using to execute the DFS is the path. So when you find the friend, you just pop the whole stack and you're done.
A breath first search doesn't have this nice property because the queue used to traverse the graph will have unexplored nodes, so you'll need to keep track of the path using some other structure. A breadth first search might be appropriate if we're expecting the function to be run against people in the same "neighbourhood" of friends and are really concerned about performance.
Another nice property of the DFS is that it can be parallelized. When encountering a new node one can create spawn new DFS processes/threads/whatever to process the nodes children. The new threads must be able to share the marking information through some sort of messaging system. This might be a bit of premature optimization now that I think about it some more. Here's a paper on the subject in case anyone is interested
When we have a large amount of data, we cannot possibly keep all of our data on one machine. That means that for every person we need to store a machine id. We need to take care of the following aspects -
- For each friend ID: machine_id = lookupMachineForUserID(id);
- Go to machine machine_id
- friend = lookupFriend(machine_id);
There can be a lot of optimizations done here. One of them is to reduce the number of jumps from one machine to the other since that is expensive. We can do this by grouping together people belonging to the same country/city together. There are high chances of finding friends in the same city. Similarly there can be other ways to optimize.
I'll try to give a very basic implementation of how our data structures will look like. Ofcourse in reality we have to consider a lot of factors like what if on of the machines goes down, caching data, etc.
public class Server
{
ArrayList<Machine> machines = new ArrayList<Machine>();
}
public class Machine
{
public ArrayList<Person> persons = new ArrayList<Person>();
public int machineID;
}
public class Person
{
private ArrayList<Integer> friends;
private int ID;
private int machineID;
private String info;
private Server server = new Server();
}
I'll try to post the solution to tracing the path between friends later.
I would worry about that it is not possible with a data structure - you possibly talk database sturcture here. Very large is xxx million (100+), and i dont think this can be efficiently dealt with in memory.
Composite Pattern? We may not need to pull up all his "friends-of-friends" so to speak, to the memory.
The DB table design is a different problem
精彩评论