I have a directed graph with millions of vertices and edges. A set of vertices are given, let's assume that they are called "START_POINTS". Another set of vertices, called "END_POINTS" are also given. The problem is to find which END_POINTS can be reached from which START_POINTS.
Here is an example:
START_POINTS: S1 S2 S3 S4 S5 S6 S7 ...
END_POINTS : E1 E2 E3 E4 E5 E6 E7 ...
The algorithm should be able tell the following:
S1 can reach to E1, E2, E6
S2 can reach to E9, E10
S3 cannot reach any END_POINT
S4 can reach to .....
....
Some of the END_POINTS might not be reached from any START_POINT.
Now, the question is: What is the most efficient way to implement it?
I tried starting from each START_POINT one-by-one and finding the reachable END_POINTS using depth-first search (or BFS, it does change the run-time much). However, it takes a lot of time because there are so many START_POINTS (there are also a lot of END_POINTS).
The search can be optimized because there is a huge overlap between the traced paths of START_POINTS. One needs to remember which paths can reach which END_POINTS. What is the most efficient way to accomplish this? This might be well-known problem but I could not find a solution yet.
EDIT on Jan 6:
I tried to implement highBandWidth's idea (in a开发者_运维技巧 way similar to what Keith Randall proposed) : For each node, if this node is not START or END point, connect all of inputs to outputs, then remove the node.
foreach NODE in NODES
Skip if NODE is START_POINT or END_POINT
foreach OUTPUT_NODE of NODE
Disconnect NODE from INPUT_NODE
end
foreach INPUT_NODE of NODE
Disconnect NODE from INPUT_NODE
foreach OUTPUT_NODE of NODE
Connect INPUT_NODE to OUTPUT_NODE
end
end
Remove NODE from NODES
end
This starts very fast and quickly becomes very slow, mainly because the input/output counts of remaining nodes get very large and nested for loops kills the performance. Any idea how it can be made more efficient?
Just prune the graph to get rid of all nodes that don't appear in Start Nodes or End Nodes, replacing them by edges from their inbound nodes to their outbound targets. Once that is done, go through all the other nodes (that are Start or End nodes), and add edges from their inbound nodes to their outbound nodes, without deleting these nodes. In a couple of Djikstra-like iterations, you should be left with the Start to End edges.
This might be overkill but you might want to check out Dijkstra. I used it in the past when creating my own routing table of virtual nodes. In this case all your nodes would have a value of 1, meaning each node costs the same travel wise.
First, run a strongly connected components algorithm. Then contract all the connected components to a single node. Then perform a topological sort of the graph. Then, in a single pass, you can compute which start nodes can reach each node in the graph (initialize each start node s to the set {s}, then perform union of the incoming edges at each node in topological order).
You have a potential problem in that the answer could be as large as # start nodes * # end nodes, which can be big. Hopefully you have some big SCCs that will contract to single nodes, as this could make the answer much more concise (all start nodes in the same SCC can reach the same places, so you need only use one representative in your sets).
精彩评论