I'm currently dealing with a performance issue in my graph implementation.
Used technologies
It is programmed in C++. For the moment, the graph is implemented with the BGL (Boost Graph Library).
About the graph
The managed graph is dynamic and undirected. It contains two kinds of structures: many complete subgraphs and a few single edges. The only information needed is the direct neighborhood of a vertex.
Problem statement
At the beginning, the complete subgraphs were small (about 10 vertices) and numerous (about 13k). An adjacency list implementation, the BGL's, was perfect. But now it must manage a few subgraphs of 5000 vertices each. That means on the order of 5000x5000 edges. The performance in time and space is now very poor.
Rejected solutions
My first thought was to use the adjacency matrix implementation provided by the BGL. But it doesn't allow dynamic graphs. There are two ways around this constraint: provide a new adjacency matrix implementation that supports dynamic graphs, or use a pool of available vertices in a static graph. On reflection, I don't think either is a good idea: the space complexity is still VxV/2.
Final Solution and question
So, here is my final solution: drop the BGL, implement bags of vertices to represent the complete subgraphs (no edges needed), and directly connect vertices for the few single edges. By doing so, the space complexity of the biggest subgraph falls to its number of vertices, about 5000.
- Do you think this last solution is a good one?
- If not, which implementation should I use, and why?
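To make the idea concrete, here is a minimal sketch of the bag-of-vertices structure (names like `BagGraph` are illustrative, not from any library): each complete subgraph is stored once as a bag of its member vertices, so a clique of N vertices costs O(N) space instead of O(N^2) explicit edges, and the few single edges are stored separately.

```cpp
#include <cstddef>
#include <set>
#include <unordered_map>
#include <vector>

// Illustrative sketch: complete subgraphs as "bags", single edges stored apart.
struct BagGraph {
    std::vector<std::set<std::size_t>> bags;                       // complete subgraphs
    std::unordered_map<std::size_t, std::set<std::size_t>> single; // extra single edges

    std::size_t add_bag(const std::set<std::size_t>& vertices) {
        bags.push_back(vertices);
        return bags.size() - 1;
    }

    void add_edge(std::size_t u, std::size_t v) {
        single[u].insert(v);
        single[v].insert(u);
    }

    // Direct neighborhood of v: the union of every bag containing v
    // (minus v itself), plus the endpoints of single edges at v.
    std::set<std::size_t> neighbors(std::size_t v) const {
        std::set<std::size_t> result;
        for (const auto& bag : bags)
            if (bag.count(v))
                for (std::size_t u : bag)
                    if (u != v) result.insert(u);
        auto it = single.find(v);
        if (it != single.end())
            result.insert(it->second.begin(), it->second.end());
        return result;
    }
};
```

As written, a neighborhood query scans every bag; keeping a vertex-to-bag index alongside would make each query proportional only to the bags the vertex actually belongs to.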
Update 1
More information about the graph: The graph has ~100k vertices, ~13k complete subgraphs of about 3 vertices, and ~100 complete subgraphs of size range [10-5000]. And each edge has bundled properties.
Update 2
I've recently learned, thanks to Salim Jouilli, that the bag of nodes is in fact a form of hypergraph, where a hyperedge consists of a subset of nodes.
Update 3
I've finished implementing the solution. I did indeed gain in memory consumption and runtime: from 6GB to 24MB and from 50min to 2min30.
Your final solution is the one I would have gone for myself. It sounds as efficient as it can be.
If your problem scales up again in the future, it may be worth the effort of moving to a graph database. That way you can decouple storage from business logic and treat them as separate problems.
Having 25M edges is no big deal in general, but I have only used one on rather sparse graphs with lots of nodes (a street network).
If memory use and access time are critical for your needs, try Boost's compressed sparse row graph: http://www.boost.org/doc/libs/1_46_1/libs/graph/doc/compressed_sparse_row.html
It's a bit annoying to use because it requires the edges to be inserted in sorted order, but there is probably no way to be significantly more efficient (by more than a very few percent at most).