I have a table representing a graph: Edges(from, to).
I'd like to query this table with "path queries", retrieving only the source and destination of the path.
Fo开发者_开发问答r example, assume my table consists of the following rows:
+------+----+
| from | to |
+------+----+
| a | b |
| b | c |
| c | d |
| f | g |
| b | f |
| c | a |
+------+----+
Assume I execute the following (pseudo-)query:
SELECT ?v1, ?v2 WHERE ?v1 to ?t1, ?t1 to ?t2, ?t2 to ?v2;
This means I want all the pairs of source and destination that exist in all paths consisting of 4 nodes. Executing this query should return the following results:
+-----+-----+
| ?v1 | ?v2 |
+-----+-----+
| a | a |
| a | g |
| a | d |
+-----+-----+
Of course, paths consisting of a different number of nodes might be needed as well, the number 4 isn't hardcoded :-)
My questions are:
- What's the best way to build such an SQL query (note that I'm using SQLite, so recursive queries can't be used).
- I currently have one index for the from column and one for the to column. Is this optimal? Should I create an index for the "from, to pair as well? Instead?
Assumptions
There are no self-edges (E.G "a - a").
There are no two identical rows.
Thanks in advance!
ad 1.) unless you know in advance that your paths will always be of a given length (or a small set of given lengths), you cannot express your query in pure sql. however, you may choose to incrementally maintain the transitive closure of your graph, in particular if
- changes to your graph are infrequent
- and/or are mostly edge insertions (instead of deletions)
- or mostly occur as bulk changes at times that allow for some preprocessing.
the technique is outlined in a paper by dong et al., doi://10.1.1.103.3314; don't be daunted by the theory and the math, they also provide sql code ready-to-use and their basic ideas are straightforward.
ad 2.)
if maintaining a transitive closure table is an option for you it wpould lend to one index on the pair of columns representing start and end vertices of paths.
if it isn't you might be able exploit the structure of your graph: for average fan-outs that are high (small) in comparison to fan-ins you should be better of with an index on the 'to' ('from') column.
if you cannot make an assumption on fan-out/fan-in ratios you're probably best off with an index on each column.
hope it helps,
best regards, carsten
精彩评论