How can I speed up queries that are looking for the root node of a transitive closure?

Developer https://www.devze.com 2023-02-15 22:00 Source: Internet
I have a historical transitive closure table that represents a tree.

create table TRANSITIVE_CLOSURE
  (
    CHILD_NODE_ID number not null enable,
    ANCESTOR_NODE_ID number not null enable,
    DISTANCE number not null enable,
    FROM_DATE date not null enable,
    TO_DATE date not null enable,
    constraint TRANSITIVE_CLOSURE_PK unique (CHILD_NODE_ID, ANCESTOR_NODE_ID, DISTANCE, FROM_DATE, TO_DATE)
  );

Here's some sample data:

CHILD_NODE_ID | ANCESTOR_NODE_ID | DISTANCE 
--------------------------------------------
1             | 1                | 0
2             | 1                | 1
2             | 2                | 0
3             | 1                | 2
3             | 2                | 1
3             | 3                | 0

Unfortunately, my current query for finding the root node causes a full table scan:

select *
from transitive_closure tc
where 
  distance = 0
  and not exists (
  select null
  from transitive_closure tci
  where tc.child_node_id = tci.child_node_id
    and tci.distance <> 0
);

On the surface, it doesn't look too expensive, but as I approach 1 million rows, this particular query is starting to get nasty... especially when it's part of a view that grabs the adjacency tree for legacy support.

Is there a better way to find the root node of a transitive closure? I would like to rewrite all of our old legacy code, but I can't... so I need to build the adjacency list somehow. Getting everything except the root node is easy, so is there a better way? Am I thinking about this problem the wrong way?

Query plan on a table with 800k rows.

OPERATION                        OBJECT_NAME          OPTIONS       COST
SELECT STATEMENT                                                    2301
  HASH JOIN                                           RIGHT ANTI    2301
    Access Predicates
      TC.CHILD_NODE_ID=TCI.CHILD_NODE_ID
    TABLE ACCESS                 TRANSITIVE_CLOSURE   FULL           961
      Filter Predicates
        TCI.DISTANCE = 1
    TABLE ACCESS                 TRANSITIVE_CLOSURE   FULL           962
      Filter Predicates
        DISTANCE=0


How long does the query take to execute, and how long do you want it to take? (You usually do not want to use the cost for tuning. Very few people know what the explain plan cost really means.)

On my slow desktop the query took only 1.5 seconds for 800K rows, and 0.5 seconds once the data was in memory. Are you seeing something significantly worse, or will this query be run very frequently?

I don't know what your data looks like, but I'd guess that a full table scan will always be best for this query. Assuming that your hierarchical data is relatively shallow, i.e. there are many rows with a distance of 0 or 1 but very few with a distance of 100, the most important column, DISTANCE, will not be very selective. This means that each index entry for a given distance will point to a large number of blocks. It will be much cheaper to read the whole table at once using multi-block reads than to read a large part of it one block at a time.
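One way to check that assumption against the real data is to count the rows per distance. This is only a rough sketch; it uses nothing beyond the table from the question, and the column aliases are made up for illustration:

```sql
-- Rows per DISTANCE value and their share of the whole table.
-- ROW_COUNT and PCT_OF_ROWS are illustrative aliases, not existing columns.
select distance,
       count(*) as row_count,
       round(100 * ratio_to_report(count(*)) over (), 1) as pct_of_rows
from transitive_closure
group by distance
order by distance;
```

If the rows with distance 0 alone make up a large share of the table, an index range scan on DISTANCE would probably end up visiting most of the table's blocks anyway.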

Also, what do you mean by historical? Can you store the results of this query in a materialized view?
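If a materialized view is an option, it might look roughly like the following. The view name ROOT_NODES_MV and the on-demand complete refresh are assumptions for the sake of the example; the right refresh policy depends on how often the closure table changes.

```sql
-- Sketch only: ROOT_NODES_MV is a hypothetical name, and the refresh
-- policy (complete, on demand) is an assumption, not a recommendation.
create materialized view root_nodes_mv
  build immediate
  refresh complete on demand
as
select tc.child_node_id, tc.from_date, tc.to_date
from transitive_closure tc
where tc.distance = 0
  and not exists (
    select null
    from transitive_closure tci
    where tci.child_node_id = tc.child_node_id
      and tci.distance <> 0
  );

-- Re-run the stored query whenever the closure table has changed enough.
begin
  dbms_mview.refresh('ROOT_NODES_MV', 'C');
end;
/
```

The legacy view could then read the root nodes from ROOT_NODES_MV instead of running the anti-join every time, at the price of the data being only as fresh as the last refresh.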

Another possible idea is to use analytic functions. This replaces the second table scan with a sort. This approach is usually faster, but for me this query actually takes longer, 5.5 seconds instead of 1.5. But maybe it will do better in your environment.

select * from
(
    select
        max(case when distance <> 0 then 1 else 0 end)
            over (partition by child_node_id) has_non_zero_distance
        ,transitive_closure.*
    from transitive_closure
)
where distance = 0
    and has_non_zero_distance = 0;


Can you try adding an index on distance and child_node_id, or changing the order of these columns in the existing unique index? I think the outer query could then reach the table through the index on distance, while the inner query would need to read only the index.
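A minimal sketch of that experiment, assuming the index name TRANSITIVE_CLOSURE_DIST_IX is free to use. Whether the optimizer actually picks the index depends on the data distribution and the statistics, so it is worth checking the plan afterwards:

```sql
-- Hypothetical index name; the column order follows the suggestion above.
create index transitive_closure_dist_ix
  on transitive_closure (distance, child_node_id);

-- Re-check the plan of the original query with the new index in place.
explain plan for
select *
from transitive_closure tc
where tc.distance = 0
  and not exists (
    select null
    from transitive_closure tci
    where tci.child_node_id = tc.child_node_id
      and tci.distance <> 0
  );

select * from table(dbms_xplan.display);
```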


Add ONE root node from which all your current root nodes are descended. Then you would simply query the children of your one root. Problem solved.
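With such a super-root in place, the lookup could be as simple as the sketch below. The bind variable :super_root_id is hypothetical, and the sketch assumes the closure rows linking every node to the new root (including the history columns) have already been inserted and are kept up to date:

```sql
-- :super_root_id is a hypothetical bind variable holding the id of the
-- synthetic root. Its direct children are the former root nodes.
select tc.child_node_id
from transitive_closure tc
where tc.ancestor_node_id = :super_root_id
  and tc.distance = 1;
```

Since the existing unique constraint leads with CHILD_NODE_ID, this lookup would probably also want an index that starts with ANCESTOR_NODE_ID.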
