开发者

Postgresql UNION takes 10 times as long as running the individual queries

开发者 https://www.devze.com 2023-03-12 10:44 出处:网络
I am trying to get the diff between two nearly identical tables in postgresql. The current query I am running is:

I am trying to get the diff between two nearly identical tables in postgresql. The current query I am running is:

SELECT * FROM tableA EXCEPT SELECT * FROM tableB;

and

SELECT * FROM tableB EXCEPT SELECT * FROM tableA;

Each of the above queries takes about 2 minutes to run (Its a large table)

I wanted to combine the two queries in hopes to save time, so I tried:

SELECT * FROM tableA EXCEPT SELECT * FROM tableB
UNION
SELECT * FROM tableB EXCEPT SELECT * FROM tableA;

And while it works, it takes 20 minutes to run!!! I would guess that it would at most take 4 minutes, the amount of time to run each query individually.

I开发者_开发问答s there some extra work UNION is doing that is making it take so long? Or is there any way I can speed this up (with or without the UNION)?

UPDATE: Running the query with UNION ALL takes 15 minutes, almost 4 times as long as running each one on its own, Am I correct in saying that UNION (all) is not going to speed this up at all?


With regards to your "extra work" question. Yes. Union not only combines the two queries but also goes through and removes duplicates. It's the same as using a distinct statement.

For this reason, especially combined with your except statements "union all" would likely be faster.

Read more here: http://www.postgresql.org/files/documentation/books/aw_pgsql/node80.html


In addition to combining the results of the first and second query, UNION by default also removes duplicate records. (see http://www.postgresql.org/docs/8.1/static/sql-select.html). The extra work involved in checking for duplicate records between the two queries is probably responsible for the extra time. In this situation there should not be any duplicate records so the extra work looking for duplicates can be avoided by specifying UNION ALL.

SELECT * FROM tableA EXCEPT SELECT * FROM tableB
UNION ALL
SELECT * FROM tableB EXCEPT SELECT * FROM tableA;


I don't think your code returns resultset you intend it to. I rather think you want to do this:

SELECT * 
  FROM (
        SELECT * FROM tableA 
        EXCEPT 
        SELECT * FROM tableB
       ) AS T1
UNION 
SELECT * 
  FROM (
        SELECT * FROM tableB 
        EXCEPT 
        SELECT * FROM tableA
       ) AS T2;

In other words, you want the set of mutually exclusive members. If so, you need to read up on relational operator precedence in SQL ;) And when you have, you may realise the above can be rationalised to:

SELECT * FROM tableA 
UNION 
SELECT * FROM tableB
EXCEPT 
SELECT * FROM tableA 
INTERSECT
SELECT * FROM tableB;

FWIW, using subqueries (derived tables T1 and T2) to explicitly show (what would otherwise be implicit) relational operator precedence, your original query is this:

SELECT * 
  FROM (
        SELECT * 
          FROM (
                SELECT * 
                  FROM tableA 
                EXCEPT 
                SELECT * 
                  FROM tableB
               ) AS T2
        UNION
        SELECT * 
          FROM tableB
       ) AS T1
EXCEPT 
SELECT * 
  FROM tableA;

The above can be relationalised to:

SELECT * 
  FROM tableB 
EXCEPT 
SELECT * 
  FROM tableA;

...and I think not what is intended.


You could use tableA FULL OUTER JOIN tableB, which would give what you want (with a propre join condition) with only 1 table scan, it probably would be faster than the 2 queries above.

Post more info please.

0

精彩评论

暂无评论...
验证码 换一张
取 消