开发者

jOINING tables with millions of rows

开发者 https://www.devze.com 2023-04-11 01:27 出处:网络
i have the sql below that returns 35k rows and take about 10 mins to run. both tables have millions of row. how can i improve this sql?

i have the sql below that returns 35k rows and take about 10 mins to run. both tables have millions of row. how can i improve this sql?

SELECT /*+ index(T_DIRECTORY X_DIR) */ 
                 DIRx.dir_id ,                   
                 base.id       
                 FROM T_DIRECTORY DIRx,  T_PERSON base 
                 WHERE 
                   DIRx.id = 26463
                   and DIRx.PERSONID=  base.PERSONID

'| Id  | Operation                    | Name           |'
'-------------------------------------------------------'
'|   0 | SELECT STATEMENT             |                |'
'|   1 |  NESTED LOOPS                |                |'
'|   2 |   TABLE ACCESS BY INDEX ROWID| T_DIRECTORY |'
'|   3 |    INDEX RANGE SCAN          | X_DIRECTORY |'
'|   4 |   TABLE ACCESS BY INDEX ROWID| T_PERSON      |'
'|   5 |    INDEX UNIQUE SCAN         | I_PERSON  开发者_如何学编程    |'


First, ensure you have suitable indexes on the columns in the where clause (DIRx.id) and on the table being joined (base.personid) and that these indexes are analyzed so they represent the data in the table - if not analyzed, Oracle may do a full tablescan when it could use the index instead.

SELECT INDEX_NAME, 
       NUM_ROWS, 
       LAST_ANALYZED 
  FROM DBA_INDEXES 
 WHERE TABLE_NAME IN ('T_DIRECTORY','T_PERSON');

Also you are forcing it to use an index by using a hint but if one table is smaller than the other, a hash join might be a better solution so maybe try removing the hint altogether and seeing if it helps.

Parallel Query

Do you have multiple CPUs and nothing else running when this SQL runs- i.e. is it part of a batch process or part of an online process that could be called multiple times simultaneously. If a batch process and you have multiple CPUs, try parallel query but do not do this if it's an online program (e.g. a report as using parallel query will try and use all available CPU and performance could get worse if it is run multiple times simultaneously or if you run more parallel threads than 2 per CPU core.

In practice parallel threads will approximately half the execution time per 4 threads.

Clustered tables/indexes

If these tables are always joined this way, you can consider a clustered table (where oracle will store the joining rows of each table in the same blocks so it doesnt have to spend so long retrieving the joined part but this can have a downside if you also frequently access one of the tables separately.

Context

Looking at a query in isolation doesnt alway reveal the best answer - doing something really fast when it might be the wrong thing doesnt help so look at the context i.e. what will you do with the 35000 rows once returned, are they ones added today only, is there a table that has the subset that could be used instead ?


"i had to denormalize the table"

In an OLTP database de-normalization is always bad news. It speeds up some queries but can slow down others. At best it imposes a performance hit on DML operations, at worst it can lead to data corruption. It certainly makes our applications more complicated.

@trevorNorth makes a valid point about context. Data distribution matters a lot. How many rows in T_DIRECTORY match that ID ? How many rows in that result set have matching rows in T_PERSON ? Do those actual values match the cardinalities in the explain plan? If not perhaps refreshing the database statistics will allow the database to discover a better plan (without the INDEX hint). But if your statistics are up-to-date then you need a different solution.

Such as ... what?

Tuning is a hard task because there are so many things to consider, and the specifics really matter . Some consultants make a very fine living from doing nothing but fixing performance bugs in other peoples' code. They wouldn't be able to do that if tuning was simple.

In the absence of hard facts here are a couple of guesses. Alternative ways to solve this problem:

  1. Eliminate the table reads and perhaps get a hash join as well by building compound indexes:
    T_DIRECTORY(ID, PERSONID, DIR_ID)
    T_PERSON(PERSONID,ID)
  2. Build a fast refresh materialized view on the statement, and allow Query Rewriting to satisfy queries.


i had to denormalize the table.

0

精彩评论

暂无评论...
验证码 换一张
取 消