How Oracle uses statistics data

In a previous question I got a comment about Oracle statistics:

Oracle doesn't know that 50M is greater than the number of rows. Sure, it has statistics, but they could be old and wrong - and Oracle would never allow itself to deliver an incorrect result only because the statistics are wrong.

I was pretty sure that Oracle relies on statistics when preparing a query execution plan. Before version 10 it was recommended to refresh statistics from time to time, and from 10g onwards Oracle gathers statistics automatically.
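For context, manual gathering is typically done through the DBMS_STATS package; here is a minimal sketch, where the APP_OWNER and BIG_TABLE names are just placeholders:

```sql
-- Minimal sketch of a manual statistics refresh (placeholder names).
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname          => 'APP_OWNER',
    tabname          => 'BIG_TABLE',
    estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE,  -- let Oracle pick the sample size
    cascade          => TRUE                          -- gather index statistics too
  );
END;
/
```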

Can somebody explain how much the Oracle query optimizer relies on statistics data?


Oracle uses statistics a lot to generate query execution plans. What it does not (and should not) do is use those statistics in a way that would affect query results, which is what you were trying to do with "ROWNUM < 50000000". The statistics may be out of date, or missing. However, this only means that Oracle may be slow to produce the correct result; it does not mean that Oracle will return an incorrect result.

If Oracle worked as you hoped, then it might decide that "ROWNUM < 50000000" meant "get all rows" even though the table now contained 60,000,000 rows (but had out of date stats saying it contained only 49,000,000). Fortunately it does not.
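To make the point concrete, a query like the following (against a hypothetical big_table) returns rows based only on what the table actually contains; stale statistics can change how Oracle executes it, never what it returns:

```sql
-- The ROWNUM filter caps the result at what the table really holds,
-- regardless of what the (possibly stale) statistics say.
SELECT *
  FROM big_table
 WHERE ROWNUM < 50000000;
```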


Statistics are VERY important for the query optimizer. They should be gathered on a regular basis either automatically or manually.

When executing a query, Oracle produces a pool of candidate execution plans to satisfy it. Those plans are equivalent in the sense that they all return exactly the same result; it's just that the road to getting there may be far more efficient for one plan than another. To determine this efficiency, Oracle uses the stats gathered on the objects used in each of the execution plans to determine their individual costs. If those stats are non-existent or stale, the cost associated with each plan will be less accurate, and therefore the optimal plan may not be chosen.
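You can see the optimizer's cost and row estimates, which are derived from these statistics, with EXPLAIN PLAN and DBMS_XPLAN; a small sketch, where big_table and its status column are hypothetical:

```sql
-- Ask Oracle for the chosen plan; the Rows and Cost columns in the
-- output are estimates computed from the object statistics.
EXPLAIN PLAN FOR
  SELECT * FROM big_table WHERE status = 'OPEN';

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```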

Here are some of the key stats that Oracle uses for determining this cost (a query to inspect them in the data dictionary follows the list):

Table statistics

 * Number of rows
 * Number of blocks
 * Average row length    

Column statistics

 * Number of distinct values (NDV) in column
 * Number of nulls in column
 * Data distribution (histogram)
 * Extended statistics

Index statistics

 * Number of leaf blocks
 * Levels
 * Clustering factor

System statistics

 * I/O performance and utilization
 * CPU performance and utilization
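Most of these figures can be inspected directly in the data dictionary. A minimal sketch using the standard statistics views, with BIG_TABLE as a placeholder table name:

```sql
-- Table statistics: row count, block count, average row length
SELECT num_rows, blocks, avg_row_len
  FROM user_tab_statistics
 WHERE table_name = 'BIG_TABLE';

-- Column statistics: distinct values, nulls, histogram type
SELECT column_name, num_distinct, num_nulls, histogram
  FROM user_tab_col_statistics
 WHERE table_name = 'BIG_TABLE';

-- Index statistics: leaf blocks, tree depth (BLEVEL), clustering factor
SELECT index_name, leaf_blocks, blevel, clustering_factor
  FROM user_ind_statistics
 WHERE table_name = 'BIG_TABLE';
```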


Statistics are used by the Oracle cost-based optimizer (CBO) to calculate the relative costs of different ways of executing a query, so that the most appropriate one can be chosen.

On the whole this works very well, and it is being continually improved. For example, in 11g you can gather multicolumn histograms that help greatly with queries having predicates on correlated columns (e.g. strongly correlated, like month of birth and star sign, or more weakly correlated, like gender and height).
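In 11g the mechanism for this is extended statistics on a column group; a sketch, assuming a hypothetical PEOPLE table with MONTH_OF_BIRTH and STAR_SIGN columns:

```sql
-- Define a column group so the optimizer gets combined statistics
-- for the correlated pair (MONTH_OF_BIRTH, STAR_SIGN).
SELECT DBMS_STATS.CREATE_EXTENDED_STATS(
         ownname   => USER,
         tabname   => 'PEOPLE',
         extension => '(MONTH_OF_BIRTH, STAR_SIGN)'
       ) AS extension_name
  FROM dual;

-- Regather statistics so the new column group gets values
-- (and, where Oracle decides it helps, a histogram).
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(USER, 'PEOPLE',
    method_opt => 'FOR ALL COLUMNS SIZE AUTO');
END;
/
```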

However, it is not perfect. For example, estimating the cardinality of the result set of a join between two tables is reasonably accurate, as is estimating the cardinality from a filter operation, but combining the two requires a lot of estimation that can easily be inaccurate. In some cases these issues can be worked around with hints, or with the use of global temporary tables for intermediate result sets.
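As a sketch of that kind of workaround (the table names and the cardinality figure here are purely hypothetical), an intermediate result can be staged in a global temporary table, or the optimizer's estimate can be overridden with a hint:

```sql
-- Stage the intermediate result set in a global temporary table so that
-- later joins work against a row source whose size is easy to estimate.
CREATE GLOBAL TEMPORARY TABLE tmp_recent_orders (
  order_id    NUMBER,
  customer_id NUMBER
) ON COMMIT PRESERVE ROWS;

INSERT INTO tmp_recent_orders
  SELECT order_id, customer_id
    FROM orders
   WHERE order_date > SYSDATE - 30;

-- Alternatively, correct a poor estimate directly with a hint
-- (CARDINALITY is undocumented, but commonly used for exactly this).
SELECT /*+ CARDINALITY(t 1000) */ t.customer_id, COUNT(*)
  FROM tmp_recent_orders t
 GROUP BY t.customer_id;
```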

Another problem with statistics is that changing them can change execution plans, so more recently there has been a movement either to discourage continual gathering of statistics, or to analyse the impact of statistics changes before implementing them.
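One mechanism for the latter approach is pending statistics (available from 11g): statistics are gathered without being published, tested in a single session, and only then made visible to everyone. A sketch with placeholder names:

```sql
-- Gather new statistics as "pending" instead of publishing them immediately.
BEGIN
  DBMS_STATS.SET_TABLE_PREFS(USER, 'BIG_TABLE', 'PUBLISH', 'FALSE');
  DBMS_STATS.GATHER_TABLE_STATS(USER, 'BIG_TABLE');
END;
/

-- Try the new statistics in this session only and check the plans.
ALTER SESSION SET optimizer_use_pending_statistics = TRUE;

-- If the plans look good, publish the pending statistics for everyone.
BEGIN
  DBMS_STATS.PUBLISH_PENDING_STATS(USER, 'BIG_TABLE');
END;
/
```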

Look for the Jonathan Lewis book -- it is a very thorough treatment of the subject.
