Why is MySQL with InnoDB doing a table scan when key exists and choosing to examine 70 times more rows?_问答_开发者

I'm troubleshooting a query performance problem. Here's an expected query plan from explain:

mysql> explain select * from table1 where tdcol between '2010-04-13 00:00' and '2010-04-14 03:16';
+----+-------------+--------------------+-------+---------------+--------------+---------+------+---------+-------------+
| id | select_type | table              | type  | possible_keys | key          | key_len | ref  | rows    | Extra       |
+----+-------------+--------------------+-------+---------------+--------------+---------+------+------开发者_Python百科---+-------------+
|  1 | SIMPLE      | table1             | range | tdcol         | tdcol        | 8       | NULL | 5437848 | Using where | 
+----+-------------+--------------------+-------+---------------+--------------+---------+------+---------+-------------+
1 row in set (0.00 sec)

That makes sense, since the index named tdcol (KEY tdcol (tdcol)) is used, and about 5M rows should be selected from this query.

However, if I query for just one more minute of data, we get this query plan:

mysql> explain select * from table1 where tdcol between '2010-04-13 00:00' and '2010-04-14 03:17';
+----+-------------+--------------------+------+---------------+------+---------+------+-----------+-------------+
| id | select_type | table              | type | possible_keys | key  | key_len | ref  | rows      | Extra       |
+----+-------------+--------------------+------+---------------+------+---------+------+-----------+-------------+
|  1 | SIMPLE      | table1             | ALL  | tdcol         | NULL | NULL    | NULL | 381601300 | Using where | 
+----+-------------+--------------------+------+---------------+------+---------+------+-----------+-------------+
1 row in set (0.00 sec)

The optimizer believes that the scan will be better, but it's over 70x more rows to examine, so I have a hard time believing that the table scan is better.

Also, the 'USE KEY tdcol' syntax does not change the query plan.

Thanks in advance for any help, and I'm more than happy to provide more info/answer questions.

5 million index probes could well be more expensive (lots of random disk reads, potentially more complicated synchronization) than reading all 350 million rows (sequential disk reads).

This case might be an exception, because presumably the order of the timestamps roughly matches the order of the inserts into the table. But, unless the index on tdcol is a "clustered" index (meaning that the database ensures that the order in the underlying table matches the order in tdcol), its unlikely that the optimizer knows this.

In the absence of that order correlation information, it would be right to assume that the 5 million rows you want are roughly evenly distributed among the 350 million rows, and thus that the index approach will involve reading most or nearly all of the pages in the underlying row anyway (in which case the scan will be much less expensive than the index approach, fewer reads outright and sequential instead of random reads).

MySQL's query generator has a cutoff when figuring out how to use an index. As you've correctly identified, MySQL has decided a table scan will be faster than using the index, and won't be dissuaded from it's decision. The irony is that when the key-range matches more than about a third of the table, it is probably right. So why in this case?

I don't have an answer, but I have a suspicion MySQL doesn't have enough memory to explore the index. I would be looking at the server memory settings, particularly the Innodb memory pool and some of the other key storage pools.

What's the distribution of your data like? Try running a min(), avg(), max() on it to see where it is. It's possible that that 1 minute makes the difference in how much information is contained in that range.

It also can just be the background setting of InnoDB There are a few factors like page size, and memory like staticsan said. You may want to explicitly define a B+Tree index.

"so I have a hard time believing that the table scan is better."

True. YOU have a hard time believing it. But the optimizer seems not to.

I won't pronounce on your being "right" versus your optimizer being "right". But optimizers do as they do, and, all in all, their "intellectual" capacity must still be considered as being fairly limited.

That said, do your database statistics show a MAX value (for this column) that happens to be equal to the "one second more" value ?

If so, then the optimizer might have concluded that all rows satisfy the upper limit anyway, and mighthave decided to proceed differently, compared to the case when it has to conclude that, "oh, there are definitely some rows that won't satisfy the upper limit either, so I'll use the index just to be on the safe side".