I've heard several times that you sho开发者_如何转开发uldn't perform COUNT(*)
or SELECT *
for performance reasons, but wasn't able to dig up some further information about it.
I can imagine that the database is then using all columns for the action, which can be an impressive performance loss, but I'm not sure about that. Does somebody have further information about the topic?
1. On count(*) vs. count(something else)
SQL is declarative in that you specify what you want. This is different from specifying how to get what you want. That means the database engine is free to realize your query in whatever way it thinks is the most efficient. Many database optimizers rewrites your query to a less costly alternative (if such a plan is available).
Given the following table:
table(
pk not null
,color not null
,nullable null
,unique(pk)
,index(color)
);
...all of the following are functionally equivalent (due to the mechanics of count and nulls):
1) select count(*) from table;
2) select count(1) from table;
3) select count(pk) from table;
4) select count(color) from table;
Regardless of which form you use, the optimizer is free to rewrite the query to another form if it is more efficient. (Again, not all optimizers are sophisticated enough to do this). The unique index(pk) would be smaller (bytes occupied) than the entire table. Therefore it would be more efficient to count the number of index entries rather than scanning through the entire table. In Oracle we have bitmap indexes, which also compress repeating strings. If we had used such an index on the color column, it would probably have been the smallest index to scan. Oracle also supports table compression which in some cases makes the physical table smaller than a composite index.
1. TL;DR;
Your specific dbms will have its own set of tools that enables different rewriting rules and in turn execution plans. That renders the question somewhat useless (unless we talk about a specific release of a specific dbms). I recommend COUNT(*)
in all cases because it requires the least cognitive effort to grasp.
2. On select a,b,c vs. select *
There are very few valid uses of SELECT *
in code you write and put into production. Imagine a table which contains Bluray movies (yes, the movies is stored as a blob in this table). So you slapped together your awesomesauce abstraction layer and put SELECT * FROM movies where id = ?
in the getMovies(movie_id)
method. I will refrain myself from explaining why SELECT name FROM movies
will be transported across the network just a tad faster. Of course, in most realistic cases it won't have a noticable impact.
One last point on performance is that when all the referenced columns (selected, filtered) in your query exists as an index (called a covering index), the database need not touch the table at all. It can be fully resolved from scanning the index only. By selecting all columns you remove this option from the optimizer.
Another thing about SELECT *
which is far more serious than anything, is that it creates an implicit dependency on a specific physical layout of the table. Let me explain. Consider the following tables:
table T1(name, id)
table T2(name, id)
The following statement...
insert into t1 select * from t2;
... will break or produce a different result if any of the following happens:
- Any of the tables columns are rearranged for example T1(id, name)
- T1 gets an additional not-null column
- T2 gets another column
2. TL;DR; When possible, explicitly specify the columns you want (eventually, you'll have to do that anyway). Also, selecting fewer columns are faster than selecting more columns. A possitive side-effect on explicit selects is that it gives greater freedom to the optimizer.
COUNT(*)
is different from COUNT(column1)
!
COUNT(*)
returns the number of records, and does NOT use more resources, while COUNT(column1)
counts the number of records where column1 is non null.
For SELECT
, it is different. SELECT *
will of course request more data.
When using count(*)
the *
doesn't mean "all fields". Using count(field)
will count all non-null values in the field, but count(*)
will always count all records even if all fields in all records are null, so it doesn't need to check the data in the fields at all.
Using select *
means that you almost always return more data than you are going to use, which of course is a waste. However, perhaps more serious is the maintainence problem; if you add fields to a table your query will return these too. That might mean that the record becomes too large to fit in the buffer, resulting in an error message.
Don't confuse the * in "COUNT(*)" with the * in "SELECT * ". They are completely unrelated but sometimes confused because it's such an odd syntax. There is nothing wrong with using COUNT(*), which just means "count rows".
SELECT * on the other hand means "select all columns". That's generally poor practice because it tightly couples your code to the database schema. That means when you change the table you probably have to change the code even if it should have been unaffected. It increases the impact of any schema change.
SELECT * may also cause a sub-optimal query plan. Either because you didn't really need all columns or because it forces the DBMS to do an extra lookup at runtime to get the list of columns.
It's absolutely true that "*" is "all columns". And you're right in the point of if you've a table with an incredible number of columns (say 100+), these kind of queries can be bad in terms of efficiency.
I believe that the best solution is creating database views previously filtering the amount of records evolved in the count operation, so, the performance impact isn't a big problem, because views can be cached.
In the other hand, it seems that "*" operator should be avoided when returning records, and it's brutally better to select the fields you really need to use in some business.
When using SELECT *
it can have a performance hit. Applications which use the SELECT *
syntax when they actually only need a handful of columns are transferring more data across the network than they need to consume, which is wasteful.
Also, in Microsoft SQL Server at least, there's a strange problem when you use SELECT *
in a view and then add a column to the underlying table. The column headings and data returned by the view don't match each other following certain changes! See my blog post for further details of this particular problem.
Depending on the size of the database depends on how inefficient it becomes, the simnplest way to describe would be like so:
when you specifically do:
SELECT column1,column2,column3 FROM table1
Mysql knows exactly exactly what columns it looking for, but when you do
SELECT * FROM table1
Mysql does not know the columns you want, it knows you want all of them but not the names, so it has to perform extra tasks that analyse the table to discover the columns, thus resulting in using resources.
In case of COUNT(*)
it depends on database and its version. For example in modern versions of MS SQL
it doesn't matter [source needed].
So the best approach in case of COUNT(*)
is to measure it.
Using SELECT *
is really bad idea. *
means read all columns which can be heavy IO and network operation (especially for various type of CHAR
columns). Moreover -- rather rarely you need all columns.
精彩评论