Is it better to filter a resultset using a WHERE clause or using application code?


OK, here is a simple abstraction of the problem:

Two variables (male_users and female_users) store two groups of users, i.e. male and female.

  1. One way is to use two queries to select them (see the sketch after the snippet below):

select * from users where gender = 'male' and then store the result in male_users

select * from users where gender = 'female' and then store the result in female_users

  2. Another way is to run only one query:

select * from users, and then loop over the result set in application code to split the users by gender. The PHP snippet would be something like this:

$male_users = array();
$female_users = array();
$result = mysql_query('select * from users');
// mysql_fetch_assoc() returns false once all rows have been read
while (($row = mysql_fetch_assoc($result)) !== false) {
    if ($row['gender'] == 'male') {
        $male_users[] = $row;
    } elseif ($row['gender'] == 'female') {
        $female_users[] = $row;
    }
}
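
For comparison, here is a sketch of the first approach, letting the WHERE clause do the filtering (same mysql_* API, error handling omitted):

// Approach 1: let the database filter each group.
$male_users = array();
$result = mysql_query("select * from users where gender = 'male'");
while (($row = mysql_fetch_assoc($result)) !== false) {
    $male_users[] = $row;
}

$female_users = array();
$result = mysql_query("select * from users where gender = 'female'");
while (($row = mysql_fetch_assoc($result)) !== false) {
    $female_users[] = $row;
}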

Which one is more efficient and considered the better approach?

This is just a simple illustration of the problem; the real project may have larger tables to query and more filter options.

Thanks in advance!


The rule of thumb for any application is to let the DB do the things it does well: filtering, sorting, and joining.

Separate the queries into their own functions or class methods:

$men = $foo->fetchMaleUsers();
$women = $foo->fetchFemaleUsers();
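
A minimal sketch of what such a class might look like, reusing the mysql_* API from the question (the class name and the fetchAll() helper are hypothetical):

class UserRepository {
    // Hypothetical helper: run a query and collect every row.
    private function fetchAll($sql) {
        $result = mysql_query($sql);
        $rows = array();
        while (($row = mysql_fetch_assoc($result)) !== false) {
            $rows[] = $row;
        }
        return $rows;
    }

    public function fetchMaleUsers() {
        return $this->fetchAll("select * from users where gender = 'male'");
    }

    public function fetchFemaleUsers() {
        return $this->fetchAll("select * from users where gender = 'female'");
    }
}

With $foo = new UserRepository(), the calling code above reads straight through, and the filtering stays on the database side.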

Update

I took Steven's PostgreSQL demonstration, in which a full-table-scan query performs twice as well as two separate indexed queries, and mimicked it using MySQL (which is what the actual question is about):

Schema

CREATE TABLE `gender_test` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `gender` enum('male','female') NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=26017396 DEFAULT CHARSET=utf8

I changed the gender column to an ENUM rather than a VARCHAR(20), which is more realistic for this kind of column, and I gave the table a proper primary key as you would expect, instead of an arbitrary DOUBLE value.

Unindexed Results

mysql> select sql_no_cache * from gender_test WHERE gender = 'male';

12995993 rows in set (31.72 sec)

mysql> select sql_no_cache * from gender_test WHERE gender = 'female';

13004007 rows in set (31.52 sec)

mysql> select sql_no_cache * from gender_test;

26000000 rows in set (32.95 sec)

I trust this needs no explanation.

Indexed Results

ALTER TABLE gender_test ADD INDEX (gender);

...

mysql> select sql_no_cache * from gender_test WHERE gender = 'male';

12995993 rows in set (15.97 sec)

mysql> select sql_no_cache * from gender_test WHERE gender = 'female';

13004007 rows in set (15.65 sec)

mysql> select sql_no_cache * from gender_test;

26000000 rows in set (27.80 sec)

The results shown here are radically different from Steven's data. The indexed queries perform almost twice as fast as the full table scan. This is from a properly indexed table using common sense column definitions. I don't know PostgreSQL at all, but there must be some significant misconfiguration in Steven's example to not show similar results.

Given PostgreSQL's reputation for doing things at least as well as MySQL, I daresay that PostgreSQL would demonstrate similar performance if properly used.

Also note that on this same machine, an overly simplified for loop doing 52 million comparisons takes an additional 7.3 seconds to execute:

<?php
// Time 52 million trivial comparisons: roughly two per row
// for the 26 million rows above.
$N = 52000000;
for ($i = 0; $i < $N; $i++) {
    if (true == true) {
    }
}

I think it's rather obvious which approach is better, given this data.


I'd argue that there's really no reason to make your DB do the extra work of evaluating the WHERE clause. Given that you actually want all the records, you will have to do the work of fetching them. If you do a single SELECT from the table, it will retrieve them all in table-order and you can partition them yourself. If you SELECT WHERE male and SELECT WHERE female, you'll have to hit an index for each operation, and you'll lose some data locality.

For example, if your records on disk are alternating male-female and you have a dataset much larger than memory, you'll likely have to read the entire database twice if you do two separate queries, whereas a single SELECT for both will be a single table scan.

EDIT: Since I'm getting downmodded into oblivion, I decided to actually run the test. I generated a table:

CREATE TEMPORARY TABLE gender_test (some_data DOUBLE PRECISION, gender CHARACTER VARYING(20));

I generated some random data:

select gender, count(*) from gender_test group by gender;

 gender |  count
--------+----------
 female | 12603133
 male   | 10465539
(2 rows)

First, let's run these tests without indices, in which case I'm quite sure I'm right...

test=> EXPLAIN ANALYSE SELECT * FROM gender_test WHERE gender='male';
QUERY PLAN
----------
Seq Scan on gender_test (cost=0.00..468402.00 rows=96519 width=66) (actual time=0.030..4595.367 rows=10465539 loops=1)
  Filter: ((gender)::text = 'male'::text)
Total runtime: 5150.263 ms

test=> EXPLAIN ANALYSE SELECT * FROM gender_test WHERE gender='female';
QUERY PLAN
----------
Seq Scan on gender_test (cost=0.00..468402.00 rows=96519 width=66) (actual time=0.029..4751.219 rows=12603133 loops=1)
  Filter: ((gender)::text = 'female'::text)
Total runtime: 5418.891 ms

test=> EXPLAIN ANALYSE SELECT * FROM gender_test;
QUERY PLAN
----------
Seq Scan on gender_test (cost=0.00..420142.40 rows=19303840 width=66) (actual time=0.021..3326.164 rows=23068672 loops=1)
Total runtime: 4543.393 ms
(2 rows)

Funny, it looks like fetching the data in a single unfiltered table scan is indeed faster! In fact, more than twice as fast: 5150 + 5418 ms for the two filtered scans versus 4543 ms for one full scan. Much like I predicted! :-p

Now, let's make an index and see if it changes the results...

CREATE INDEX test_index ON gender_test(gender);

Now to rerun the same queries...

test=> EXPLAIN ANALYSE SELECT * FROM gender_test WHERE gender='male';
QUERY PLAN
----------
Bitmap Heap Scan on gender_test (cost=2164.69..195922.27 rows=115343 width=66) (actual time=2008.877..4388.348 rows=10465539 loops=1)
  Recheck Cond: ((gender)::text = 'male'::text)
  -> Bitmap Index Scan on test_index (cost=0.00..2135.85 rows=115343 width=0) (actual time=2006.047..2006.047 rows=10465539 loops=1)
       Index Cond: ((gender)::text = 'male'::text)
Total runtime: 4941.64 ms

test=> EXPLAIN ANALYSE SELECT * FROM gender_test WHERE gender='female';
QUERY PLAN
----------
Bitmap Heap Scan on gender_test (cost=2164.69..195922.27 rows=115343 width=66) (actual time=1915.385..4269.933 rows=12603133 loops=1)
  Recheck Cond: ((gender)::text = 'female'::text)
  -> Bitmap Index Scan on test_index (cost=0.00..2135.85 rows=115343 width=0) (actual time=1912.587..1912.587 rows=12603133 loops=1)
       Index Cond: ((gender)::text = 'female'::text)
Total runtime: 4931.555 ms
(5 rows)

test=> EXPLAIN ANALYSE SELECT * FROM gender_test;
QUERY PLAN
----------
Seq Scan on gender_test (cost=0.00..457790.72 rows=23068672 width=66) (actual time=0.021..3304.836 rows=23068672 loops=1)
Total runtime: 4523.754 ms

Funny... scanning the entire table in one go is still twice as fast: 4941 + 4931 ms for the two indexed queries versus 4523 ms for the single scan.

NOTE: There are all sorts of ways this is unscientific. I'm running with 16 GB of RAM, so the entire dataset fits into memory; Postgres isn't configured to use nearly that much, but the disk cache still helps. I'd hypothesize (but can't be assed to actually try) that the effect only gets worse once you hit disk. I only tried the default btree Postgres index. And I'm assuming the PHP partitioning takes no time, which is not true, but probably a pretty reasonable approximation.

All tests run on a Mac Pro 8-way 2.66 Xeon 16GB RAID-0 7200rpm

Also, this dataset is 26 million rows, which is probably a bit larger than most people care about...

Obviously, raw speed isn't the only thing you care about. In many (most?) applications, you'd care more about the logical "correctness" of fetching them separately. But, when it comes down to your boss saying "we need this to go faster" this will (apparently) give you a 2x speedup. The OP explicitly asked about efficiency. Happy?


If you have 1 million users (assuming half of them are male and half are female), do you prefer:

  • fetching 1 million users from the DB?
  • or fetching only 500k users from the DB?

I suppose you will answer that you prefer to fetch only half the users ;-) And with a more complex condition, it could be even fewer than that.


Basically, fetching less data means:

  • less network bandwidth used "for nothing" (i.e. to fetch data that will immediately be discarded)
  • less memory used, especially on the PHP server
  • potentially less disk access on the MySQL server -- as there is less data to fetch from disk

In general, we try to avoid fetching more data than necessary; i.e., we place the filtering on the database side.


Of course, this means you'll have to think about the indexes you place on your database's tables: they'll have to fit the queries you'll be running, as in the sketch below.
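
For instance, if the application often filters on two columns at once, a composite index fits that query shape. This is a hypothetical illustration; the country column is invented for the example:

-- Hypothetical: a composite index for queries filtering on both columns.
ALTER TABLE users ADD INDEX idx_gender_country (gender, country);

-- It can then serve queries such as:
-- SELECT * FROM users WHERE gender = 'male' AND country = 'fr';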

