I came across a question today of search efficiency for large sets today and I've done by best to boil it down to the most basic case. I feel like this sort of thing probably relates to some classic problem or basic concept I'm missing, so a pointer to that would be great.
Suppose I have a table definition like
CREATE TABLE foo(
id int,
type bool,
reference int,
PRIMARY KEY(id),
FOREIGN KEY(reference) REFERENCES foo(id),
UNIQUE KEY(reference)
) Engine=InnoDB;
Populated with n rows where n/2 are randomly assigned type=1. Each row references another with its same type except for the first, which has reference=null.
Now we want to print all items with type 1. I assume that at some point, it will be faster to recursively call something like
function printFoo1($ref){
if($ref==null)
return;
$q = 'SELECT id, reference FROM foo WHERE id='.$ref;
$arr = mysql_fetch_array( mysql_query($q) );
echo $arr[0];
printFoo1($arr[1]);
}
As opposed to
function printFoo2($ref){
$q = 'SELECT id FROM foo WHERE type=1';
$res = mysql_query($q);
while( $id = mysql_fetch_array($res) ){
echo $id[0];
}
}
The main point here being that function 1 searches for "id", which is indexed, whereas function 2 has to make n/2 comparisons that don't result in a hit, but that the overhead of multiple queries is going to be significantly greater than the single SELECT.
Is my 开发者_运维问答assumption correct? If so, how large of a data set would we need before function 1 outperforms function 2?
Your example is a bit difficult to parse, but ill start at the top:
Your first function does not return all of the elements with type = 1. It returns all of the elements that are dependent (based on references) to the element you pass in. From the PHP standpoint, since the link/handle is already open there is a non-trivial overhead from your function call with each successive request, not to mention the string concatenation incurring a cost with each execution of that line.
Typically it is better to use the second function styling because it only queries the database one time and will return the elements you are requesting without further work. It will come down to a profiler of course, to determine which is going to return faster, but from my tests the second is hands down the better choice:
This was executed with n = 5000 elements in the db (n/2 = 2500 type 1 and passing in reference = highest id with type = 1 from query of db).
printFoo1: 3.591840 seconds
printFoo2: 0.010340 seconds
It wouldn't really make sense for it to work any other way. If you were able to do what you propose that would make JOIN
calls have to perform less efficient as well.
Code
$res = mysql_query('SELECT MAX( id ) as `MAX_ID` FROM `foo` WHERE `type` = 1', $link);
$res2 = mysql_fetch_assoc($res);
$id = $res2['MAX_ID'];
// cleanup result and free resources here
echo "printFoo1: ";
$start = microtime(true);
printFoo1($id);
echo microtime(true) - $start;
echo '<br />';
echo "printFoo2: ";
$start = microtime(true);
printFoo2();
echo microtime(true) - $start;
mysql_close($link);
All of this was tested on PHP 5.2.17 running on Linux
精彩评论