I have a table with a billion rows and I would like to determine the average time and standard deviation of time for several queries of the form:
select * from mytable where col1 = '36e2ae77-43fa-4efa-aece-cd7b8b669043';
select * from mytable where col1 = '4b58c002-bea4-42c9-8f31-06a499cabc51';
select * from mytable where col1 = 'b97242ae-9f6c-4f36-ad12-baee9afae194';
....
I have a thousand random values for col1 stored in another table.
Is there some way to store how long each of these queries took (in milliseconds) in a separate table, so that I can run some statistics on them? Something like: for each col1 in my random table, execute the query, record the time, then store it in another table.
A completely different approach would be fine, as long as I can stay within PostgreSQL (i.e., I don't want to write an external program to do this).
You need to change your PostgreSQL configuration file.
Enable this setting (its default of -1 disables it):
log_min_duration_statement = 0    # -1 disables, 0 logs all statements
                                  # and their durations, > 0 logs only
                                  # statements running at least this number
                                  # of milliseconds
After that, execution times will be logged and you will be able to figure out exactly how well (or badly) your queries are performing.
You can also use log-parsing utilities such as pgFouine to produce nice HTML reports for further analysis.
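If you would rather not edit postgresql.conf by hand, a minimal sketch (assuming PostgreSQL 9.4+ and superuser rights) is to change the setting from SQL and reload the configuration:

ALTER SYSTEM SET log_min_duration_statement = 0;   -- log every statement with its duration
SELECT pg_reload_conf();                            -- apply the change without a restart

Matching entries then appear in the server log as lines of the form "LOG:  duration: 1.273 ms  statement: ...", which pgFouine (or a simple script) can aggregate.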
Are you aware of the EXPLAIN statement?
This command displays the execution plan that the PostgreSQL planner generates for the supplied statement. The execution plan shows how the table(s) referenced by the statement will be scanned — by plain sequential scan, index scan, etc. — and if multiple tables are referenced, what join algorithms will be used to bring together the required rows from each input table.
The most critical part of the display is the estimated statement execution cost, which is the planner's guess at how long it will take to run the statement (measured in units of disk page fetches). Actually two numbers are shown: the start-up time before the first row can be returned, and the total time to return all the rows. For most queries the total time is what matters, but in contexts such as a subquery in EXISTS, the planner will choose the smallest start-up time instead of the smallest total time (since the executor will stop after getting one row, anyway). Also, if you limit the number of rows to return with a LIMIT clause, the planner makes an appropriate interpolation between the endpoint costs to estimate which plan is really the cheapest.
The ANALYZE option causes the statement to be actually executed, not only planned. The total elapsed time expended within each plan node (in milliseconds) and total number of rows it actually returned are added to the display. This is useful for seeing whether the planner's estimates are close to reality.
You could pretty easily write a script that runs EXPLAIN ANALYZE on your query for each of the random values in the other table, and saves the output to a file / table / etc.; a sketch follows below.
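A rough sketch of such a script as a PL/pgSQL DO block (the table names random_values and explain_output are assumptions, not from the original question; explain_output would be created beforehand with columns col1 and plan_line):

-- assumes: CREATE TABLE explain_output (col1 uuid, plan_line text);
DO $$
DECLARE
    v  RECORD;
    ln TEXT;
BEGIN
    -- one pass per stored test value
    FOR v IN SELECT col1 FROM random_values LOOP
        -- EXPLAIN ANALYZE returns one text row per plan line; save them all
        FOR ln IN EXECUTE format(
            'EXPLAIN ANALYZE SELECT * FROM mytable WHERE col1 = %L', v.col1)
        LOOP
            INSERT INTO explain_output (col1, plan_line) VALUES (v.col1, ln);
        END LOOP;
    END LOOP;
END;
$$;

The "actual time" figures in the saved plan lines can then be parsed out, or you can record only the total runtime per value as in the get_sql_runtime() answer further down.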
Directly, no, there's not. But you can get an indirect and pretty close estimate by checking the time right before and right after the query you're interested in timing.
$sql = "Your Query";
$bm = "SELECT extract(epoch FROM clock_timestamp())";
$query = "{$bm}; {$sql}; {$bm};";
The clock_timestamp() function gives you the server's actual time when the statement starts. Since that SELECT involves no tables, we can expect it to be almost instantaneous. Presumably any Pg driver supports multiple queries; it is important that these 3 queries (the real one and the 2 extras) go together, otherwise you'd be measuring data transport times as well...
For PHP I have a function to handle this. In summary it goes like:
<?php
function pgquery($sql, $conn)
{
    // Prepend and append benchmarking queries
    $bm    = "SELECT extract(epoch FROM clock_timestamp())";
    $query = "{$bm}; {$sql}; {$bm};";

    // Execute the query, and time it (data transport included)
    $resources = array();
    $ini = microtime(true);
    pg_send_query($conn, $query);
    while ($resource = pg_get_result($conn))
    {
        $resources[] = $resource;
    }
    $end = microtime(true);

    // "Extract" the benchmarking results
    $q_ini = pg_fetch_row(array_shift($resources));
    $q_end = pg_fetch_row(array_pop($resources));

    // Compute times
    $time   = round($end - $ini, 4);             # Total time (inc. transport)
    $q_time = round($q_end[0] - $q_ini[0], 4);   # Query time (Pg server only)

    return $resources;
}
?>
I just left the basics there. $conn holds a link to a Pg connection, and $resources is an array of the returned pg resources (in case you sent multiple queries in your $sql).
$time holds the total time from when the query left for the Pg server until the result arrived. $q_time holds only the actual query time you wanted (or a very good approximation).
Add error handling and other processing to your liking; I have plenty, but it's irrelevant to your question.
You CANNOT do this in plain SQL, because even if you were able to call each of these statements in a loop, each call to now() would return the same result, since you are in a single transaction.
It is only possible by using a volatile alternative to now(), such as clock_timestamp(), which returns a different value on each invocation.
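A quick illustration of the difference inside one transaction:

BEGIN;
SELECT now(), clock_timestamp();   -- both close to the transaction start time
SELECT pg_sleep(1);
SELECT now(), clock_timestamp();   -- now() is unchanged, clock_timestamp() has advanced
COMMIT;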
Here is a SQL version of @user832146's proposal:
CREATE OR REPLACE FUNCTION "get_sql_runtime"(
    PAR_sql TEXT
  , OUT sql_runtime REAL
)
AS $$
DECLARE
    run_time_start TIMESTAMP WITH TIME ZONE;
    run_time_end   TIMESTAMP WITH TIME ZONE;
BEGIN
    SELECT clock_timestamp() INTO run_time_start;
    EXECUTE PAR_sql;
    SELECT clock_timestamp() INTO run_time_end;
    SELECT EXTRACT(EPOCH FROM (run_time_end - run_time_start)) INTO sql_runtime;
END; $$
LANGUAGE plpgsql
VOLATILE;
You use it as:
SELECT * FROM get_sql_runtime('SELECT * FROM "my_table_name"');
And it will yield a result like:
 sql_runtime
-------------
     1.47588
NOTE: the returned value is in seconds.
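To get back to the original question (average and standard deviation over the thousand stored values), one possible sketch on top of this function (the table names random_values and query_timings, and the uuid column type, are assumptions) is:

-- collect one timing per stored test value
CREATE TABLE query_timings (col1 uuid, runtime_seconds real);

INSERT INTO query_timings (col1, runtime_seconds)
SELECT r.col1,
       get_sql_runtime(format('SELECT * FROM mytable WHERE col1 = %L', r.col1))
FROM random_values AS r;

-- then run the statistics
SELECT avg(runtime_seconds)    AS avg_seconds,
       stddev(runtime_seconds) AS stddev_seconds
FROM query_timings;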