I have a table which collects data for web pages performance. There are multiple machines, testing multiple sites in 10 minutes intervals, so currently I have about 700 000 rows (920 MB) with +/- 50 000 new rows daily.
Table source:
SET SQL_MODE="NO_AUTO_VALUE_ON_ZERO";
CREATE TABLE `http_perf_raw_log` (
`run_dt` int(11) DEFAULT NULL,
`dataset` varchar(64) DEFAULT NULL,
`runner` varchar(64) DEFAULT NULL,
`site` varchar(128) DEFAULT NULL,
`machine` varchar(32) DEFAULT NULL,
`called_url` varchar(1024) DEFAULT NULL,
`method` varchar(8) DEFAULT NULL,
`url` varchar(1024) DEFAULT NULL,
`content_type` varchar(64) DEFAULT NULL,
`http_code` int(11) DEFAULT NULL,
`header_size` int(11) DEFAULT NULL,
`request_size` int(11) DEFAULT NULL,
`filetime` int(11) DEFAULT NULL,
`ssl_verify_result` int(11) DEFAULT NULL,
`redirect_count` int(11) DEFAULT NULL,
`total_time` decimal(6,4) DEFAULT NULL,
`namelookup_time` decimal(6,4) DEFAULT NULL,
`connect_time` decimal(6,4) DEFAULT NULL,
`pretransfer_time` decimal(6,4) DEFAULT NULL,
`starttransfer_time` decimal(6,4) DEFAULT NULL,
`redirect_time` decimal(6,4) DEFAULT NULL,
`size_upload` int(11) DEFAULT NULL,
`size_download` int(11) DEFAULT NULL,
`speed_download` int(11) DEFAULT NULL,
`speed_upload` int(11) DEFAULT NULL,
`download_content_length` int(11) DEFAULT NULL,
`upload_content_length` int(11) DEFAULT NULL,
`certinfo` varchar(1024) DEFAULT NULL,
`request_header` varchar(1024) DEFAULT NULL,
`return_content` varchar(4096) DEFAULT NULL,
`return_headers` varchar(2048) DEFAULT NULL,
KEY `run_dt_idx` (`run_dt`),
KEY `dataset_idx` (`dataset`),
KEY `runner_idx` (`runner`),
KEY `site_idx` (`site`),
KEY `machine_idx` (`machine`),
KEY `total_time_idx` (`total_time`)
) ENGINE=MyISA开发者_开发百科M DEFAULT CHARSET=latin1;
For aggregating stats (with 1 hour resolution), I created a view:
CREATE OR REPLACE VIEW http_perf_stats (dataset, runner, site, machine, day, hour, calls, total_time, namelookup_time, connect_time, pretransfer_time, starttransfer_time, size_download) AS
SELECT dataset, runner, site, machine,
DATE_FORMAT(run_dt, '%Y-%m-%d') AS day,
DATE_FORMAT(run_dt, '%k') AS hour,
COUNT(*) AS calls,
SUM(total_time),
SUM(namelookup_time),
SUM(connect_time),
SUM(pretransfer_time),
SUM(starttransfer_time),
SUM(size_download)
FROM http_perf_raw_log GROUP BY runner, site, machine, day, hour ORDER BY `day` DESC
But the performance of VIEW (and underlying SELECT) is terrible - takes about 4 seconds.
So, my questions:
1. Is using GROUP BY in a VIEW good idea at all? And if not, what is better alternative?
2. Is there ( I imagine yes, I am not SQL expert :/) a way to optimize this SELECT (changing query or structure of http_perf_raw_log)?
Remove the GROUP BY
from the VIEW and use it in the SELECT
that calls the VIEW
.
In this case it might be a good idea to only create statistics periodically (once per hour for example).
I'd do that as follows. Run the following code once to create a table structure.
CREATE TABLE http_perf_stats AS
SELECT dataset, runner, site, machine,
DATE_FORMAT(run_dt, '%Y-%m-%d') AS day,
DATE_FORMAT(run_dt, '%k') AS hour,
COUNT(*) AS calls,
SUM(total_time),
SUM(namelookup_time),
SUM(connect_time),
SUM(pretransfer_time),
SUM(starttransfer_time),
SUM(size_download)
FROM http_perf_raw_log
GROUP BY runner, site, machine, day, hour
ORDER BY `day` DESC
Make some modifications like changing field types, default values, adding a primary key, and perhaps add some indexes so that you can access and query this table in a fast way.
From then on, update the table like this:
START TRANSACTION;
DELETE FROM http_perf_stats;
INSERT INTO TABLE
SELECT dataset, runner, site, machine,
DATE_FORMAT(run_dt, '%Y-%m-%d') AS day,
DATE_FORMAT(run_dt, '%k') AS hour,
COUNT(*) AS calls,
SUM(total_time),
SUM(namelookup_time),
SUM(connect_time),
SUM(pretransfer_time),
SUM(starttransfer_time),
SUM(size_download)
FROM http_perf_raw_log
GROUP BY runner, site, machine, day, hour
ORDER BY `day` DESC;
COMMIT;
Several ways to do this:
Create a MySQL event (see http://dev.mysql.com/doc/refman/5.1/en/create-event.html) (that's how I would do it)
Create a cron job (unix-flavoured systems) or window scheduler task
Do a "lazy" update. When somebody requests this list, run the code above if the last time it was ran was longer than x minutes/hours ago. That way it works more like a cache. Slow on the first request, fast after. But you won't slow the server down unless somebody is interested in this.
The view is just another SELECT query, but abstracted away to make it easier querying the resultset. If the underlying SELECT is slow, so is the view. Reading through and summing together 1 GB of data in four seconds doesn't sound slow at all to me.
精彩评论