开发者

mysql view with group by - performance problem

开发者 https://www.devze.com 2023-02-28 15:58 出处:网络
I have a table which collects data for web pages performance. There are multiple machines, testing multiple sites in 10 minutes intervals, so currently I have about 700 000 rows (920 MB) with +/- 50 0

I have a table which collects data for web pages performance. There are multiple machines, testing multiple sites in 10 minutes intervals, so currently I have about 700 000 rows (920 MB) with +/- 50 000 new rows daily.

Table source:

SET SQL_MODE="NO_AUTO_VALUE_ON_ZERO";

CREATE TABLE `http_perf_raw_log` (
  `run_dt` int(11) DEFAULT NULL,
  `dataset` varchar(64) DEFAULT NULL,
  `runner` varchar(64) DEFAULT NULL,
  `site` varchar(128) DEFAULT NULL,
  `machine` varchar(32) DEFAULT NULL,
  `called_url` varchar(1024) DEFAULT NULL,
  `method` varchar(8) DEFAULT NULL,
  `url` varchar(1024) DEFAULT NULL,
  `content_type` varchar(64) DEFAULT NULL,
  `http_code` int(11) DEFAULT NULL,
  `header_size` int(11) DEFAULT NULL,
  `request_size` int(11) DEFAULT NULL,
  `filetime` int(11) DEFAULT NULL,
  `ssl_verify_result` int(11) DEFAULT NULL,
  `redirect_count` int(11) DEFAULT NULL,
  `total_time` decimal(6,4) DEFAULT NULL,
  `namelookup_time` decimal(6,4) DEFAULT NULL,
  `connect_time` decimal(6,4) DEFAULT NULL,
  `pretransfer_time` decimal(6,4) DEFAULT NULL,
  `starttransfer_time` decimal(6,4) DEFAULT NULL,
  `redirect_time` decimal(6,4) DEFAULT NULL,
  `size_upload` int(11) DEFAULT NULL,
  `size_download` int(11) DEFAULT NULL,
  `speed_download` int(11) DEFAULT NULL,
  `speed_upload` int(11) DEFAULT NULL,
  `download_content_length` int(11) DEFAULT NULL,
  `upload_content_length` int(11) DEFAULT NULL,
  `certinfo` varchar(1024) DEFAULT NULL,
  `request_header` varchar(1024) DEFAULT NULL,
  `return_content` varchar(4096) DEFAULT NULL,
  `return_headers` varchar(2048) DEFAULT NULL,
  KEY `run_dt_idx` (`run_dt`),
  KEY `dataset_idx` (`dataset`),
  KEY `runner_idx` (`runner`),
  KEY `site_idx` (`site`),
  KEY `machine_idx` (`machine`),
  KEY `total_time_idx` (`total_time`)
) ENGINE=MyISA开发者_开发百科M DEFAULT CHARSET=latin1;

For aggregating stats (with 1 hour resolution), I created a view:

CREATE OR REPLACE VIEW http_perf_stats (dataset, runner, site, machine, day, hour, calls, total_time, namelookup_time, connect_time, pretransfer_time, starttransfer_time, size_download) AS 
SELECT dataset,  runner, site, machine,
 DATE_FORMAT(run_dt, '%Y-%m-%d') AS day,
 DATE_FORMAT(run_dt, '%k') AS hour,
 COUNT(*) AS calls,
 SUM(total_time),
 SUM(namelookup_time), 
 SUM(connect_time),
 SUM(pretransfer_time), 
 SUM(starttransfer_time), 
 SUM(size_download)
FROM http_perf_raw_log GROUP BY runner, site, machine, day, hour ORDER BY `day` DESC

But the performance of VIEW (and underlying SELECT) is terrible - takes about 4 seconds.

So, my questions:

1. Is using GROUP BY in a VIEW good idea at all? And if not, what is better alternative?

2. Is there ( I imagine yes, I am not SQL expert :/) a way to optimize this SELECT (changing query or structure of http_perf_raw_log)?


Remove the GROUP BY from the VIEW and use it in the SELECT that calls the VIEW.


In this case it might be a good idea to only create statistics periodically (once per hour for example).

I'd do that as follows. Run the following code once to create a table structure.

CREATE TABLE http_perf_stats AS 
SELECT dataset,  runner, site, machine,
 DATE_FORMAT(run_dt, '%Y-%m-%d') AS day,
 DATE_FORMAT(run_dt, '%k') AS hour,
 COUNT(*) AS calls,
 SUM(total_time),
 SUM(namelookup_time), 
 SUM(connect_time),
 SUM(pretransfer_time), 
 SUM(starttransfer_time), 
 SUM(size_download)
FROM http_perf_raw_log 
GROUP BY runner, site, machine, day, hour 
ORDER BY `day` DESC

Make some modifications like changing field types, default values, adding a primary key, and perhaps add some indexes so that you can access and query this table in a fast way.

From then on, update the table like this:

START TRANSACTION;

    DELETE FROM http_perf_stats;

    INSERT INTO TABLE 
      SELECT dataset,  runner, site, machine,
        DATE_FORMAT(run_dt, '%Y-%m-%d') AS day,
        DATE_FORMAT(run_dt, '%k') AS hour,
        COUNT(*) AS calls,
        SUM(total_time),
        SUM(namelookup_time), 
        SUM(connect_time),
        SUM(pretransfer_time), 
        SUM(starttransfer_time), 
        SUM(size_download)
      FROM http_perf_raw_log 
      GROUP BY runner, site, machine, day, hour 
      ORDER BY `day` DESC;

COMMIT;

Several ways to do this:

  • Create a MySQL event (see http://dev.mysql.com/doc/refman/5.1/en/create-event.html) (that's how I would do it)

  • Create a cron job (unix-flavoured systems) or window scheduler task

  • Do a "lazy" update. When somebody requests this list, run the code above if the last time it was ran was longer than x minutes/hours ago. That way it works more like a cache. Slow on the first request, fast after. But you won't slow the server down unless somebody is interested in this.


The view is just another SELECT query, but abstracted away to make it easier querying the resultset. If the underlying SELECT is slow, so is the view. Reading through and summing together 1 GB of data in four seconds doesn't sound slow at all to me.

0

精彩评论

暂无评论...
验证码 换一张
取 消