I have converted a web server access_log into a MySQL table; it looks like this:
CREATE TABLE `access_log` (
`timestamp` int(11) NOT NULL default '0',
`visitorid` int(11) default NULL,
`url` int(11) default NULL,
`params` int(11) default NULL,
`status` smallint(3) NOT NULL default '0',
`bytes` int(20) NOT NULL default '0',
`referrer` int(11) default NULL,
`refparams` int(11) default NULL,
`useragentid` int(11) default NULL,
`keywords` int(11) default NULL,
`country` char(3) default '',
`crawl` int(1) NOT NULL default '0',
`sessionid` int(11) default NULL,
KEY `timestamp` (`timestamp`),
KEY `visitorid` (`visitorid`),
KEY `url` (`url`),
KEY `referrer` (`referrer`),
KEY `keywords` (`keywords`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 PACK_KEYS=1;
I have a query that produces a "Most Popular Pages" report for a certain date range; an example looks like this:
select url,
count(distinct visitorid) as visitors,
count(*) as hits
from access_log where
timestamp >=1270072800 and timestamp <=1272664799
and crawl=0
group by url order by visitors desc limit 100;
This query gets pretty slow when there are a lot of records in the table.
Depending on the timestamp range relative to the total number of records in the table, the optimizer says it will use either the 'timestamp' or the 'url' key, but it always reports 'Using where; Using temporary; Using filesort'.
Is there any way I could create a combined index that would improve the execution time of this query?
I've tried the following combinations, but the optimizer seems to ignore them:
- idx(timestamp,url,visitorid,crawl)
- idx(url,visitorid,crawl,timestamp)
Any suggestions or pointers as to what I'm missing would be appreciated.
Thanks!
So, you want to rank URLs by popularity in a given time period. A composite index on (url, visitorid) would give you popularity; a composite index on (timestamp, url) would give you the URLs visited in the period. Why not try both indexes and do a join against an inline view, something like this (not sure of the exact syntax for inline views in MySQL):
select Log1.url, Log2.DistinctVisitors
from (
    select distinct url
    from access_log
    where `timestamp` > x and `timestamp` < y
) as Log1
join (
    select url, count(distinct visitorid) as DistinctVisitors
    from access_log
    -- the date filter belongs in WHERE, not HAVING
    where `timestamp` > x and `timestamp` < y
    group by url
    -- having count(distinct visitorid) > {some cutoff value greater than 1}
    -- try the composite index (url, visitorid, timestamp)
) as Log2
on Log1.url = Log2.url
order by Log2.DistinctVisitors desc
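For reference, the two composite indexes suggested above could be added to the posted table like this (the index names are arbitrary, my own choice):

```sql
ALTER TABLE access_log
  ADD KEY idx_url_visitor (url, visitorid),
  ADD KEY idx_ts_url (`timestamp`, url);
```

Run EXPLAIN on the query afterwards to confirm the optimizer actually picks them up; if not, an index hint (FORCE INDEX) may be worth testing.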
Partition your access log into multiple tables, and only run this query over the tables within your date range.
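If you are on MySQL 5.1 or later, native RANGE partitioning can do this pruning for you instead of maintaining separate tables by hand. A sketch (the partition boundaries here are arbitrary monthly Unix-timestamp cutoffs, not values from your data):

```sql
-- Hypothetical partitioned variant; queries with a timestamp range
-- only scan the partitions that overlap the range (partition pruning).
CREATE TABLE access_log_part (
  `timestamp` int(11) NOT NULL default '0',
  `visitorid` int(11) default NULL,
  `url` int(11) default NULL,
  `crawl` int(1) NOT NULL default '0',
  KEY `timestamp` (`timestamp`),
  KEY `url` (`url`)
) ENGINE=MyISAM
PARTITION BY RANGE (`timestamp`) (
  PARTITION p201003 VALUES LESS THAN (1270072800),
  PARTITION p201004 VALUES LESS THAN (1272664800),
  PARTITION pmax    VALUES LESS THAN MAXVALUE
);
```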
Make summary tables with the data pre-aggregated on daily/weekly/monthly bases to cut down on the amount of data that has to be crunched to produce a report. So after the day's log files are imported, aggregate the data by dividing timestamp down to hour boundaries, then down to day boundaries etc.
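A minimal sketch of such a summary table and its daily refresh (table and column names are my own; the date literals mark one day's boundary and would come from your import job):

```sql
-- Hypothetical daily rollup, populated after each day's import.
CREATE TABLE access_log_daily (
  `day` date NOT NULL,
  `url` int(11) NOT NULL,
  `visitors` int(11) NOT NULL,
  `hits` int(11) NOT NULL,
  PRIMARY KEY (`day`, `url`)
) ENGINE=MyISAM;

INSERT INTO access_log_daily (`day`, url, visitors, hits)
SELECT DATE(FROM_UNIXTIME(`timestamp`)) AS `day`,
       url,
       COUNT(DISTINCT visitorid) AS visitors,
       COUNT(*) AS hits
FROM access_log
WHERE crawl = 0
  AND `timestamp` >= UNIX_TIMESTAMP('2010-04-01')
  AND `timestamp` <  UNIX_TIMESTAMP('2010-04-02')
GROUP BY `day`, url;
```

One caveat: hits can simply be summed across days in the report, but distinct-visitor counts cannot (the same visitor on two days would be counted twice), so a multi-day "visitors" figure from this rollup is an upper bound unless you re-aggregate from the raw log.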