I have converted a web server access_log into a MySQL table; it looks like this:
CREATE TABLE `access_log` (
`timestamp` int(11) NOT NULL default '0',
`visitorid` int(11) default NULL,
`url` int(11) default NULL,
`params` int(11) default NULL,
`status` smallint(3) NOT NULL default '0',
`bytes` int(20) NOT NULL default '0',
`referrer` int(11) default NULL,
`refparams` int(11) default NULL,
`useragentid` int(11) default NULL,
`keywords` int(11) default NULL,
`country` char(3) default '',
`crawl` int(1) NOT NULL default '0',
`sessionid` int(11) default NULL,
KEY `timestamp` (`timestamp`),
KEY `visitorid` (`visitorid`),
KEY `url` (`url`),
KEY `referrer` (`referrer`),
KEY `keywords` (`keywords`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 PACK_KEYS=1;
I have a query that produces a "Most Popular Pages" report for a certain date range; an example looks like this:
select url,
count(distinct visitorid) as visitors,
count(*) as hits
from access_log where
timestamp >=1270072800 and timestamp <=1272664799
and crawl=0
group by url order by visitors desc limit 100;
This query gets pretty slow when there are a lot of records in the table.
Depending on the timestamp range relative to the total number of records in the table, the optimizer says it will use either the 'timestamp' or the 'url' key, but it always reports 'Using where; Using temporary; Using filesort'.
Is there any way I could create a combined index that would improve the execution time of this query?
I've tried the following combinations, but the optimizer seems to ignore them:
- idx(timestamp,url,visitorid,crawl)
- idx(url,visitorid,crawl,timestamp)
Any suggestions or pointers as to what I'm missing would be appreciated.
Thanks!
So, you want to rank URLs by popularity in a given time period. A composite index on (url, visitorid) would give you popularity; a composite index on (timestamp, url) would give you the URLs visited in the period. Why not try both indexes and do a join against an inline view, something like this (not sure of the exact syntax for inline views in MySQL):
select Log1.url, Log2.DistinctVisitors
from (
    select distinct url
    from access_log
    where `timestamp` > x and `timestamp` < y
) as Log1
join (
    select url, count(distinct visitorid) as DistinctVisitors
    from access_log
    -- the date filter belongs in WHERE, not HAVING
    where `timestamp` > x and `timestamp` < y
    group by url
    -- having count(distinct visitorid) > {some cutoff value greater than 1}
    -- try the composite index (url, visitorid, timestamp)
) as Log2
on Log1.url = Log2.url
order by Log2.DistinctVisitors desc
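For reference, the two composite indexes suggested above could be added to the posted table like this (the index names are arbitrary, my own choice):

```sql
ALTER TABLE access_log
  ADD KEY idx_url_visitor (url, visitorid),
  ADD KEY idx_ts_url (`timestamp`, url);
```

Run EXPLAIN on the query afterwards to confirm the optimizer actually picks them up; if not, an index hint (FORCE INDEX) may be worth testing.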
Partition your access log into multiple tables, and only run this query over the tables within your date range.
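If you are on MySQL 5.1 or later, native RANGE partitioning can do this pruning for you instead of maintaining separate tables by hand. A sketch (the partition boundaries here are arbitrary monthly Unix-timestamp cutoffs, not values from your data):

```sql
-- Hypothetical partitioned variant; queries with a timestamp range
-- only scan the partitions that overlap the range (partition pruning).
CREATE TABLE access_log_part (
  `timestamp` int(11) NOT NULL default '0',
  `visitorid` int(11) default NULL,
  `url` int(11) default NULL,
  `crawl` int(1) NOT NULL default '0',
  KEY `timestamp` (`timestamp`),
  KEY `url` (`url`)
) ENGINE=MyISAM
PARTITION BY RANGE (`timestamp`) (
  PARTITION p201003 VALUES LESS THAN (1270072800),
  PARTITION p201004 VALUES LESS THAN (1272664800),
  PARTITION pmax    VALUES LESS THAN MAXVALUE
);
```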
Make summary tables with the data pre-aggregated on daily/weekly/monthly bases to cut down on the amount of data that has to be crunched to produce a report. So after the day's log files are imported, aggregate the data by dividing timestamp down to hour boundaries, then down to day boundaries etc.
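A minimal sketch of such a summary table and its daily refresh (table and column names are my own; the date literals mark one day's boundary and would come from your import job):

```sql
-- Hypothetical daily rollup, populated after each day's import.
CREATE TABLE access_log_daily (
  `day` date NOT NULL,
  `url` int(11) NOT NULL,
  `visitors` int(11) NOT NULL,
  `hits` int(11) NOT NULL,
  PRIMARY KEY (`day`, `url`)
) ENGINE=MyISAM;

INSERT INTO access_log_daily (`day`, url, visitors, hits)
SELECT DATE(FROM_UNIXTIME(`timestamp`)) AS `day`,
       url,
       COUNT(DISTINCT visitorid) AS visitors,
       COUNT(*) AS hits
FROM access_log
WHERE crawl = 0
  AND `timestamp` >= UNIX_TIMESTAMP('2010-04-01')
  AND `timestamp` <  UNIX_TIMESTAMP('2010-04-02')
GROUP BY `day`, url;
```

One caveat: hits can simply be summed across days in the report, but distinct-visitor counts cannot (the same visitor on two days would be counted twice), so a multi-day "visitors" figure from this rollup is an upper bound unless you re-aggregate from the raw log.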