I have a MySQL Table that holds Google Analytics data:
CREATE TABLE IF NOT EXISTS `analytics_data` (
`ga_profile_id` int(11) NOT NULL,
`page` varchar(200) NOT NULL,
`source` varchar(150) NOT NULL,
`medium` varchar(50) NOT NULL,
`keyword` varchar(200) NOT NULL,
`bounces` int(11) NOT NULL,
`entrances` int(11) NOT NULL,
`exits` int(11) NOT NULL,
`new_visits` int(11) NOT NULL,
`page_views` int(11) NOT NULL,
`unique_page_views` int(11) NOT NULL,
`time_on_page` int(11) NOT NULL,
`visits` int(11) NOT NULL,
`date` date NOT NULL,
KEY `ga_profile_id` (`ga_profile_id`,`source`,`medium`,`date`),
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
I have a query to compute the sum of visitors based on a google analytics profile ID (ga_profile_id) over a given time period:
SELECT
SUM( `visits` ), ( UNIX_TIMESTAMP( `date` ) - 21600 ) * 1000 AS date
FROM `analytics_data`
WHERE
`date` >= '20开发者_JAVA技巧11-05-09' AND `date` <= '2011-06-08' AND `ga_profile_id` = [...]
GROUP BY `date`
We have 4.5 million records or so.
Index Data:
Type: BTREE
Fields/Cardinality:
ga_profile_id / 100
source / 10196
medium / 10196
date / 149893
EXPLAIN SELECT
- id: 1 - select_type: SIMPLE - table: analytics_data - type: ref - possible_keys: ga_profile_id - key: ga_profile_id - ref: const - rows: 219555 - extra: Using where; Using temporary; Using filesortAverage time for execution: 1 second.
We are on a virtual private server and most queries get executed in .0003 - 0.03 seconds. LONG queries (that I was going to optimize at some point) are generally .3 seconds.
I have tried adjusting the keys, ignoring some, changing some values and nothing seems to be affecting it in a positive way. Considering this is 1 of many queries on a page.
I am looking at changing MyISAM to memory -- any ideas are welcomed.
You need to create composite index ga_profile_id + date
in this particular order. And you'll get the best you could get with such query.
Further possible optimization is to pre-calculate sum of visits per date and use that for fast calculations.
I have a query to compute the sum of visitors based on a google analytics profile ID (ga_profile_id) over a given time period
It seems pretty optimized already... In your question at the time of writing this answer, you've stripped out the most interesting part of your query (the actual clause on ga_profile_id
), which is the most selective in all likelihood -- hence the current index usage.
At the very best, you'd manage to leverage an index on date
if you place it in a multicolumn index, e.g. (date, ga_profile_id)
or the other way around depending on your usage pattern and table statistics.
See indexes dos and donts.
Running indexes will be the first and easiest option but if that doesn't help I would suggest to look more into some fundamental DB management strategies like Table Partitioning.
@Kerry, look at Denis's solution... The only alternate to his offer is to have the index on PROFILE ID FIRST, THEN Date, otherwise, your index will be incorporated for anyone else too having action within the same time period...
In addition, @Bohemian's point of grouping down to the second is a strong point... you probably want to sort based on the DATE ONLY portion of a full date/time column result.
If you have typical date ranges in your query then you may consider to partitionate your table horizontally. Maybe it also helps when most of your data is "outdated" and you only have the "fresh" ones you need on one or more partitions and all these old ones on another. RANGE Partitioning
精彩评论