Edited- MySQL. Large MyISAM table (40mln records) having index that is very slow and huge in size on disk_问答_开发者

The table contains about 40,000,000 records having:

CREATE TABLE `event` (
  `id` bigint(20) unsigned NOT NULL auto_increment,
  `some_other_id_not_fk` int(10) unsigned default NOT NULL,
  `event_time` datetime NOT NULL,
  `radius` float default NULL,
  `how_heavy` smallint(6) default NULL,
  PRIMARY KEY  (`id`),
  KEY `event_some_other_id_not_fk` (`some_other_id_not_fk`),
  KEY `event_event_time` (`event_time`)
) ENGINE=MyISAM AUTO_INCREMENT=6506226 DEFAULT CHARSET=utf8

You should know that some_other_id_not_fk column is not big, it contains distinctively only 7 different numbers. The real pain is the event_time datetime column, as it contain开发者_JAVA技巧s extremely large amounts of different datetime's, and basicly everything is allowed: duplicates as well as unpredictably large time intervals without records to 'cover' them. You should also know that (some_other_id_not_fk,event_time) pair must be allowed to have duplicates either :( I know this causes even more problems :(

I've had some experience in optimizing MySQL tables, but such a huge pain had never appeared on my horizon :/

The current state of 'the things' is:

The selects by event_time between date1 and date2 (which I need to do) are satisfactorily fast. :)
My inserts are slow, I mean really SLOW!!! more then a 30 secs, and even worse: LOAD DATA procedures that temporary DISABLE and ENABLE KEYS are EXTREMELY slow(several hours), mainly on ENABLE keys operation.
The size of the index on the disk is 7 times bigger then the size of the data

I would have tried several different combinations of re-indexing till now, but the size of that data really prevents me from experimenting on indexes and columns drop/create at will.

Please help anyone had managed this ? Should using timestamp instead of datetime solve my problem? Or maybe I should add additional columns for day, year,... etc and index on them ?

`id` bigint(20) unsigned NOT NULL auto_increment,

Do you really need a BIGINT? You can probably get away with an INT. If you were to insert 1,000 rows per second 24 hours a day, it would take 136 years for you to exhaust all values in an unsigned 32-bit integer.

This change will decrease your table size by 152.5 MB for 40 million rows, and will decrease the size of your primary key index by 158.8 MB for 40 million rows.

`some_other_id_not_fk` int(10) unsigned default NOT NULL,

You state this has only 7 distinct values. Does it need to be an INT type then? Could you use TINYINT instead? This will drastically reduce index size.

This will decrease the size of your table by 114.4 MB for 40 million rows, and will decrease the size of the some_other_id_not_fk index by approximately the same.

`event_time` datetime NOT NULL,

Do you need a DATETIME? DATETIME's take 8 bytes, a TIMESTAMP takes 4 bytes. If you can use a TIMESTAMP then this will drastically reduce data and index size. Be aware of the limitations of TIMESTAMP fields though such as Y2K38 and how they behave with respect to timezones and replication.

This change will decrease your table size by 152.5 MB for 40 million rows, and will decrease the size of your primary key index by 158.8 MB for 40 million rows.

These three changes will significantly reduce the size of your data as well as the indices.

Total Space Savings

Table: 152.5 + 152.5 + 114.4 = 419.4 MB
Index: 158.8 + 158.8 + ~115 = 432.6 MB

Total: 852MB

As others have suggested, you may not even need all the indices that you have defined. With such a low selectivity on some_other_id_not_fk there's a good chance the query optimizer won't even use that index and will instead opt for a full table scan. Dropping this index completely would result in a significant space savings for your indices.

If you could provide some sample queries, I can help you further.

Also, are you inserting into this table under a heavy read load? Keep in mind that SELECTs in MyISAM will block an INSERT.

Update

Most people are suggesting moving your some_other_id_not_fk field into the event_time index so the new index would be on (event_time, some_other_id_not_fk). I will recommend the same, but with an important caveat.

This index will be good for queries where you are filtering only on event_time, or if you filter on both event_time and some_other_id_not_fk. It will not be used for queries filtering only on some_other_id_not_fk - a full table scan will occur.

Moreover, if your queries are always filtering on both event_time and some_other_id_not_fk then do not use the index order of (event_time, some_other_id_not_fk). Rather, you should use the index (some_other_id_not_fk, event_time) instead.

Having the least selective (most duplicates) field first will allow for much greater compression for your index and thus a significantly reduced footprint on disk.

I think your intuition on what's heavy and what's not is backwards: an index with many repetitions of a few different options is much worse than an index with lots of distinct values and few repetitions of each.

My suggestion: drop the index on some_other_id_not_fk and keep (some_other_id_not_fk, event_time). This compound index should be 'almost unique', making insert overhead much lower. If possible, drop the event_time key also, unless you have queries that use that field without some_other_id_not_fk.

edit: you say that you have to select by time interval, then keep (event_time, some_other_id_not_fk) and drop both event_time and some_other_id_not_fk. if you have queries that use some_other_id_not_fk and not event_time, then keep both (event_time, some_other_id_not_fk) and (some_other_id_not_fk, event_time). the point is not have any index with few options. having an index with unused fields on the right is ok.

I think you don't need an index on some_other_id_not_fk (as you said there are just 7 different values, so selectivity of that index is 40,000,000/7 ) . All you need is 1 index on (event_time + [maybe] some_other_id_not_fk);

I had a similar situation arlier. I created a table with the same structure, let's call it the archive table. I copied the data from the active table to it every day at 3:00 and the deleted ALL the original.

The graphs and other statictics were selected from the archive table, the current events were logged to the active one.

Maybe it is not a best practice but worked sufficiently for me.

Partition table by time: Partitioning with Dates in MySQL 5.1 (Robin Schumacher)

http://dev.mysql.com/tech-resources/articles/mysql_5.1_partitioning_with_dates.html

I've dropped all indexes and made an index on (event_time, some_other_id_not_fk). I get the following performance indicators:

1Gb Data size on disk, 1.2Gb Index size on disk.
delete from event where event.event_time>STR_TO_DATE('20091201000000','%Y%m%d%H%i%s') and event.some_other_id_not_fk=4 | Affected rows: 353543 Time: 65.173 seconds
select * from event where event.event_time>STR_TO_DATE('20090401000000','%Y%m%d%H%i%s') and event.event_time<=STR_TO_DATE('20090401010000','%Y%m%d%H%i%s') and event.some_other_id_not_fk in (22,4,1,3) | 916 rows in set, Query time: 0.030 seconds
index enabled insertion of 350,000 new records using the following format: insert into event VALUES(...),(...),... | performed in about 30 seconds, Yeahaaaaaa :))
index disable - insertion - index enable - of 350,000 new records using the same format: insert into event VALUES(...),(...),... | performed in about 40 minutes. :) Looks like mysql default dump format, disabling index before inserts and re-enabling it after, is not always good for performance, especially when large size indexes are present:)

For now I am satisfied with this performance.

Last evening I've managed to create index only on (event_time). The size of the index was slightly lower than the first example. About 1.1Gb. The performance of same queries as listed above:

the delete | slightly faster, about 30 seconds
the select | slightly slower, about 0.1 seconds.
I've only tested index disable-enable insertiton of 350,000. It was very slow again | about 35 minutes.

I have rejected this state of the database, because I wasnt satisfied enough of the select speed, which is priority N1 for me.

hobodave, I am just curious, do you think that making the index on (some_other_id_not_fk,event_time) instead of (event_time,some_other_id_not_fk) will really change something dramatic towards better? My queries will ALWAYS filter on both fields. I shall NEVER have a query without filtering by some_other_id_not_fk. But I may have a queries that filter by IN(x,y,...) most of the distinct some_other_id_not_fk's. As I said, they are not many.

My priorities are:

select speed
insert speed
index size on disk (as the table will grow several times more)
... everything else

And I also wonder why such a huge index size is required 1.2Gb on 1Gb data? Index still greater then data. My logic suggests me that this kind of indexing of dates can be done in much smaller index? Am i correct? Is there something related to the index type which is probably BTREE?

Thank you. You're all great. I am closing the thread.