So I have a large table with a bit more than 2 billion records, and 5 multi-column keys.
There are two methods I can use for inserting data:
Method 1
load data infile ...;
Method 2
alter table disable keys;
load data infile ...;
alter table enable keys;
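For concreteness, here is a minimal sketch of what method 2 looks like end to end. The table name big_table and the file path are hypothetical; note that disable keys only suspends updating of the non-unique indexes on MyISAM tables.

alter table big_table disable keys;

load data infile '/tmp/new_rows.csv'
into table big_table
fields terminated by ','
lines terminated by '\n';

alter table big_table enable keys;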
If I'm starting from an empty table, for 2 billion records, method 1 takes about 60 hours (estimated, may be more), while method 2 takes 12 hours to insert the data, and 3 hours recreating the keys. So far so good.
However, if I already have my 2 billion records and attempt to insert an additional 5 million, method 1 takes about 3 hours, while method 2 takes 30 minutes inserting the data and a whopping 7 hours recreating the keys. I confirmed that during the entire key regeneration it used Repair by sorting, so it's not like it fell back to Repair with keycache.
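(For anyone wanting to reproduce that check: while enable keys runs, the repair method shows up in the State column of show processlist. The variables below are real MyISAM knobs that influence whether the sort-based repair can be used, but the values are only illustrative assumptions, not tuned recommendations.)

show processlist;  -- State should read 'Repair by sorting' during the rebuild

set session myisam_sort_buffer_size = 256 * 1024 * 1024;          -- illustrative value
set global myisam_max_sort_file_size = 100 * 1024 * 1024 * 1024;  -- illustrative value; too small a limit forces 'Repair with keycache'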
I wonder why this is. MySQL claims that disabling keys is very good for bulk inserts, but this is obviously dependent on the context. If it regenerates all the keys from scratch, why doesn't it take around 3 hours, as it did when I started with an empty table? Or, if it inserts keys one by one, why doesn't it take around 3 hours, which is what method 1 took?
Comments are welcome.
The time taken can vary quite a bit apparently.
http://www.mysqlperformanceblog.com/2007/07/05/working-with-large-data-sets-in-mysql/
If you're working with billions of records and using MySQL 5.1 or above, you might find that partitioning benefits performance: indexes on a partitioned table are also partitioned, and because each index is built against only a partition/subset of your total data, the sorting overhead of rebuilding should be significantly less.
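By way of illustration, a hedged sketch of a range-partitioned MyISAM table on MySQL 5.1+; the table, columns, and partition boundaries are made up, so adapt them to your own schema:

create table big_table (
    id bigint not null,
    created date not null,
    payload varchar(255),
    key idx_created_id (created, id)
) engine=myisam
partition by range (year(created)) (
    partition p2008 values less than (2009),
    partition p2009 values less than (2010),
    partition pmax  values less than maxvalue
);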
"not as fast as promised" - uh, you have 5000000 records, of course it will take a bit longer than inserting 20 records.
- With the first method, it is changing the indexes a little bit on every row insert, so they are always consistent with the data.
- With the second method, it is rebuilding the indexes by sorting the whole table (2,005,000,000 rows) - which means it's moving a large amount of existing index data to and fro (disk speed is likely to emerge as a bottleneck here), and that depends on 1) the amount of existing data, and 2) the amount of new data.
- You could use method 3: drop the keys before the second insert (this could take some time, too) and recreate them afterwards; a rough sketch follows below. I suspect the time will be similar to recreating the keys after the initial insert.
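A rough sketch of method 3, with hypothetical index and column names standing in for your five multi-column keys:

alter table big_table
    drop index idx_a,
    drop index idx_b;             -- repeat for all five keys

load data infile ...;

alter table big_table
    add index idx_a (col1, col2),
    add index idx_b (col3, col4); -- recreate all five keys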
The speeds you are describing are quite reasonable IMHO - just use the fastest method.