Best way to have unique key over 500M varchar(255) records in mysql/innodb?

I have a url column with a unique key over it, but its performance on updates is absolutely atrocious. I suspect that's because the index doesn't fit entirely in memory.

So I was thinking: how about adding a column with md5(url) as 16 bytes of binary data and putting the unique key on that instead?

What would be the best datatype for that? I'd love to be able to just see the 32-character hex hash while MySQL converts it to/from 16 binary bytes and indexes those, since programs using the database might have trouble with arbitrary binary data, which I'd rather avoid if possible. (I'm also a bit afraid that MySQL might get some strange ideas about character sets and, for example, overallocate storage for it 3:1 because it thinks it might need utf8 -- how do I avoid that for sure?)

It seems some kind of solution would be binary(16) null for storage, unhex(md5(url)) for insertion/comparison, and hex(url_hash) for retrieval (not that it really needs retrieval; there will be an unindexed url column there anyway). Is this the best way?
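
Roughly what I have in mind, as a sketch (the table and index names here are made up for illustration):

ALTER TABLE urls
  ADD COLUMN url_hash binary(16) NULL,
  ADD UNIQUE KEY uk_url_hash (url_hash);

-- store the raw 16 digest bytes on insert
INSERT INTO urls (url, url_hash)
VALUES ('http://example.com/', UNHEX(MD5('http://example.com/')));

-- compare through the same expression
SELECT url FROM urls WHERE url_hash = UNHEX(MD5('http://example.com/'));

-- read the hash back as 32 hex characters if ever needed
SELECT url, HEX(url_hash) FROM urls LIMIT 1;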


MD5 is not guaranteed unique, so you cannot create a unique index on it unless your business model permits you to flat-out deny insertions and updates in case of a collision. Is this the case? I am asking because working around collisions (no matter how unlikely) will prove extremely complex from a performance standpoint.

In any case, I find it hard to believe (though it may turn out to be true) that a properly structured query, planned by MySQL to use the proper index (even over 500M rows), would suffer from atrocious performance -- but then again it is hard to tell without knowing what your query looks like and what your numbers are.

If I were you, before even considering a workaround (such as the MD5 approach) for the existing index lookup, I'd make absolutely sure of where my problem really lies:

  • use EXPLAIN to confirm that your UPDATE statement is indeed using the proper index
    • you cannot EXPLAIN an UPDATE statement, but you can EXPLAIN its equivalent SELECT statement (you essentially care about the WHERE clause, JOINs etc.) -- see the sketch after this list
    • even with 500M rows, a btree index lookup should only require a handful of page reads per matching row
      • how many rows do you expect each one of your UPDATE statements to update? how many rows are actually updated?
      • do you have additional conditions in your WHERE clause in addition to url=? The planner may pick the less selective index first and thrash your cache -- find out from the EXPLAIN plan
    • when you actually run (not EXPLAIN) them: is the UPDATE systematically slower than its corresponding SELECT? You may have a write bottleneck, possibly due to locking issues. How many sessions are active at the time of a slow UPDATE? How many indices defined on your table incorporate the url column?
    • have you analyzed your table recently?
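
For example, a hypothetical sketch (table, columns and values here are assumptions -- substitute your own):

-- the UPDATE you actually run:
-- UPDATE urls SET last_seen = NOW() WHERE url = 'http://example.com/';
-- its equivalent SELECT, which you can EXPLAIN:
EXPLAIN SELECT * FROM urls WHERE url = 'http://example.com/';
-- in the output, key should name the unique index on url and rows should
-- be close to 1; key = NULL or a large rows estimate points at a planning
-- problem rather than at raw index size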

So anyway, before proceeding, please let us know:

  • are you doing bulk UPDATEs? how many UPDATEs per second (or how many milliseconds per UPDATE) would meet your performance requirements?
  • how many sessions are active at the time of the UPDATE?
  • have you analyzed your table?
  • what would be a sample UPDATE query? (please provide specific values for its parameters)
  • what is the explain plan for the corresponding SELECT? (using the same specific values)
  • how long does the corresponding SELECT (using the same specific values) actually take to complete when executed (not EXPLAINed), and which row(s) did it actually return?
  • how long does the actual UPDATE (using the same specific values) take when executed? (not EXPLAINed)


I'm not familiar with MySQL specifically, but my guess is that the unique index is a clustered index (meaning the data pages are ordered by it). When you update, you cause a reorganization of the entire table.

If you can move the clustered index to some stable value, then that should solve your problem.
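
In MySQL/InnoDB terms, where the table is clustered on the primary key, a sketch of that idea (names are illustrative) would be a stable surrogate key plus a secondary unique index:

CREATE TABLE urls (
  id  bigint unsigned NOT NULL AUTO_INCREMENT,
  url varchar(255) NOT NULL,
  PRIMARY KEY (id),       -- stable clustered index
  UNIQUE KEY uk_url (url) -- uniqueness enforced by a secondary index
) ENGINE=InnoDB;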


If you are only using the index to guarantee uniqueness and not for retrieval, it's probably a win to use an MD5 in a binary(16) not null column. That way you could potentially have hundreds of keys in an index page, reducing the number of disk seeks for each insert.
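
Back-of-the-envelope (assuming 16 KB InnoDB pages and URLs near the 255-byte maximum): a binary(16) key plus per-record overhead comes to a few dozen bytes, so roughly 400 entries fit in a page, versus only around 60 for full-length varchar(255) keys.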

An alternative is to use compression in your table by creating it like this:

CREATE TABLE foo (url varchar(255), UNIQUE KEY (url)) ENGINE=InnoDB
ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=4;
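
Note (worth verifying for your InnoDB plugin version): as far as I recall, ROW_FORMAT=COMPRESSED requires innodb_file_per_table=1 and innodb_file_format=Barracuda; without them the table silently falls back to the uncompressed format.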

Since URLs should compress pretty well, it could be just as big a win as your hash idea, and it doesn't require any extra code on your part.

Here's the InnoDB reference on compression: http://www.innodb.com/doc/innodb_plugin-1.0/innodb-compression.html


The index is probably already using a hash internally, more efficiently than your hand-made MD5 solution would.
