开发者

which mysql utf8 collation to use for data coming from latin1

开发者 https://www.devze.com 2023-03-15 09:07 出处:网络
I have a lot of data in mysql in latin1 format, and I\'m trying to convert everything to utf8, but so far I didn\'t find the right collation to convert it correctly.

I have a lot of data in mysql in latin1 format, and I'm trying to convert everything to utf8, but so far I didn't find the right collation to convert it correctly. Some of the data are international names, with a lot of accent, and characters from va开发者_运维问答rious language and a primary key on these fields. I have a simple test case that goes like this:

CREATE TABLE utf8_test ( value varchar(30), PRIMARY KEY(value) ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_swedish_ci;
INSERT INTO utf8_test VALUES ('e');
INSERT INTO utf8_test VALUES ('é');
INSERT INTO utf8_test VALUES ('è');
INSERT INTO utf8_test VALUES ('ü');
INSERT INTO utf8_test VALUES ('u');
INSERT INTO utf8_test VALUES ('y');
INSERT INTO utf8_test VALUES ('ÿ');

It's a simple test to demonstrate the problem, but real data is not limited to this.

So far, only utf8_bin can accept everything without duplicate error, but I can't use it, because it's case-sensitive. Am I missing something ?

note: some of the tables have more than a dozen millions rows, so performance is a factor.


As far as I know, MySQL does not yet have case sensitive utf8 collations, so utf8_bin is about your only choice. utf8_bin is case sensitive, and treats all the diacritical marks as being separate. As Blair Conrad pointed out, it can make a bit of a mess of sorting. You may be able to solve that by using collate in your SQL statements, although if you have a lot of SQL statements, this approach could be problematic.

CREATE TABLE `utf8_test` (
  `value` varchar(30) COLLATE utf8_bin NOT NULL DEFAULT '',
  PRIMARY KEY (`value`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

INSERT INTO utf8_test VALUES ('e');
INSERT INTO utf8_test VALUES ('é');
INSERT INTO utf8_test VALUES ('è');
INSERT INTO utf8_test VALUES ('ü');
INSERT INTO utf8_test VALUES ('u');
INSERT INTO utf8_test VALUES ('y');
INSERT INTO utf8_test VALUES ('ÿ');

INSERT INTO utf8_test VALUES ('E');
INSERT INTO utf8_test VALUES ('É');
INSERT INTO utf8_test VALUES ('È');
INSERT INTO utf8_test VALUES ('Ü');
INSERT INTO utf8_test VALUES ('U');
INSERT INTO utf8_test VALUES ('Y');
INSERT INTO utf8_test VALUES ('Ÿ');

SELECT value FROM utf8_test WHERE value = 'E';
+-------+
| value |
+-------+
| E     |
+-------+

SELECT value FROM utf8_test WHERE value COLLATE utf8_unicode_ci = 'E';

+-------+
| value |
+-------+
| E     |
| e     |
| È     |
| É     |
| è     |
| é     |
+-------+

SELECT value FROM utf8_test ORDER BY value;

+-------+
| value |
+-------+
| E     |
| U     |
| Y     |
| e     |
| u     |
| y     |
| È     |
| É     |
| Ü     |
| è     |
| é     |
| ü     |
| ÿ     |
| Ÿ     |
+-------+

SELECT value FROM utf8_test ORDER BY value COLLATE utf8_unicode_ci;

+-------+
| value |
+-------+
| E     |
| é     |
| è     |
| É     |
| È     |
| e     |
| u     |
| Ü     |
| U     |
| ü     |
| y     |
| Y     |
| ÿ     |
| Ÿ     |
+-------+


The correct collation depends on more than the source character set (Latin-1). It depends on the language and locale that the data came from. If you're just worried about making sure that no characters are folded one into another, utf8_bin is probably the way to go.

You may experience unexpected sorting, though, where the characters are sorted according to the UTF-8 byte values, rather than according to any particular language or regional preferences.

In order to get a case-insensitive collation, you may want to try utf8_general_ci or utf8_unicode_ci, as described in UTF-8: General? Bin? Unicode?

Edit Further research suggests that there's currently no collation that you can use. See http://bugs.mysql.com/bug.php?id=19567.

How's this for an option? Assign utf8_bin to your column(s) but apply a different collation (one that will fold case and accents) on your queries? You won't lose any data, and your queries will be case-insensitive. If having accent-insensitive queries don't hurt you, it could be a compromise...

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号