I have a table with a dictionary of words in my language (Latvian).
CREATE TABLE words ( value varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL ) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
And let's say it has 3 words inside:
INSERT INTO words (value) VALUES ('tēja');
INSERT INTO words (value) VALUES ('vējš');
INSERT INTO words (value) VALUES ('feja');
What I want to do is find all words that are exactly 4 characters long, where the second character is 'ē' and the third character is 'j'.
It seems to me that the correct query would be:
SELECT * FROM words WHERE value LIKE '_ēj_';
But the problem with this query is that it returns not 2 entries ('tēja','vējš') but all three.
As I understand it, this is because MySQL internally converts strings to some ASCII representation?
Then there is the BINARY modifier that can be added to LIKE:
SELECT * FROM words WHERE value LIKE BINARY '_ēj_';
But this does not return the 2 entries ('tēja','vējš') either, only one ('tēja'). I believe this has something to do with UTF-8 using 2 bytes for non-ASCII characters?
So the question is:
What MySQL query would return exactly my two words ('tēja','vējš')? Thank you in advance.
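One way to check that hunch is to compare character and byte lengths. This is only a sketch against the words table above: CHAR_LENGTH() counts characters, LENGTH() counts bytes, and the column aliases are just illustrative names.

```sql
-- CHAR_LENGTH counts characters; LENGTH counts bytes in the stored encoding
SELECT value,
       CHAR_LENGTH(value) AS chars,
       LENGTH(value)      AS bytes
FROM words;
```

In UTF-8 all three words are 4 characters long, but 'feja' is 4 bytes, 'tēja' is 5 bytes and 'vējš' is 6 bytes. That would fit LIKE BINARY comparing byte by byte, so that a single _ cannot cover the two-byte 'š'.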
What MySQL query would return my exact two words ('tēja','vējš')?
SELECT * FROM words WHERE value LIKE '_ēj_' COLLATE utf8_bin;
The utf8_bin collation is not just diacritic-sensitive, but also case-sensitive. If you want to match only the letter-with-diacritic and you don't care about upper/lower case, you would have to find a utf8_..._ci collation that doesn't treat e and ē as the same letter.
I can't immediately see one (there are plenty that don't collate ē at all, which would be okay if you only need case-insensitive matching on the non-diacritical letters). Interesting that the Latvian collation treats macron-letters as the same as plain letters, which you don't want (it knows š is different from s).
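A quick way to test candidate collations is to compare the two letters directly. This is a rough sketch that assumes the connection character set is utf8 (for example after SET NAMES utf8); the column aliases are only illustrative names.

```sql
-- List the collations available to look through
SHOW COLLATION LIKE 'utf8%';

-- 1 means the collation treats the two letters as equal, 0 means it tells them apart
SELECT 'e' = 'ē' COLLATE utf8_unicode_ci AS unicode_ci,
       'e' = 'ē' COLLATE utf8_latvian_ci AS latvian_ci,
       'e' = 'ē' COLLATE utf8_bin        AS bin;
```

Any collation that returns 0 here should keep 'feja' out of the '_ēj_' match.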
Anyway, whatever collation you end up with, you will want to put your tables in that collation rather than manually specifying it in a query, so that comparisons can be properly indexed.
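As a sketch of that, assuming you settle on utf8_bin (swap in whichever collation you actually choose), the column could be changed once and the query then needs no COLLATE clause:

```sql
-- Change the column's collation once, instead of overriding it in every query
ALTER TABLE words
  MODIFY value VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL;

-- The plain LIKE now compares using the column's own collation
SELECT * FROM words WHERE value LIKE '_ēj_';
```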
You have to use a proper collation.
I don't know about Latvian, but here is an example for German, to give you an idea: http://dev.mysql.com/doc/refman/5.0/en/charset-collation-effect.html
You can try some of the Baltic collations.