MySQL collation to store multilingual data of unknown language

Developer (https://www.devze.com), 2023-01-27 16:28. Source: Internet

I am new to multilingual data, and I confess that I have never tried it before. Currently I am working on a multilingual site, but I do not know which language will be used.

Which collation/character set of MySQL should I use to achieve this?

Should I use some Unicode type of character set?

And of course these will not be exotic languages; they will be among the ones in common use.

You should use a Unicode collation. You can set it as the default for your system, or on each field of your tables. The following Unicode collations are available, and these are their differences:

utf8_general_ci is a very simple collation. It just removes all accents, then converts to upper case, and uses the code of the resulting "base letter" to compare.

utf8_unicode_ci uses the default Unicode collation element table.

The main differences are:

utf8_unicode_ci supports so-called expansions and ligatures. For example, the German letter ß (U+00DF LATIN SMALL LETTER SHARP S) is sorted near "ss", and the letter Œ (U+0152 LATIN CAPITAL LIGATURE OE) is sorted near "OE".

utf8_general_ci does not support expansions/ligatures; it sorts all these letters as single characters, and sometimes in the wrong order.

utf8_unicode_ci is generally more accurate for all scripts. For example, in the Cyrillic block, utf8_unicode_ci is fine for all of these languages: Russian, Bulgarian, Belarusian, Macedonian, Serbian, and Ukrainian. utf8_general_ci, by contrast, is fine only for the Russian and Bulgarian subset of Cyrillic; the extra letters used in Belarusian, Macedonian, Serbian, and Ukrainian are not sorted well.

The disadvantage of utf8_unicode_ci is that it is a little slower than utf8_general_ci.

So, unless you know exactly which languages/characters you are going to use, I recommend utf8_unicode_ci, which has broader coverage.

(Extracted from the MySQL forums.)
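To make the difference between the two approaches in the answer above more concrete, here is a small Python sketch. It is not MySQL's implementation (MySQL uses its own weight tables); the function names and the tiny EXPANSIONS table are made up for illustration. It only mimics the two ideas described: comparing by accent-stripped, uppercased "base letters", versus first expanding characters such as ß and Œ.

```python
import unicodedata

# Illustrative expansion table (an assumption for this sketch, not MySQL's data).
EXPANSIONS = {"ß": "ss", "Œ": "OE", "œ": "oe"}


def base_letter_key(text):
    """Strip accents, then uppercase, without expanding any character."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Uppercase per character; leave a character unchanged if uppercasing
    # would turn it into several characters (e.g. "ß" -> "SS").
    return "".join(ch.upper() if len(ch.upper()) == 1 else ch for ch in stripped)


def expansion_key(text):
    """Like base_letter_key, but expand ligatures and ß first."""
    expanded = "".join(EXPANSIONS.get(ch, ch) for ch in text)
    return base_letter_key(expanded)


print(base_letter_key("café") == base_letter_key("CAFE"))       # accents and case ignored
print(base_letter_key("Straße") == base_letter_key("Strasse"))  # ß stays a single letter
print(expansion_key("Straße") == expansion_key("Strasse"))      # ß treated as "ss"
```

With the simple key, "Straße" and "Strasse" are different strings; with the expansion-aware key they compare as equal, which is the kind of behaviour the answer attributes to utf8_unicode_ci.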

UTF-8 encompasses most languages, so it is your safest bet. However, there are exceptions, and you need to make sure all the languages you want to cover work in UTF-8. My experience with storing characters in encodings MySQL doesn't understand is that it will not be able to sort them properly, but the data has remained intact as long as I read it out in the same character encoding I wrote it in.

UTF-8 is the character encoding, a way of storing a number. Which character is represented by which number is defined by Unicode, and that is an important distinction. Unicode covers a large number of languages and UTF-8 can encode them all (0 to 10FFFF, sort of), but a single Java char cannot hold all of them, since the VM's internal representation uses 16-bit units and characters above U+FFFF need two of them (not that you care about Java :).
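The distinction between the Unicode number and its encoding can be seen in a few lines of Python. This is only an illustration; the helper name describe is made up. For one character, it reports the code point, how many bytes UTF-8 needs, and how many 16-bit units UTF-16 needs.

```python
def describe(ch):
    """Return (code point, UTF-8 byte count, UTF-16 code unit count) for one character."""
    code_point = ord(ch)                              # the Unicode number
    utf8_bytes = len(ch.encode("utf-8"))              # 1 to 4 bytes
    utf16_units = len(ch.encode("utf-16-le")) // 2    # 1 or 2 units of 16 bits
    return code_point, utf8_bytes, utf16_units


for ch in ["A", "\u00df", "\u20ac", "\U0001F600"]:
    code_point, utf8_bytes, utf16_units = describe(ch)
    print(f"U+{code_point:04X}: {utf8_bytes} UTF-8 byte(s), {utf16_units} UTF-16 unit(s)")
```

The last character (U+1F600) needs four bytes in UTF-8 and two 16-bit units in UTF-16. This matters for MySQL as well: its utf8 character set stores at most three bytes per character, so such characters need the utf8mb4 character set (with a collation such as utf8mb4_unicode_ci) where the server version supports it.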

You can insert text in any language into a MySQL table by changing the collation of the table field to 'utf8_general_ci'. It is case-insensitive.
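As a rough sketch of what changing the collation looks like in practice, the Python helper below just builds the MySQL statements as strings. The helper name, the VARCHAR(255) column type, and the example database/table/column names are assumptions for illustration; the identifiers are not escaped, so do not feed it untrusted input, and adjust the column definition to match your schema.

```python
def collation_statements(database, table, column, collation="utf8_unicode_ci"):
    """Build ALTER statements that set a collation at database, table, and column level."""
    # MySQL collation names start with the character set name, e.g. "utf8" or "utf8mb4".
    charset = collation.split("_", 1)[0]
    return [
        # Default for new tables in the database.
        f"ALTER DATABASE `{database}` CHARACTER SET {charset} COLLATE {collation};",
        # Convert an existing table and its character columns.
        f"ALTER TABLE `{table}` CONVERT TO CHARACTER SET {charset} COLLATE {collation};",
        # Set it on a single column (VARCHAR(255) is an assumed column type).
        f"ALTER TABLE `{table}` MODIFY `{column}` VARCHAR(255) "
        f"CHARACTER SET {charset} COLLATE {collation};",
    ]


# Hypothetical names, for illustration only.
for statement in collation_statements("mysite", "articles", "title"):
    print(statement)
```

Passing collation="utf8_general_ci" produces the statements for the collation named in this answer; "utf8_unicode_ci" is used as the default here because it is the one recommended earlier in the thread.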