I'm really confused about UTF in Unicode.
there is UTF-8, UTF-16 and UTF-32.
my question is :
what UTF that are support all Unicode blocks ?
What is the best UTF(performance, size, etc), and why ?
What is different between these three UTF ?
what is endianness and byte order marks (BOM) ?
Thanks
what UTF that are support all Unicode blocks ?
All UTF encodings support all Unicode blocks - there is no UTF encoding that can't represent any Unicode codepoint. However, some non-UTF, older encodings, such as UCS-2 (which is like UTF-16, but lacks surrogate pairs, and thus lacks the ability to encode codepoints above 65535/U+FFFF), may not.
What is the best UTF(performance, size, etc), and why ?
For textual data that is mostly English and/or just ASCII, UTF-8 is by far the most space-efficient. However, UTF-8 is sometimes less space-efficient than UTF-16 and UTF-32 where most of the codepoints used are high (such as large bodies of CJK text).
What is different between these three UTF ?
UTF-8 encodes each Unicode codepoint from one to four bytes. The Unicode values 0 to 127, which are the same as they are in ASCII, are encoded like they are in ASCII. Bytes with values 128 to 255 are used for multi-byte codepoints.
UTF-16 encodes each Unicode codepoint in either two bytes (one UTF-16 value) or four bytes (two UTF-16 values). Anything in the Basic Multilingual Plane (Unicode codepoints 0 to 65535, or U+0000 to U+FFFF) are encoded with one UTF-16 value. Codepoints from higher plains use two UTF-16 values, through a technique called 'surrogate pairs'.
UTF-32 is not a variable-length encoding for Unicode; all Unicode codepoint values are encoded as-is. This means that U+10FFFF
is encoded as 0x0010FFFF
.
what is endianness and byte order marks (BOM) ?
Endianness is how a piece of data, particular CPU architecture or protocol orders values of multi-byte data types. Little-endian systems (such as x86-32 and x86-64 CPUs) put the least-significant byte first, and big-endian systems (such as ARM, PowerPC and many networking protocols) put the most-significant byte first.
In a little-endian encoding or system, the 32-bit value 0x12345678
is stored or transmitted as 0x78 0x56 0x34 0x12
. In a big-endian encoding or system, it is stored or transmitted as 0x12 0x34 0x56 0x78
.
A byte order mark is used in UTF-16 and UTF-32 to signal which endianness the text is to be interpreted as. Unicode does this in a clever way -- U+FEFF is a valid codepoint, used for the byte order mark, while U+FFFE is not. Therefore, if a file starts with 0xFF 0xFE
, it can be assumed that the rest of the file is stored in a little-endian byte ordering.
A byte order mark in UTF-8 is technically possible, but is meaningless in the context of endianness for obvious reasons. However, a stream that begins with the UTF-8 encoded BOM almost certainly implies that it is UTF-8, and thus can be used for identification because of this.
Benefits of UTF-8
- ASCII is a subset of the UTF-8 encoding and therefore is a great way to introduce ASCII text into a 'Unicode world' without having to do data conversion
- UTF-8 text is the most compact format for ASCII text
- Valid UTF-8 can be sorted on byte values and result in sorted codepoints
Benefits of UTF-16
- UTF-16 is easier than UTF-8 to decode, even though it is a variable-length encoding
- UTF-16 is more space-efficient than UTF-8 for characters in the BMP, but outside ASCII
Benefits of UTF-32
- UTF-32 is not variable-length, so it requires no special logic to decode
“Answer me these questions four, as all were answered long before.”
You really should have asked one question, not four. But here are the answers.
All UTF transforms by definition support all Unicode code points. That is something you needn’t worry about. The only problem is that some systems are really UCS-2 yet claim they are UTF-16, and UCS-2 is severely broken in several fundamental ways:
- UCS-2 is not a valid Unicode encoding.
- UCS-2 supports only ¹⁄₁₇ᵗʰ of Unicode. That is, Plane 0 only, not Planes 1–16.
- UCS-2 permits code points that The Unicode Standard guarantees will never be in a valid Unicode stream. These include
- all 2,048 UTF-16 surrogates, code points U+D800 through U+DFFF
- the 32 non-character code points between U+FDD0 and U+FDEF
- both sentinels at U+FFEF and U+FFFF
For what encoding is used internally by seven different programming languages, see slide 7 on Feature Support Summary in my OSCON talk from last week entitled “Unicode Support Shootout”. It varies a great deal.
UTF-8 is the best serialization transform of a stream of logical Unicode code points because, in no particular order:
- UTF-8 is the de facto standard Unicode encoding on the web.
- UTF-8 can be stored in a null-terminated string.
- UTF-8 is free of the vexing BOM issue.
- UTF-8 risks no confusion of UCS-2 vs UTF-16.
- UTF-8 compacts mainly-ASCII text quite efficiently, so that even Asian texts that are in XML or HTML often wind up being smaller in bytes than UTF-16. This is an important thing to know, because it is a counterintuitive and surprising result. The ASCII markup tags often make up for the extra byte. If you are really worried about storage, you should be using proper text compression, like LZW and related algorithms. Just bzip it.
- If need be, it can be roped into use for trans-Unicodian points of arbitrarily large magnitude. For example, MAXINT on a 64-bit machine becomes 13 bytes using the original UTF-8 algorithm. This property is of rare usefulness, though, and must be used with great caution lest it be mistaken for a legitimate UTF-8 stream.
I use UTF-8 whenever I can get away with it.
I have already given properties of UTF-8, so here are some for the other two:
- UTF-32 enjoys a singular advantage for internal storage: O(1) access to code point N. That is, constant time access when you need random access. Remember we lived forever with O(N) access in C’s
strlen
function, so I am not sure how important this is. My impression is that we almost always process our strings in sequential not random order, in which case this ceases to be a concern. Yes, it takes more memory, but only marginally so in the long run. - UTF-16 is a terrible format, having all the disadvantages of UTF-8 and UTF-32 but none of the advantages of either. It is grudgingly true that when properly handled, UTF-16 can certainly be made to work, but doing so takes real effort, and your language may not be there to help you. Indeed, your language is probably going to work against you instead. I’ve worked with UTF-16 enough to know what a royal pain it is. I would stay clear of both these, especially UTF-16, if you possibly have any choice in the matter. The language support is almost never there, because there are massive pods of hysterical porpoises all contending for attention. Even when proper code-point instead of code-unit access mechanisms exist, these are usually awkward to use and lengthy to type, and they are not the default. This leads too easily to bugs that you may not catch until deployment; trust me on this one, because I’ve been there.
That’s why I’ve come to talk about there being a UTF-16 Curse. The only thing worse than The UTF-16 Curse is The UCS-2 Curse.
- UTF-32 enjoys a singular advantage for internal storage: O(1) access to code point N. That is, constant time access when you need random access. Remember we lived forever with O(N) access in C’s
Endianness and the whole BOM thing are problems that curse both UTF-16 and UTF-32 alike. If you use UTF-8, you will not ever have to worry about these.
I sure do hope that you are using logical (that is, abstract) code points internally with all your APIs, and worrying about serialization only for external interchange alone. Anything that makes you get at code units instead of code points is far far more hassle than it’s worth, no matter whether those code units are 8 bits wide or 16 bits wide. You want a code-point interface, not a code-unit interface. Now that your API uses code points instead of code units, the actual underlying representation no longer matters. It is important that this be hidden.
Category Errors
Let me add that everyone talking about ASCII versus Unicode is making a category error. Unicode is very much NOT “like ASCII but with more characters.” That might describe ISO 10646, but it does not describe Unicode. Unicode is not merely a particular repertoire but rules for handling them. Not just more characters, but rather more characters that have particular rules accompanying them. Unicode characters without Unicode rules are no longer Unicode characters.
If you use an ASCII mindset to handle Unicode text, you will get all kinds of brokenness, again and again. It doesn’t work. As just one example of this, it is because of this misunderstanding that the Python pattern-matching library, re
, does the wrong thing completely when matching case-insensitively. It blindly assumes two code points count as the same if both have the same lowercase. That is an ASCII mindset, which is why it fails. You just cannot treat Unicode that way, because if you do you break the rules and it is no longer Unicode. It’s just a mess.
For example, Unicode defines U+03C3 GREEK SMALL LETTER SIGMA
and U+03C2 GREEK SMALL LETTER FINAL SIGMA
as case-insensitive versions of each other. (This is called Unicode casefolding.) But since they don’t change when blindly mapped to lowercase and compared, that comparison fails. You just can’t do it that way. You can’t fix it in the general case by switching the lowercase comparison to an uppercase one, either. Using casemapping when you need to use casefolding belies a shakey understanding of the whole works.
(And that’s nothing: Python 2 is broken even worse. I recommend against using Python 2 for Unicode; use Python 3 if you want to do Unicode in Python. For Pythonistas, the solution I recommend for Python’s innumerably many Unicode regex issues is Matthew Barnett’s marvelous regex
library for Python 2 and Python 3. It is really quite neat, and it actually gets Unicode casefolding right — amongst many other Unicode things that the standard re
gets miserably wrong.)
REMEMBER: Unicode is not just more characters: Unicode is rules for handling more characters. One either learns to work with Unicode, or else one works against it, and if one works against it, then it works against you.
All of them support all Unicode code points.
They have different performance characteristics - for example, UTF-8 is more compact for ASCII characters, whereas UTF-32 makes it easier to deal with the whole of Unicode including values outside the Basic Multilingual Plane (i.e. above U+FFFF). Due to its variable width per character, UTF-8 strings are hard to use to get to a particular character index in the binary encoding - you have scan through. The same is true for UTF-16 unless you know that there are no non-BMP characters.
It's probably easiest to look at the wikipedia articles for UTF-8, UTF-16 and UTF-32
Endianness determines (for UTF-16 and UTF-32) whether the most significant byte comes first and the least significant byte comes last, or vice versa. For example, if you want to represent U+1234 in UTF-16, that can either be { 0x12, 0x34 } or { 0x34, 0x12 }. A byte order mark indicates which endianess you're dealing with. UTF-8 doesn't have different endiannesses, but seeing a UTF-8 BOM at the start of a file is a good indicator that it is UTF-8.
Some good questions here and already a couple good answers. I might be able to add something useful.
As said before, all three cover the full set of possible codepoints, U+0000 to U+10FFFF.
Depends on the text, but here are some details that might be of interest. UTF-8 uses 1 to 4 bytes per char; UTF-16 uses 2 or 4; UTF-32 always uses 4. A useful thing to note is this. If you use UTF-8 then then English text will be encoded with the vast majority of characters in one byte each, but Chinese needs 3 bytes each. Using UTF-16, English and Chinese will both require 2. So basically UTF-8 is a win for English; UTF-16 is a win for Chinese.
The main difference is mentioned in the answer to #2 above, or as Jon Skeet says, see the Wikipedia articles.
Endianness: For UTF-16 and UTF-32 this refers to the order in which the bytes appear; for example in UTF-16, the character U+1234 can be encoded either as 12 34 (big endian), or 34 12 (little endian). The BOM, or byte order mark is interesting. Let's say you have a file encoded in UTF-16, but you don't know whether it is big or little endian, but you notice the first two bytes of the file are FE FF. If this were big-endian the character would be U+FEFF; if little endian, it would signify U+FFFE. But here's the thing: In Unicode the codepoint FFFE is permanently unassigned: there is no character there! Therefore we can tell the encoding must be big-endian. The FEFF character is harmless here; it is the ZERO-WIDTH NO BREAK SPACE (invisible, basically). Similarly if the file began with FF FE we know it is little endian.
Not sure if I added anything to the other answers, but I have found the English vs. Chinese concrete analysis useful in explaining this to others in the past.
One way of looking at it is as size over complexity. Generally they increase in the number of bytes they need to encode text, but decrease in the complexity of decoding the scheme they use to represent characters. Therefore, UTF-8 is usually small but can be complex to decode, whereas UTF-32 takes up more bytes but is easy to decode (but is rarely used, UTF-16 being more common).
With this in mind UTF-8 is often chosen for network transmission, as it has smaller size. Whereas UTF-16 is chosen where easier decoding is more important than storage size.
BOMs are intended as information at the beginning of files which describes which encoding has been used. This information is often missing though.
Joel Spolsky wrote a nice introductory article about Unicode:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
精彩评论