I recently went through an article on character encoding, and I have a question about a certain point mentioned there.
In the first figure, the author shows several characters, their code points in various character sets, and how they are encoded in various encoding formats.
For example, the code point of é is E9. In ISO-8859-1 it is encoded as the single byte E9. In UTF-16 it is encoded as the two bytes 00 E9. But in UTF-8 it is encoded as two different bytes, C3 A9.
My question is: why is this necessary? The character fits in a single byte, so why does UTF-8 use two?
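For reference, all of those values can be reproduced in Python 3 (latin-1 is Python's name for ISO-8859-1, and utf-16-be is UTF-16 without a byte-order mark):

    s = "é"                        # U+00E9, LATIN SMALL LETTER E WITH ACUTE

    print(hex(ord(s)))             # 0xe9         -> the Unicode code point
    print(s.encode("latin-1"))     # b'\xe9'      -> one byte in ISO-8859-1
    print(s.encode("utf-16-be"))   # b'\x00\xe9'  -> two bytes in UTF-16
    print(s.encode("utf-8"))       # b'\xc3\xa9'  -> two bytes in UTF-8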
UTF-8 uses the high bits of each byte to signal whether more bytes follow: in a continuation byte, the two high bits (bit 6 and bit 7) are fixed at 10, so only the low 6 bits are available for actual character data. That means that any character above 7F requires (at least) 2 bytes.
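To make that concrete, here is a minimal Python 3 sketch that unpacks the two UTF-8 bytes of é and shows which bits are markers and which are payload:

    # How é (U+00E9) is packed into the UTF-8 bytes C3 A9:
    lead, cont = "é".encode("utf-8")           # 0xC3, 0xA9

    print(f"{lead:08b}")   # 11000011 -> '110' marks a 2-byte sequence; payload is 00011
    print(f"{cont:08b}")   # 10101001 -> '10' marks a continuation byte; payload is 101001

    # Reassembling the payload bits recovers the code point:
    code_point = ((lead & 0b00011111) << 6) | (cont & 0b00111111)
    print(hex(code_point))                     # 0xe9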
A single byte can hold one of only 256 different values.
This means that an encoding that represents each character as a single byte, such as ISO-8859-1, cannot encode more than 256 different characters. This is why you can't use ISO-8859-1 to correctly write Arabic, or Japanese, or many other languages. There is only a limited amount of space available, and it is already used up by other characters.
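A quick Python 3 illustration of that limit:

    # é fits in ISO-8859-1's 256-value table, but Japanese text does not:
    print("é".encode("iso-8859-1"))    # b'\xe9'

    try:
        "日本語".encode("iso-8859-1")
    except UnicodeEncodeError as err:
        print(err)                     # 'latin-1' codec can't encode characters ...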
UTF-8, on the other hand, needs to be capable of representing all of Unicode's more than a million code points. This makes it impossible to squeeze every character into a single byte.
The designers of UTF-8 chose to make all of the ASCII characters (U+0000 to U+007F) representable with a single byte, and required all other characters to be stored as two or more bytes. If they had chosen to give more characters a single-byte representation, the encodings of other characters would have been longer and more complicated.
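A small Python 3 check of the consequence of that choice: pure-ASCII text has byte-for-byte identical ASCII and UTF-8 encodings, while characters outside that range grow to two or more bytes.

    # ASCII characters (U+0000..U+007F) keep their one-byte encodings in UTF-8:
    text = "Hello, world!"
    print(text.encode("ascii") == text.encode("utf-8"))  # True

    # Characters outside that range pay the multi-byte price instead:
    print(len("é".encode("utf-8")))                      # 2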
If you want a visual explanation of why bytes above 7F don't represent the corresponding 8859-1 characters, look at the UTF-8 code unit table on Wikipedia. You will see that every byte value outside the ASCII range either already has a meaning or is illegal for historical reasons. There just isn't room in the table for bytes to represent their 8859-1 equivalents, and giving those bytes additional meanings would break several important properties of UTF-8.
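You can verify one of those properties directly: a lone byte above 7F means something in ISO-8859-1, but is rejected outright as UTF-8. A quick Python 3 check:

    # E9 is é in ISO-8859-1 ...
    print(bytes([0xE9]).decode("iso-8859-1"))  # é

    # ... but by itself it is not a legal UTF-8 sequence:
    try:
        bytes([0xE9]).decode("utf-8")
    except UnicodeDecodeError as err:
        print(err)  # 'utf-8' codec can't decode byte 0xe9 in position 0: ...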
Because for many languages a one-byte encoding is simply not enough to encode all the letters of all alphabets. Count the possibilities: 1 byte (8 bits) covers 00..FF, which is 2^8 = 256 characters; 2 bytes (16 bits) cover 0000..FFFF, which is 2^16 = 65536 characters.