Why does UTF-8 use more than one byte to represent some characters?

开发者 https://www.devze.com 2023-03-29 08:05 Source: Internet

I recently went through an article on Character Encoding. I've a concern on a certain point mentioned there.

In the first figure, the author shows the characters, their code points in various character sets, and how they are encoded in various encoding formats. For example, the code point of é is E9. In ISO-8859-1 encoding it is represented as E9. In UTF-16 it is represented as 00 E9. But in UTF-8 it is represented using 2 bytes, C3 A9.

My question is: why is this required? The character fits in one byte, so why does UTF-8 use two? Can you please let me know?


In a multi-byte UTF-8 sequence, the 2 high bits of each continuation byte are fixed to 10 to mark it as a continuation, so only the low 6 bits carry actual character data; the lead byte's high bits encode the length of the sequence. That means any code point above 7F requires (at least) 2 bytes.
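A quick Python sketch (not part of the original answer) illustrates this bit layout for é (U+00E9): the lead byte starts with 110 (a 2-byte sequence), the continuation byte starts with 10, and the code point is reassembled from the remaining payload bits.

```python
# Encode 'é' (U+00E9) in UTF-8 and inspect the bit patterns.
encoded = "é".encode("utf-8")
print(encoded.hex())  # c3a9

lead, cont = encoded
# Lead byte 110xxxxx announces a 2-byte sequence.
assert lead >> 5 == 0b110
# Continuation byte 10xxxxxx carries only 6 data bits.
assert cont >> 6 == 0b10

# Reassemble the code point from the 5 + 6 payload bits.
code_point = ((lead & 0b00011111) << 6) | (cont & 0b00111111)
print(hex(code_point))  # 0xe9
```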


A single byte can hold one of only 256 different values.

This means that an encoding that represents each character as a single byte, such as ISO-8859-1, cannot encode more than 256 different characters. This is why you can't use ISO-8859-1 to correctly write Arabic, or Japanese, or many other languages. There is only a limited amount of space available, and it is already used up by other characters.

UTF-8, on the other hand, needs to be capable of representing every one of Unicode's million-plus code points. This makes it impossible to squeeze every single character into a single byte.

The designers of UTF-8 chose to make all of the ASCII characters (U+0000 to U+007F) representable with a single byte, and required all other characters to be stored as two or more bytes. If they had chosen to give more characters a single-byte representation, the encodings of other characters would have been longer and more complicated.
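This trade-off is easy to observe from Python (a demonstration I'm adding, not from the original answer): ASCII characters stay at one byte, while characters further up the code space need two, three, or four.

```python
# UTF-8 byte length grows with the code point; ASCII stays at 1 byte.
for ch in "Aé日😀":
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):06X} {ch!r} -> {len(encoded)} byte(s): {encoded.hex()}")
```

Running this shows 1 byte for A (U+0041), 2 for é (U+00E9), 3 for 日 (U+65E5), and 4 for 😀 (U+1F600).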

If you want a visual explanation of why bytes above 7F don't represent the corresponding 8859-1 characters, look at the UTF-8 coding unit table on Wikipedia. You will see that every byte value outside the ASCII range either already has a meaning, or is illegal for historical reasons. There just isn't room in the table for bytes to represent their 8859-1 equivalents, and giving the bytes additional meanings would break several important properties of UTF-8.
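To see concretely that a bare byte above 7F cannot mean its ISO-8859-1 character, here is a small check (my addition, assuming Python's standard codecs): the single byte E9 decodes as é under ISO-8859-1, but as UTF-8 it is a lead byte promising continuation bytes that never arrive, so decoding fails.

```python
# A lone 0xE9 byte is é in ISO-8859-1 but invalid as UTF-8:
# 0xE9 = 11101001 is a 3-byte lead byte with no continuation bytes.
raw = bytes([0xE9])
print(raw.decode("iso-8859-1"))  # é

try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    print("invalid UTF-8:", err.reason)
```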


Because a single-byte encoding is simply not enough to encode all the letters of all alphabets. One byte (00..FF) gives 2^8 = 256 possible values; two bytes (0000..FFFF) give 2^16 = 65,536.
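The arithmetic above can be checked directly (a verification I'm adding, not from the original answer), alongside the total size of Unicode's code space:

```python
# Size of the code space at each width.
print(2 ** 8)    # 256 values fit in one byte (00..FF)
print(2 ** 16)   # 65,536 values fit in two bytes (0000..FFFF)
print(0x110000)  # 1,114,112 code points in Unicode (U+0000..U+10FFFF)
```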

