How are variable-length UTF-8 encoded bytes decoded/translated to Unicode characters?
Each byte with a value above 127 (binary 01111111, 7F hex) is part of a multi-byte character.
So, if the first bit is 0, you're done: it's a single-byte character. If not, the leading bits tell you what kind of byte it is: a lead byte starts with as many 1 bits as there are bytes in the character (110xxxxx for two bytes, 1110xxxx for three, 11110xxx for four), while every continuation byte starts with 10. Technically up to 6-byte characters would be possible, but UTF-8 is only defined for 1-4 byte characters.
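To make that concrete, here is a minimal decoding sketch in Python (the decode_utf8 function and the sample string are my own illustration; validation of continuation bytes and overlong sequences is omitted, and in real code you would simply call bytes.decode("utf-8")):

def decode_utf8(data: bytes) -> str:
    chars = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                # 0xxxxxxx: single-byte (ASCII) character
            code, extra = b, 0
        elif b >> 5 == 0b110:       # 110xxxxx: lead byte of a 2-byte character
            code, extra = b & 0x1F, 1
        elif b >> 4 == 0b1110:      # 1110xxxx: lead byte of a 3-byte character
            code, extra = b & 0x0F, 2
        elif b >> 3 == 0b11110:     # 11110xxx: lead byte of a 4-byte character
            code, extra = b & 0x07, 3
        else:
            raise ValueError("invalid lead byte: %#x" % b)
        for _ in range(extra):      # continuation bytes (10xxxxxx) carry 6 payload bits each
            i += 1
            code = (code << 6) | (data[i] & 0x3F)
        chars.append(chr(code))
        i += 1
    return "".join(chars)

print(decode_utf8("héllo €".encode("utf-8")))   # héllo €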
For the history and a more detailed explanation, see this article by Our Fearless Leader ;) - The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) - or the Wikipedia article on UTF-8 (which has more technical details on valid/invalid byte combinations).
UTF-8 is an encoding of Unicode, so there is no translation as such. If you mean "How do I see non-ASCII characters on screen when I'm displaying a Unicode string?", you need to ensure you have a Unicode-capable font installed and in use.
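A quick Python example of that point (the byte string here is my own illustration): decoding UTF-8 bytes gives you a Unicode string directly, and whether a glyph then appears on screen is down to the font:

data = b"caf\xc3\xa9"        # 0xC3 0xA9 is the UTF-8 encoding of U+00E9 (é)
text = data.decode("utf-8")  # no translation table involved, just decoding
print(text)                  # café
print(hex(ord(text[-1])))    # 0xe9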
It is as @Piskvor describes.
The algorithm for encoding/decoding UTF-8 is described in RFC 3629.
The following table, mapping (32-bit) Unicode code point ranges to UTF-8 byte sequences, comes from that document:
Char. number range | UTF-8 octet sequence
(hexadecimal) | (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
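As a quick sanity check of the table (Python used here only for illustration), the euro sign U+20AC falls in the 0800-FFFF range, so it should encode as 1110xxxx 10xxxxxx 10xxxxxx:

encoded = "\u20ac".encode("utf-8")              # euro sign, U+20AC
print(encoded.hex())                            # e282ac
print(" ".join(f"{b:08b}" for b in encoded))    # 11100010 10000010 10101100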