Question on unicode translation

Developer https://www.devze.com 2023-02-10 06:13 Source: Internet

How are variable-length UTF-8 encoded bytes decoded/translated to Unicode characters?

Each byte with a value above 127 (binary 01111111, hex 7F) is part of a multibyte character.

So, if the first bit is 0, done - single-byte (ASCII) character. If it is 1, the byte belongs to a multibyte sequence: a byte starting with 10 is a continuation byte, while a lead byte starts with 110, 1110 or 11110, and the number of leading 1 bits tells you how many bytes are in this character (technically up to 6-byte characters would be possible, but UTF-8 is only defined for 1-4 byte characters).
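
To make the bit test above concrete, here is a minimal Python sketch that classifies a single byte by its leading bits. The function name `classify_utf8_byte` and the labels it returns are made up for illustration. It only looks at the bit pattern, so it does not reject lead bytes that UTF-8 never uses (0xC0, 0xC1, 0xF5 and above).

```python
def classify_utf8_byte(b: int) -> str:
    """Classify one byte (0-255) by its leading bits (illustrative helper)."""
    if (b & 0b1000_0000) == 0:
        return "single"        # 0xxxxxxx: a complete one-byte character
    if (b & 0b1100_0000) == 0b1000_0000:
        return "continuation"  # 10xxxxxx: second or later byte of a sequence
    if (b & 0b1110_0000) == 0b1100_0000:
        return "lead-2"        # 110xxxxx: first byte of a 2-byte sequence
    if (b & 0b1111_0000) == 0b1110_0000:
        return "lead-3"        # 1110xxxx: first byte of a 3-byte sequence
    if (b & 0b1111_1000) == 0b1111_0000:
        return "lead-4"        # 11110xxx: first byte of a 4-byte sequence
    return "invalid"           # 11111xxx: not used by UTF-8


# Show the bit pattern and class of every byte in a short sample string.
for byte in "aé€".encode("utf-8"):
    print(f"{byte:08b} {classify_utf8_byte(byte)}")
```

The sample string encodes to six bytes: one for "a", two for "é" and three for "€", so the loop prints one single byte, then a 2-byte lead with one continuation byte, then a 3-byte lead with two continuation bytes.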

For the history and a more detailed explanation, see this article by Our Fearless Leader ;) The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), or this Wikipedia article on UTF-8 (it has more technical details on valid/invalid byte combinations).

I think it's here

UTF-8 is an encoding of Unicode, so there is no translation between character sets. If you mean "How do I see non-ASCII characters on screen when I'm displaying a Unicode string", you need to ensure you have a Unicode-capable font installed and in use.

My company is using this font.

It is as @Piskvor describes.

The algorithms for encoding/decoding UTF-8 are described in RFC 3629.

The following table, which maps Unicode code point ranges (written as 32-bit hexadecimal values) to UTF-8 byte sequences, comes from that document:

Char. number range  |        UTF-8 octet sequence
   (hexadecimal)    |              (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
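
The table can be turned into code fairly directly. Below is a minimal Python sketch of both directions. The function names `encode_code_point` and `decode_utf8` are invented for this example, and this is not how Python's own codec is implemented. The x bits in each row are filled from the code point's bits (most significant first) when encoding and collected back when decoding. The decoder also rejects overlong forms, surrogates and values above 10FFFF, which is my reading of the restrictions in RFC 3629.

```python
from typing import List


def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point using the table's four rows."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:          # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:         # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),   # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def decode_utf8(data: bytes) -> List[int]:
    """Decode UTF-8 bytes into a list of code points, raising on bad input."""
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        # The lead byte gives the sequence length and its payload bits;
        # `minimum` is the smallest code point allowed for that length.
        if lead < 0x80:
            length, cp, minimum = 1, lead, 0x0
        elif (lead & 0xE0) == 0xC0:
            length, cp, minimum = 2, lead & 0x1F, 0x80
        elif (lead & 0xF0) == 0xE0:
            length, cp, minimum = 3, lead & 0x0F, 0x800
        elif (lead & 0xF8) == 0xF0:
            length, cp, minimum = 4, lead & 0x07, 0x10000
        else:
            raise ValueError(f"invalid lead byte {lead:#04x} at offset {i}")
        tail = data[i + 1:i + length]
        if len(tail) != length - 1:
            raise ValueError(f"truncated sequence at offset {i}")
        for b in tail:
            if (b & 0xC0) != 0x80:
                raise ValueError(f"expected continuation byte after offset {i}")
            cp = (cp << 6) | (b & 0x3F)   # append 6 payload bits
        if cp < minimum or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"overlong or out-of-range sequence at offset {i}")
        code_points.append(cp)
        i += length
    return code_points


print(encode_code_point(0x20AC).hex())  # e282ac (the euro sign, U+20AC)
print(decode_utf8(b"\xe2\x82\xac"))     # [8364], i.e. 0x20AC
```

In real code you would just call `bytes.decode("utf-8")` and `str.encode("utf-8")`. The sketch is only meant to show where each bit in the table ends up.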