
UTF-8 Encoding size

devze.com (https://www.devze.com), 2023-02-08 20:28. Source: web.

What Unicode characters fit in 1, 2, 4 bytes? Can someone point me to a complete character chart?


Characters are encoded according to the range their code point falls in. The algorithm is described on the Wikipedia page for UTF-8, and it can be implemented very quickly.

  • U+0000 to U+007F are encoded with 1 byte
  • U+0080 to U+07FF are encoded with 2 bytes
  • U+0800 to U+FFFF are encoded with 3 bytes
  • U+010000 to U+10FFFF are encoded with 4 bytes
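
As a rough illustration of the ranges above, here is a minimal hand-rolled encoder sketch in Python. The function name `utf8_encode` is made up for this example, and the surrogate check (U+D800 to U+DFFF) is an addition that the list above does not mention; in real code you would simply call `str.encode("utf-8")`.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode a single Unicode code point as UTF-8 (illustrative sketch)."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        # Assumption beyond the list above: surrogates are not encodable.
        raise ValueError("surrogate code points cannot be encoded")
    if code_point <= 0x7F:
        # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])
```

A quick sanity check is to compare the result with Python's built-in encoder, for example `utf8_encode(0x20AC) == "€".encode("utf-8")`.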


The Wikipedia article on UTF-8 has a good enough description of the encoding:

  • 1 byte = code points 0x000000 to 0x00007F (inclusive)
  • 2 bytes = code points 0x000080 to 0x0007FF
  • 3 bytes = code points 0x000800 to 0x00FFFF
  • 4 bytes = code points 0x010000 to 0x10FFFF
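
If you want to confirm these boundaries yourself, one option is to ask Python's built-in encoder for the length at each edge of a range. This is only a small check, and the helper name `utf8_length` is invented for the example.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))

# First and last code point of each range listed above.
boundaries = [0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]

for cp in boundaries:
    print(f"U+{cp:06X}: {utf8_length(chr(cp))} byte(s)")
```

Each pair of boundaries should report the same length, stepping from 1 up to 4 bytes.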

The charts can be downloaded directly from unicode.org. It's a set of about 150 PDF files, because a single chart would be huge (maybe 30 MiB).

Also be aware that Unicode (compared to something like ASCII) is much more complex to process. There are things like right-to-left text, byte order marks, code points that can be combined ("composed") to create a single character, different ways of representing the exact same string (and a process to convert strings into a canonical form suitable for comparison), a lot more white-space characters, and so on. I'd recommend downloading the entire Unicode specification and reading most of it if you're planning to do more than "not much".
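
To make the point about equivalent representations concrete, here is a small Python sketch using the standard `unicodedata` module. The example character (é) is just an illustrative choice: it can be written as one precomposed code point or as a base letter plus a combining accent, and the two forms only compare equal after normalization.

```python
import unicodedata

composed = "\u00e9"      # é as a single precomposed code point
decomposed = "e\u0301"   # e followed by a combining acute accent

# The strings look the same when rendered but differ code point by code point.
print(composed == decomposed)                                 # False

# Converting both to a canonical form (here NFC) makes them comparable.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True

# They also take a different number of bytes in UTF-8.
print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))  # 2 3
```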


UTF-8 as originally designed uses 1 to 6 bytes per character, although all currently assigned code points are covered with just 4 bytes. The first byte determines how long (in bytes) the character's sequence is; see the Wikipedia article on UTF-8 for details.
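
The first-byte rule can be sketched in a few lines of Python. The function name `sequence_length` is hypothetical, and this version only accepts the 1 to 4 byte forms used by current UTF-8; it checks the lead byte's bit pattern and nothing else, so it is not a full validator.

```python
def sequence_length(lead: int) -> int:
    """Return the UTF-8 sequence length implied by a lead byte (sketch)."""
    if lead & 0x80 == 0x00:   # 0xxxxxxx: single byte, same as ASCII
        return 1
    if lead & 0xE0 == 0xC0:   # 110xxxxx: start of a 2-byte sequence
        return 2
    if lead & 0xF0 == 0xE0:   # 1110xxxx: start of a 3-byte sequence
        return 3
    if lead & 0xF8 == 0xF0:   # 11110xxx: start of a 4-byte sequence
        return 4
    # 10xxxxxx is a continuation byte; longer forms are rejected here.
    raise ValueError(f"0x{lead:02X} is not a valid lead byte in this sketch")

data = "A\u00e9\u20ac".encode("utf-8")   # A (1 byte), é (2 bytes), € (3 bytes)
print(sequence_length(data[0]), sequence_length(data[1]), sequence_length(data[3]))
```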

Single-byte UTF-8 is effectively ASCII. UTF-8 was designed to be compatible with it, which is one reason it's more prevalent than UTF-16, for example.


Edit: Apparently, it was agreed (in RFC 3629) that UTF-8 would be restricted to code points up to U+10FFFF, which need at most 21 bits and 4-byte sequences, although the original scheme has the technical capability to handle up to 31 bits (6-byte UTF-8).
