
How to get Unicode code points from UTF-8 character strings in C or C++ (Linux)

开发者 https://www.devze.com 2023-02-19 10:48 Source: Internet

I am working on an application in which I need to know the Unicode code points of characters in order to classify them as Chinese characters, Japanese characters (Kanji, Katakana, Hiragana), Latin, Greek, etc.


The given string is UTF-8 encoded.

Is there any way to get the Unicode code point of a UTF-8 character? For example:

  1. The character '≠' has the code point U+2260.
  2. The character '建' has the code point U+5EFA.


UTF-8 is a variable-width encoding of Unicode. Each Unicode code point is encoded as one to four bytes (char).

To decode a char* string and extract a single code point, read the first byte. If its most significant bit is clear, the byte itself is the code point. Otherwise the code point is encoded in multiple bytes: the number of consecutive 1 bits, counted from the most significant bit of the first byte, tells you how many bytes the sequence uses (two to four), and every following byte has the form 10xxxxxx.

This table shows how to make the conversion:

UTF-8 (char*)                       | Unicode (21 bits)
------------------------------------+--------------------------
0xxxxxxx                            | 00000000000000000xxxxxxx
------------------------------------+--------------------------
110yyyyy 10xxxxxx                   | 0000000000000yyyyyxxxxxx
------------------------------------+--------------------------
1110zzzz 10yyyyyy 10xxxxxx          | 00000000zzzzyyyyyyxxxxxx 
------------------------------------+--------------------------
11110www 10zzzzzz 10yyyyyy 10xxxxxx | 000wwwzzzzzzyyyyyyxxxxxx

Based on that, the code is relatively straightforward to write. If you don't want to write it yourself, you can use a library that does the conversion for you. Many are available on Linux: libiconv, ICU, GLib, ...
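
The table above maps fairly directly onto a small decoder. What follows is only a minimal sketch in C, not a hardened implementation: `utf8_decode` is a made-up helper name, and the choice to reject overlong forms, surrogates and values above U+10FFFF is an assumption about what you want from malformed input.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one code point from the first bytes of s
 * (len bytes are available). On success the code point is stored in *cp
 * and the number of bytes consumed (1 to 4) is returned. On malformed or
 * truncated input, 0 is returned and *cp is left untouched. */
static size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t n;        /* total length of the sequence in bytes */
    uint32_t value;  /* payload bits collected so far */
    uint32_t min;    /* smallest code point that needs n bytes */

    if (len == 0)
        return 0;

    if (s[0] < 0x80)                { n = 1; value = s[0];        min = 0; }
    else if ((s[0] & 0xE0) == 0xC0) { n = 2; value = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { n = 3; value = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { n = 4; value = s[0] & 0x07; min = 0x10000; }
    else
        return 0;    /* stray continuation byte or invalid lead byte */

    if (len < n)
        return 0;    /* sequence is cut off */

    for (size_t i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                      /* not of the form 10xxxxxx */
        value = (value << 6) | (s[i] & 0x3F);
    }

    if (value < min)                       /* overlong encoding */
        return 0;
    if (value > 0x10FFFF)                  /* beyond the Unicode range */
        return 0;
    if (value >= 0xD800 && value <= 0xDFFF) /* UTF-16 surrogate */
        return 0;

    *cp = value;
    return n;
}
```

To walk a whole string, call it in a loop and advance the pointer by the returned length each time. Once you have the code point, classification is a matter of comparing it against Unicode block ranges, for example Greek and Coptic (U+0370 to U+03FF), Hiragana (U+3040 to U+309F), Katakana (U+30A0 to U+30FF) and CJK Unified Ideographs (U+4E00 to U+9FFF).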


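libiconv can help you convert the UTF-8 string to UTF-16 or UTF-32. UTF-32 would be the safest option if you really want to support every possible Unicode code point, because each code point is then exactly one 32-bit unit, whereas UTF-16 needs surrogate pairs for anything above U+FFFF.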
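
As a rough sketch of the iconv route: the snippet below assumes the `iconv` interface from glibc (or libiconv, in which case you may need to link with `-liconv`) and that the encoding names "UTF-8", "UTF-32LE" and "UTF-32BE" are available on your system. `utf8_to_utf32` is a hypothetical helper name. The explicit LE/BE target is chosen so that no byte order mark is written and the output can be read as native `uint32_t` values.

```c
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: convert a NUL-terminated UTF-8 string into an array
 * of code points. Returns the number of code points written to out, or
 * (size_t)-1 if the conversion fails (invalid input, or out is too small). */
static size_t utf8_to_utf32(const char *in, uint32_t *out, size_t out_cap)
{
    /* Pick the UTF-32 variant that matches the host byte order. */
    const uint16_t probe = 1;
    const char *to = (*(const unsigned char *)&probe == 1) ? "UTF-32LE"
                                                           : "UTF-32BE";

    iconv_t cd = iconv_open(to, "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *src = (char *)in;              /* iconv takes non-const pointers */
    size_t src_left = strlen(in);
    char *dst = (char *)out;
    size_t dst_left = out_cap * sizeof out[0];

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;

    /* Whatever was consumed from the output buffer is the result. */
    return out_cap - dst_left / sizeof out[0];
}
```

Each element of the output array is then one code point, so the two characters from the question should come out as 0x2260 and 0x5EFA.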
