Inconsistency in Unicode with wchar_t vs. ICU in C++

Developer https://www.devze.com 2023-01-31 23:00 Source: the web
Support for wchar_t is inconsistent across compilers, but is it safe to assume that its implementation and size are the same with GNU/GCC, at least on Linux?

Although the size of wchar_t depends on the system architecture in terms of bit size (32-bit/64-bit), is the wide character type on Linux (GNU/GCC) actually compiler-dependent or libstdc++-dependent? In other words, when I change or upgrade one of them, for which one should I expect that wchar_t might no longer work as expected in terms of size and support?

IBM ICU is another option. Can it be used in conjunction with std::string?

Should I totally dismiss wchar_t in favor of ICU?

Note: On Unix-like operating systems such as Linux with GNU/GCC, libstdc++ provides the core C++ library functionality for the compiler and is therefore updated from time to time.


If you want to present strings to the user, you may have to take wchar_t (or some other library-defined type) into consideration. Different compilers and platforms define wchar_t differently because they settled on different Unicode encodings. With Visual C++ on Windows, wchar_t is a 16-bit type that holds UTF-16 code units. With GCC on Linux, wchar_t is a 32-bit type that holds UTF-32 code points.
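To make the size difference concrete, here is a minimal sketch that reports how wide wchar_t is on the platform it is compiled for. The helper names (wchar_bits, wchar_holds_any_code_point, likely_wide_encoding) are made up for illustration, and the "UTF-16 vs. UTF-32" guess is only the usual convention, not something the language guarantees.

```cpp
#include <climits>   // CHAR_BIT
#include <cwchar>    // WCHAR_MAX
#include <string>

// Width of wchar_t in bits on the platform this is compiled for.
constexpr unsigned wchar_bits() {
    return static_cast<unsigned>(sizeof(wchar_t) * CHAR_BIT);
}

// True when a single wchar_t can hold every Unicode code point (up to U+10FFFF).
constexpr bool wchar_holds_any_code_point() {
    return static_cast<unsigned long long>(WCHAR_MAX) >= 0x10FFFFull;
}

// Hypothetical helper: names the encoding a wchar_t of this width usually carries.
inline std::string likely_wide_encoding() {
    return wchar_holds_any_code_point() ? "UTF-32" : "UTF-16";
}

// Code that relies on 32-bit wide characters could fail early at compile time with:
//   static_assert(wchar_holds_any_code_point(), "this code assumes a UTF-32 wchar_t");
```

On a typical GCC/Linux build one would expect 32 bits here, and 16 bits with Visual C++, but it is safer to check than to assume.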

The IBM ICU library has conversion functions for transforming from one encoding to another. Your platform (Win32 for instance) might also have functions for transforming from one encoding to another.
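For a sense of what such a conversion involves, the following is a hand-written sketch of one direction, UTF-32 to UTF-8, using only the standard library. It is not ICU's or Win32's API; the function names append_utf8 and utf32_to_utf8 are invented for this example, and replacing invalid input with U+FFFD is one possible policy among several.

```cpp
#include <string>

// Append the UTF-8 encoding of one Unicode code point to `out`.
// Returns false (and appends nothing) for values that are not valid
// scalar values: surrogates U+D800..U+DFFF and anything above U+10FFFF.
inline bool append_utf8(char32_t cp, std::string& out) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return true;
}

// Transcode a UTF-32 string to UTF-8; invalid values become U+FFFD.
inline std::string utf32_to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        if (!append_utf8(cp, out)) append_utf8(0xFFFD, out);
    }
    return out;
}
```

A library such as ICU additionally covers legacy code pages, normalization and error reporting, which this sketch deliberately leaves out.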

Depending on your requirements (speed, memory usage), you should pick an internal format that suits the platform. On Windows it might be UTF-16, and on Linux it might be UTF-32. That way you won't have to transcode strings all the time just to perform simple platform-provided operations on them (wcslen(), wcscmp(), etc.).
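One consequence of following the platform's format is that the same wide literal has a different length in wchar_t units on different platforms. A small sketch, with a hypothetical helper name, assuming the compiler encodes wide literals as UTF-32 when wchar_t is 32 bits and as UTF-16 when it is 16 bits:

```cpp
#include <cstddef>
#include <cwchar>

// U+1F600 lies outside the Basic Multilingual Plane.
// With a 32-bit wchar_t it occupies one unit; with a 16-bit wchar_t the
// compiler stores it as a surrogate pair, i.e. two units.
inline std::size_t emoji_length_in_wchar_units() {
    const wchar_t* text = L"\U0001F600";
    return std::wcslen(text);   // counts wchar_t units, not user-visible characters
}
```

So wcslen() is cheap and convenient on the native format, but its result should be read as "number of wchar_t units", which only equals the number of code points where wchar_t is 32 bits wide.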

For external formats (text files, etc.), I tend to use UTF-8. The reason is that files are considerably smaller if they contain text in a Western language. Another benefit is that you don't have to consider endianness in UTF-8, which makes errors (on your part or someone else's) less likely.
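The endianness point can be illustrated with a short sketch: when UTF-16 is written to a file, somebody has to choose a byte order, whereas UTF-8 is defined as a byte sequence and has only one form. The function utf16_bytes below is a hypothetical helper written for this example.

```cpp
#include <string>

// Serialize UTF-16 code units to bytes in an explicitly chosen byte order.
inline std::string utf16_bytes(const std::u16string& units, bool big_endian) {
    std::string bytes;
    for (char16_t u : units) {
        const char hi = static_cast<char>((u >> 8) & 0xFF);
        const char lo = static_cast<char>(u & 0xFF);
        if (big_endian) { bytes += hi; bytes += lo; }
        else            { bytes += lo; bytes += hi; }
    }
    return bytes;
}
```

For the euro sign (U+20AC) this yields the bytes 20 AC in big-endian order and AC 20 in little-endian order, while its UTF-8 form is always E2 82 AC. For plain ASCII text, UTF-16 needs two bytes per character where UTF-8 needs one, which is where the size advantage for Western-language text comes from.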

IBM ICU is a very big and capable library for handling Unicode strings. It might, however, be like using a sledgehammer to drive in a small nail. Do you need all of its functionality? The Unicode functionality supported by the target platform might meet your requirements.


In principle, yes, wchar_t can change with a new compiler version. It is a language feature, though, not a library one, so it depends on the compiler rather than on libstdc++.

In practice, the odds of it suddenly changing size are pretty much zero.

It's not really clear what you actually need, though. wchar_t just allows you to store wide characters, and not much more. ICU is a complete Unicode library which does a lot more, and is pretty much essential if you want to do more complex text processing than simply printing strings.

Finally, on *nix, plain chars and std::string usually hold UTF-8-encoded text, so they are perfectly suitable for storing Unicode text. For that reason, wchar_t is rarely used there.
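As a rough illustration of what storing UTF-8 in a std::string means in practice: size() reports bytes, not characters, so counting code points needs a small amount of extra work. The sketch below assumes well-formed UTF-8 input and uses a made-up function name; it counts code points, which is still not the same as user-perceived characters.

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-8 string by skipping continuation bytes (10xxxxxx).
// Assumes the input is well-formed UTF-8; no validation is performed.
inline std::size_t utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) ++n;   // lead byte or ASCII byte starts a code point
    }
    return n;
}
```

For example, the string "caf\xC3\xA9" ("café" in UTF-8) has a size() of 5 bytes but contains 4 code points. Anything beyond this, such as case mapping, collation or grapheme segmentation, is where a library like ICU comes in.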
