Unicode Portability


I'm currently taking care of an application that uses std::string and char for string operations - which is fine on Linux, since Linux is agnostic to Unicode (or so it seems; I don't really know, so please correct me if I'm telling stories here). This style naturally leads to this kind of function/class declaration:

std::string doSomethingFunkyWith(const std::string& thisdata)
{
    /* .... */
}

However, if thisdata contains Unicode characters, it will be displayed wrongly on Windows, since std::string can't hold Unicode characters on Windows.

So I thought up this concept:

namespace MyApplication {
#ifdef UNICODE
    typedef std::wstring  string_type;
    typedef wchar_t       char_type;
#else
    typedef std::string   string_type;
    typedef char          char_type;
#endif

    /* ... */
    string_type doSomethingFunkyWith(const string_type& thisdata)
    {
        /* ... */
    }
}
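
(The same switch would presumably be needed for string literals, so the concept would also carry a TEXT()-style companion macro - MY_TEXT here is just a name I made up:)

#ifdef UNICODE
    // Token-pastes an L prefix onto the literal in the wide build,
    // so MY_TEXT("hello") becomes L"hello" when UNICODE is defined.
    #define MY_TEXT(s) L##s
#else
    #define MY_TEXT(s) s
#endif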

Is this a good concept to go with to support Unicode on Windows?

My current toolchain consists of gcc/clang on Linux, and wine+MinGW for Windows support (cross-testing also happens via wine), if that matters.


Multiplatform issues come from the fact that there are many encodings, and picking the wrong one leads to encóding íssues. Once you tackle that problem, you should be able to use std::wstring throughout your program.

The usual workflow is:

raw_input_data = read_raw_data()
input_encoding = "???" // What is your file or terminal encoding?

unicode_data = convert_to_unicode(raw_input_data, input_encoding)

// Do something with the unicode_data, store in some var, etc.

output_encoding = "???" // Is your terminal output encoding the same as your input?
raw_output_data = convert_from_unicode(unicode_data, output_encoding)

print_raw_data(raw_output_data)

Most Unicode issues come from wrongly detecting the values of input_encoding and output_encoding. On a modern Linux distribution this is usually UTF-8. On Windows YMMV.
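
One way to probe those values at runtime is to ask the locale on POSIX and the console code page on Windows (a sketch; it prints whatever your environment reports):

#include <clocale>
#include <cstdio>
#ifdef _WIN32
#include <windows.h>
#else
#include <langinfo.h>
#endif

int main()
{
    std::setlocale(LC_ALL, "");  // adopt the user's locale settings
#ifdef _WIN32
    // Code page 65001 would mean the console expects UTF-8.
    std::printf("console output code page: %u\n", GetConsoleOutputCP());
#else
    // Typically prints "UTF-8" on a modern Linux distribution.
    std::printf("locale encoding: %s\n", nl_langinfo(CODESET));
#endif
    return 0;
}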

Standard C++ doesn't know about encodings; you should use a library like ICU to do the conversion.
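
For example, both conversion steps of the workflow above can be done with ICU's UnicodeString (a sketch, linking against -licuuc; the hard-coded "ISO-8859-1" stands in for whatever input encoding you actually detected):

#include <unicode/unistr.h>
#include <iostream>
#include <string>

int main()
{
    // Pretend these bytes came from read_raw_data(): "café" in ISO-8859-1.
    std::string raw_input_data = "caf\xE9";

    // convert_to_unicode: decode the raw bytes into ICU's internal UTF-16.
    icu::UnicodeString unicode_data(raw_input_data.c_str(), "ISO-8859-1");

    // convert_from_unicode: re-encode for output, here as UTF-8.
    std::string raw_output_data;
    unicode_data.toUTF8String(raw_output_data);

    std::cout << raw_output_data << "\n";  // "café" on a UTF-8 terminal
    return 0;
}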


How you store a string within your application is entirely up to you -- after all, nobody would know as long as the strings stay within your application. The problem starts when you try to read or write strings from the outside world (console, files, sockets etc.) and this is where the OS matters.

Linux isn't exactly "agnostic" to Unicode -- it does recognize Unicode, but the standard library functions assume UTF-8 encoding, so Unicode strings fit into standard char arrays. Windows, on the other hand, uses UTF-16 encoding, so you need a wchar_t array to hold its 16-bit code units.

The typedefs you proposed should work fine, but keep in mind that this alone doesn't make your code portable. As an example, if you want to store text in files in a portable manner, you should choose one encoding and stick to it across all platforms -- this could require converting between encodings on certain platforms.
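
For example, if you settle on UTF-8 as the on-disk encoding but keep std::wstring internally on Windows, the conversion at the boundary could look like this (a sketch using the Win32 conversion APIs, error handling omitted):

#include <windows.h>
#include <string>

std::string to_utf8(const std::wstring& w)
{
    if (w.empty()) return std::string();
    // First call computes the required buffer size, second call converts.
    int n = WideCharToMultiByte(CP_UTF8, 0, w.c_str(), (int)w.size(),
                                nullptr, 0, nullptr, nullptr);
    std::string out(n, '\0');
    WideCharToMultiByte(CP_UTF8, 0, w.c_str(), (int)w.size(),
                        &out[0], n, nullptr, nullptr);
    return out;
}

std::wstring from_utf8(const std::string& s)
{
    if (s.empty()) return std::wstring();
    int n = MultiByteToWideChar(CP_UTF8, 0, s.c_str(), (int)s.size(),
                                nullptr, 0);
    std::wstring out(n, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, s.c_str(), (int)s.size(), &out[0], n);
    return out;
}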


Linux does support Unicode, it simply uses UTF-8. Probably a better way to make your system portable would be to make use of International Components for Unicode (ICU), treat all std::string objects as containing UTF-8, and convert to UTF-16 as needed when invoking Windows functions. It almost always makes sense to use UTF-8 over UTF-16, as UTF-8 uses less space for some of the most commonly used characters (e.g. English*) and more space for less frequent ones, whereas UTF-16 spends at least two bytes on every character, no matter how frequently it is used.

While you can use your typedefs, this means every single function that deals with strings has to be written to work with both character types. I think it would be more efficient to do all internal computations in UTF-8 and translate to/from UTF-16 only when inputting/outputting, as needed.

*For HTML, XML, and JSON, which use English keywords as part of the format (e.g. <html>, <body>, etc.) regardless of the language of the content, UTF-8 can still be a win even for text in other languages.
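
A sketch of that approach, again assuming ICU (the helper name utf8_to_utf16 is just for illustration): the UTF-8 std::string stays the internal representation, and a UTF-16 buffer is produced only at the Windows API boundary.

#include <unicode/unistr.h>
#include <string>

// Windows-only sketch: wchar_t is 16 bit there, matching ICU's internal
// UTF-16 code units, so they can be copied straight into a std::wstring
// suitable for the W-suffixed Win32 functions.
std::wstring utf8_to_utf16(const std::string& utf8)
{
    icu::UnicodeString u = icu::UnicodeString::fromUTF8(utf8);
    return std::wstring(reinterpret_cast<const wchar_t*>(u.getBuffer()),
                        u.length());
}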


The problem with Unicode on Linux is that all the I/O and most system functions use UTF-8, while the wide character type is 32 bit. Then there is interfacing with Java and other programs, which requires UTF-16.

As a suggestion for Unicode support, see the OpenRTL library at http://code.google.com/p/openrtl which supports UTF-8, UTF-16 and UTF-32 on Windows, Linux, OS X and iOS. The Unicode support covers not just the character types, but also Unicode collation, normalization, case folding, title casing and about 64 different Unicode character properties per full unsigned 32 bit character.

The OpenRTL code is ready now to support char8_t, char16_t and char32_t for the new C++ standards as well, although the same character types are supported using macros for existing C and C++ compilers. I think it might be what you want for Unicode and string processing in your library.

The point is that if you use OpenRTL, you can build the system using the OpenRTL "char_t" type. This supports the notion that your entire library can be built in UTF-8, UTF-16 or UTF-32 mode, even on Linux, because OpenRTL is already handling the interfacing to a lot of system functions like files and I/O. It has its own print_f functions, for example.

By default char_t maps to the wide character type, so on Windows it is 16 bit and on Linux it is 32 bit. But you can also make it 8 bit everywhere, for example. It also has support for fast UTF decoding inside loops using macros.

So instead of #ifdef-ing between wchar_t and char, you can build everything using char_t and let OpenRTL take care of the rest.
