Confused about C++'s std::wstring, UTF-16, UTF-8 and displaying strings in a Windows GUI

I'm working on an English-only C++ program for Windows where we were told "always use std::wstring", but it seems like nobody on the team really has much of an understanding beyond that.

I already read the question titled "std::wstring VS std::string". It was very helpful, but I still don't quite understand how to apply all of that information to my problem.

The program I'm working on displays data in a Windows GUI. That data is persisted as XML. We often transform that XML using XSLT into HTML or XSL:FO for reporting purposes.

My feeling based on what I have read is that the HTML should be encoded as UTF-8. I know very little about GUI development, but the little bit I have read indicates that the GUI stuff is all based on UTF-16 encoded strings.

I'm trying to understand where this leaves me. Say we decide that all of our persisted data should be UTF-8 encoded XML. Does this mean that in order to display persisted data in a UI component, I should really be performing some sort of explicit UTF-8 to UTF-16 transcoding process?

I suspect my explanation could use clarification, so I'll try to provide that if you have any questions.


Windows from NT4 onwards is based on Unicode encoded strings, yes. Early versions were based on UCS-2, which is the predecessor of UTF-16, and thus does not support all of the characters that UTF-16 does. Later versions are based on UTF-16. Not all OSes are based on UTF-16/UCS-2, though. *nix systems, for instance, are based on UTF-8 instead.

UTF-8 is a very good choice for storing data persistently. It is a universally supported encoding in Unicode environments, and it strikes a good balance between data size and lossless data compatibility.

Yes, you would have to parse the XML, extract the necessary information from it, and decode and transform it into something the UI can use.
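A minimal sketch of that last step, assuming the Win32 conversion API and a helper name of my own choosing (Utf8ToWide):

    #include <windows.h>
    #include <stdexcept>
    #include <string>

    // Hypothetical helper: decode UTF-8 bytes (e.g. text pulled out of the XML)
    // into a UTF-16 std::wstring that the GUI layer can consume.
    std::wstring Utf8ToWide(const std::string& utf8)
    {
        if (utf8.empty()) return std::wstring();

        // First call: ask how many wchar_t units the result needs.
        int needed = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                         utf8.data(), static_cast<int>(utf8.size()),
                                         nullptr, 0);
        if (needed == 0) throw std::runtime_error("invalid UTF-8 input");

        // Second call: do the actual conversion into the buffer.
        std::wstring wide(static_cast<size_t>(needed), L'\0');
        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            utf8.data(), static_cast<int>(utf8.size()),
                            &wide[0], needed);
        return wide;
    }

The resulting std::wstring can then be handed to the wide (W) versions of the Win32 calls.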


On Windows, std::wstring is typically treated as UCS-2: two bytes are used for each character, and the code tables mostly map to Unicode. It's important to understand that UCS-2 is not the same as UTF-16! UTF-16 allows "surrogate pairs" in order to represent characters that fall outside the two-byte range, whereas UCS-2 uses exactly two bytes for each character, period.

The best rule for your situation is to do your transcoding when you read from and write to disk. Once it's in memory, keep it in UCS-2 format. The Windows APIs will read it as if it were UTF-16: std::wstring itself has no concept of surrogate pairs, but if you create them manually (which you won't, if your only language is English), Windows will still read them.

Whenever you're reading data into or out of serialization formats (such as XML) these days, you'll probably need to do transcoding. It's an unpleasant and very unfortunate fact of life, but inevitable, since the common Unicode encodings are variable-width while most character-based operations in C++ are done on arrays, which need fixed-width elements.
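To make the variable-width point concrete, here is a toy example of my own (not from the answer):

    #include <iostream>
    #include <string>

    int main()
    {
        // "café" looks like four characters to the user, but...
        std::string  utf8 = "caf\xC3\xA9";   // 5 bytes in UTF-8 (the é takes two)
        std::wstring wide = L"caf\u00e9";    // 4 wchar_t units on Windows

        std::cout << utf8.size() << " UTF-8 bytes vs "
                  << wide.size() << " wide code units\n";   // 5 vs 4
        return 0;
    }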

Higher-level frameworks, such as .NET, obscure most of the details, but behind the scenes, they're handling the transcoding in the same fashion: changing variable-width data to fixed-width strings, manipulating them, and then changing them back into variable-width encodings when required for output.


AFAIK, when you work with std::wstring on Windows in C++ and store your files as UTF-8 (which sounds good and reasonable), you have to convert the data to UTF-8 when writing to a file and convert it back to UTF-16 when reading from one. Check out this link: Writing UTF-8 Files in C++.
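For the write direction, a sketch along those lines (the helper and function names are my own, not from the linked article):

    #include <windows.h>
    #include <fstream>
    #include <stdexcept>
    #include <string>

    // Hypothetical helper: encode a UTF-16 std::wstring as UTF-8 bytes
    // using the Win32 WideCharToMultiByte API.
    std::string WideToUtf8(const std::wstring& wide)
    {
        if (wide.empty()) return std::string();

        int needed = WideCharToMultiByte(CP_UTF8, 0,
                                         wide.data(), static_cast<int>(wide.size()),
                                         nullptr, 0, nullptr, nullptr);
        if (needed == 0) throw std::runtime_error("conversion failed");

        std::string utf8(static_cast<size_t>(needed), '\0');
        WideCharToMultiByte(CP_UTF8, 0,
                            wide.data(), static_cast<int>(wide.size()),
                            &utf8[0], needed, nullptr, nullptr);
        return utf8;
    }

    // Writing in binary mode keeps the UTF-8 bytes exactly as produced.
    void SaveUtf8File(const std::wstring& text, const std::string& path)
    {
        std::ofstream out(path, std::ios::binary);
        out << WideToUtf8(text);
    }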

I would stick with the Visual Studio default of project -> Properties -> Configuration Properties -> General -> Character Set -> Use Unicode Character Set, use the wchar_t type (i.e. with std::wstring) and not use the TCHAR type. (E.g. I would just use the wcslen version of strlen and not _tcslen.)
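For illustration only, a trivial example of my own: with the Unicode character set selected you can use the wide types and functions directly and skip the TCHAR indirection entirely.

    #include <cwchar>
    #include <string>

    int main()
    {
        // Wide literal and std::wstring; no TCHAR or _T()/TEXT() macros needed.
        const wchar_t* greeting = L"Hello, Windows";
        std::wstring title(greeting);

        // wcslen is the wide-character counterpart of strlen.
        std::size_t len = wcslen(greeting);      // 14 code units

        return (len == title.size()) ? 0 : 1;    // both report 14
    }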


One advantage to using std::wstring on Windows for GUI-related strings is that internally all Windows API calls use and operate on UTF-16. If you've ever noticed, there are two versions of every Win32 API call that takes string arguments, for example "MessageBoxA" and "MessageBoxW". Both declarations exist in <windows.h>, and in fact you can call whichever you want, but if <windows.h> is included with Unicode support enabled, then the following will happen:

#define MessageBox MessageBoxW

Then you get into TCHARs and other Microsoft tricks intended to make it easier to deal with APIs that have both an ANSI and a Unicode version. In short, you can call either, but under the hood the Windows kernel is Unicode-based, so you'll pay the cost of converting to Unicode on every string-accepting Win32 API call if you don't use the wide-character version.
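A small sketch of the A/W pair in practice (the message text is just my own example):

    #include <windows.h>

    int main()
    {
        // Wide version: the UTF-16 strings are passed straight through.
        MessageBoxW(nullptr, L"UTF-16 text goes straight to the API.",
                    L"MessageBoxW", MB_OK);

        // ANSI version: these narrow strings get converted to UTF-16 internally
        // before the real work happens, which is the conversion cost mentioned above.
        MessageBoxA(nullptr, "This text is converted under the hood.",
                    "MessageBoxA", MB_OK);
        return 0;
    }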

See also: UTF-16 and Windows kernel use.


Even if you say you only have English in your data, you're probably wrong. Since we're in a global world now, names, addresses, etc. contain foreign characters. OK, I don't know what type of data you have, but generally I would say: build your application to support Unicode for both storing data and displaying it to the user. That suggests storing the XML as UTF-8 and using the Unicode versions of the Windows calls for the GUI. And since the Windows GUI uses UTF-16, where each code unit is 16 bits, I would suggest keeping the data inside the application in 16-bit wide strings. And I would guess your compiler for Windows has a 16-bit wchar_t, and hence std::wstring, for just this purpose.

So then you have to do a lot of conversion between UTF-16 and UTF-8. Do that with an existing library, such as ICU.
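As a rough sketch of what that conversion looks like with ICU (API names as I recall them; check the documentation for your ICU version):

    #include <unicode/unistr.h>   // icu::UnicodeString (internally UTF-16)
    #include <string>

    int main()
    {
        std::string utf8 = "caf\xC3\xA9";   // UTF-8 bytes, e.g. read from the XML

        // Decode UTF-8 into ICU's UTF-16 representation.
        icu::UnicodeString utf16 = icu::UnicodeString::fromUTF8(utf8);

        // ...utf16.getBuffer() / utf16.length() can feed UTF-16 consumers here...

        // Encode back to UTF-8 for storage.
        std::string roundTripped;
        utf16.toUTF8String(roundTripped);

        return roundTripped == utf8 ? 0 : 1;
    }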
