How to use Unicode in C++?

Assume a very simple program that:

asks for a name.
stores the name in a variable.
displays the variable's content on the screen.

It's so simple that it is the first thing that one learns.

But my problem is that I don't know how to do the same thing if I enter the name using Japanese characters.

So, if you know how to do this in C++, please show me an example (that I can compile and test).

Thanks.

user362981 : Thanks for your help. I compiled the code that you wrote without problems, then the console window appears and I cannot enter any Japanese characters in it (using an IME). Also, if I change a word in your code ("hello") to one that contains Japanese characters, it will not display these either.

Svisstack : Also thanks for your help. But when I compile your code I get the following error:

warning: deprecated conversion from string constant to 'wchar_t*'
error: too few arguments to function 'int swprintf(wchar_t*, const wchar_t*, ...)'
error: at this point in file
warning: deprecated conversion from string constant to 'wchar_t*'

You're going to get a lot of answers about wide characters. Wide characters, specifically wchar_t, do not equal Unicode. You can use them (with some pitfalls) to store Unicode, just as you can with an unsigned char. wchar_t is extremely system-dependent. To quote the Unicode Standard, version 5.2, chapter 5:

With the wchar_t wide character type, ANSI/ISO C provides for inclusion of fixed-width, wide characters. ANSI/ISO C leaves the semantics of the wide character set to the specific implementation but requires that the characters from the portable C execution set correspond to their wide character equivalents by zero extension.

and that

The width of wchar_t is compiler-specific and can be as small as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be Unicode characters in some compilers.

So, it's implementation-defined. Here are two implementations: On Linux, wchar_t is 4 bytes wide and represents text in the UTF-32 encoding (regardless of the current locale), in either BE or LE byte order, whichever is native to your system. Windows, however, has a 2-byte-wide wchar_t and uses it to represent UTF-16 code units. Completely different.
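
To see which case applies on a given compiler, a small helper can report the width. This is only a sketch: the names wchar_bits and likely_wide_encoding are made up for illustration, and the labels reflect common convention on those platforms, not anything the C++ standard promises.

```cpp
#include <climits>
#include <cstddef>
#include <string>

// Number of bits in one wchar_t on the platform this is compiled for.
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// Guesses the encoding form usually associated with a wchar_t of this width.
// These labels are conventions (assumptions), not guarantees.
std::string likely_wide_encoding() {
    switch (wchar_bits()) {
        case 16: return "UTF-16 code units (typical on Windows)";
        case 32: return "UTF-32 code points (typical on Linux)";
        default: return "implementation-specific";
    }
}
```

Printing likely_wide_encoding() from main on each platform you target shows quickly whether code that assumes one width will carry over to the other.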

A better path: learn about locales, as you'll need to know them. For example, because I have my environment set up to use UTF-8 (Unicode), the following program will use Unicode:

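#include <clocale>
#include <iostream>
#include <string>

int main()
{
    // Adopt the locale configured in the environment (e.g. LANG=en_US.UTF-8).
    std::setlocale(LC_ALL, "");
    std::cout << "What's your name? ";
    std::string name;
    // Read the whole line as raw bytes, in whatever encoding the terminal sends.
    std::getline(std::cin, name);
    std::cout << "Hello there, " << name << "." << std::endl;
    return 0;
}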

...

$ ./uni_test
What's your name? 佐藤 幹夫
Hello there, 佐藤 幹夫.
$ echo $LANG
en_US.UTF-8

But there's nothing Unicode about it. It merely reads in characters, which come in as UTF-8 because I have my environment set that way. I could just as easily say "heck, I'm part Czech, let's use ISO-8859-2": Suddenly, the program is getting input in ISO-8859-2, but since it's just regurgitating it, it doesn't matter, the program will still perform correctly.

Now, if that example had read in my name, and then tried to write it out into an XML file, and stupidly wrote <?xml version="1.0" encoding="UTF-8" ?> at the top, it would be right when my terminal was in UTF-8, but wrong when my terminal was in ISO-8859-2. In the latter case, it would need to convert it before serializing it to the XML file. (Or, just write ISO-8859-2 as the encoding for the XML file.)

On many POSIX systems, the current locale is typically UTF-8, because it provides several advantages to the user, but this isn't guaranteed. Just outputting UTF-8 to stdout will usually be correct, but not always. Say I am using ISO-8859-2: if you mindlessly output an ISO-8859-1 "è" (0xE8) to my terminal, I'll see a "č" (0xE8). Likewise, if you output a UTF-8 "è" (0xC3 0xA8), I'll see (ISO-8859-2) "Ă¨" (0xC3 0xA8). This barfing of incorrect characters has been called Mojibake.
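
A quick way to check what is actually being sent is to dump the bytes. The helper below (hex_bytes is a hypothetical name, not a library function) renders each byte of a string in hexadecimal, so the two byte sequences for "è" mentioned above can be told apart no matter how the terminal chooses to display them.

```cpp
#include <string>

// Renders every byte of `s` as two uppercase hex digits, separated by spaces.
std::string hex_bytes(const std::string& s) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char byte : s) {
        if (!out.empty()) {
            out += ' ';
        }
        out += digits[byte >> 4];    // high nibble
        out += digits[byte & 0x0F];  // low nibble
    }
    return out;
}
```

With this, hex_bytes("\xC3\xA8") gives "C3 A8" (the UTF-8 form) and hex_bytes("\xE8") gives "E8" (the ISO-8859-1 form).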

Often, you're just shuffling data around, and it doesn't matter much. This typically comes into play when you need to serialize data. (Many internet protocols use UTF-8 or UTF-16, for example: if you got data from an ISO-8859-2 terminal, or a text file encoded in Windows-1252, then you have to convert it, or you'll be sending Mojibake.)

Sadly, this is about the state of Unicode support, in both C and C++. You have to remember: these languages are really system-agnostic, and don't bind to any particular way of doing it. That includes character-sets. There are tons of libraries out there, however, for dealing with Unicode and other character sets.

In the end, it's not all that complicated really: Know what encoding your data is in, and know what encoding your output should be in. If they're not the same, you need to do a conversion. This applies whether you're using std::cout or std::wcout. In my examples, stdin or std::cin and stdout/std::cout were sometimes in UTF-8, sometimes ISO-8859-2.
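
As a sketch of what such a conversion looks like in the simplest case: ISO-8859-1 bytes map one-to-one onto the first 256 Unicode code points, so turning them into UTF-8 needs no lookup table. latin1_to_utf8 is a hypothetical helper name. ISO-8859-2 and most other encodings do need a mapping table, which is where a library such as iconv or ICU comes in.

```cpp
#include <string>

// Converts ISO-8859-1 (Latin-1) bytes to UTF-8.
// Each Latin-1 byte value is also its Unicode code point (U+0000..U+00FF).
std::string latin1_to_utf8(const std::string& latin1) {
    std::string utf8;
    for (unsigned char byte : latin1) {
        if (byte < 0x80) {
            // ASCII range: identical in both encodings.
            utf8 += static_cast<char>(byte);
        } else {
            // U+0080..U+00FF: two-byte UTF-8 sequence 110xxxxx 10xxxxxx.
            utf8 += static_cast<char>(0xC0 | (byte >> 6));
            utf8 += static_cast<char>(0x80 | (byte & 0x3F));
        }
    }
    return utf8;
}
```

For the "è" from the earlier example, the single byte 0xE8 becomes the two bytes 0xC3 0xA8.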

Try replacing cout with wcout, cin with wcin, and string with wstring. Depending on your platform, this may work:

#include <iostream>
#include <string>

int main() {
  std::wstring name;
  std::wcout << L"Enter your name: "; 
  std::wcin >> name;
  std::wcout << L"Hello, " << name << std::endl;
}

There are other ways, but this is sort of the "minimal change" answer.

Pre-requisite: http://www.joelonsoftware.com/articles/Unicode.html

The above article is a must-read that explains what Unicode is, but a few lingering questions remain. Yes, Unicode has a unique code point for every character in every language, and furthermore those code points can be encoded and stored in memory differently from the actual code point value. This way we can save memory by, for example, using the UTF-8 encoding, which is great if the only language supported is English, since the memory representation is then essentially the same as ASCII; this of course assumes we know the encoding itself. In theory, if we know the encoding, we can store these longer Unicode characters however we like and read them back. But the real world is a little different.

How do you store a Unicode character or string in a C++ program? Which encoding do you use? The answer is that you don’t use any encoding; you store the Unicode code points directly in a Unicode character string, just as you store ASCII characters in an ASCII string. The question is then what character size to use, since Unicode characters have no fixed size. The simple answer is to choose a character size wide enough to hold the highest code point (language) that you want to support.
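
If you do want one element per code point regardless of platform, C++11's char32_t and std::u32string are one option. A small sketch follows; sample_name is a made-up function, and the name is spelled with \u escapes so the result does not depend on the encoding of the source file.

```cpp
#include <string>

// Returns a Japanese name (two characters, a space, two characters)
// as a sequence of Unicode code points, one char32_t per code point.
std::u32string sample_name() {
    return U"\u4F50\u85E4 \u5E79\u592B";
}
```

Here sample_name().size() is 5, one element per code point. Printing such a string still requires converting it to whatever encoding the output expects.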

The theory that a Unicode character can take 2 bytes or more still holds true, and this can create some confusion. Shouldn’t we then be storing code points in 3 or 4 bytes, which is really what it takes to represent all Unicode characters? Why, then, does Visual C++ store Unicode in wchar_t, which is only 2 bytes wide and clearly not enough to store every Unicode code point?

The reason we store a Unicode code point in 2 bytes in Visual C++ is exactly the same reason we used to store an ASCII (= English) character in one byte. At that time we were thinking only of English, so one byte was enough. Now we are thinking of most international languages out there, but not all, so we use 2 bytes, which is enough. Yes, it’s true that this representation will not let us represent code points that take 3 bytes or more, but we don’t care about those yet because those folks haven’t even bought a computer yet. And yes, we are not using 3 or 4 bytes because we are still stingy with memory: why store an extra zero byte with every character when we are never going to use it (that language)? Again, this is exactly the same reasoning by which ASCII stored each character in one byte: why store a character in 2 or more bytes when English can be represented in one byte, with room to spare for those extra special characters?

In theory 2 bytes are not enough to represent every Unicode code point, but they are enough to hold anything that we may care about for now. A true Unicode string representation could store each character in 4 bytes, but we just don’t care about those languages.
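
For what it is worth, a 2-byte unit does not have to give up on the higher code points: UTF-16 represents anything above U+FFFF as a pair of 16-bit units (a surrogate pair). Here is a minimal sketch of that encoding step, with to_utf16 being an illustrative name rather than a standard function:

```cpp
#include <string>

// Encodes one Unicode code point as UTF-16 code units.
// Returns an empty string for values that are not valid scalar values
// (surrogates, or anything above U+10FFFF).
std::u16string to_utf16(char32_t cp) {
    std::u16string out;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return out;
    }
    if (cp < 0x10000) {
        // Fits in a single 16-bit unit.
        out += static_cast<char16_t>(cp);
    } else {
        // Split the remaining 20 bits across a high and a low surrogate.
        const char32_t v = cp - 0x10000;
        out += static_cast<char16_t>(0xD800 + (v >> 10));
        out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));
    }
    return out;
}
```

U+4F50 (佐) comes out as the single unit 0x4F50, while U+1F600 comes out as the two units 0xD83D 0xDE00.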

Imagine 1000 years from now, when we find friendly aliens in abundance and want to communicate with them, incorporating their countless languages. The size of a single Unicode character may grow further, perhaps to 8 bytes, to accommodate all their code points. That doesn’t mean we should start using 8 bytes for each Unicode character now. Memory is a limited resource; we allocate what we need.

Can I handle a Unicode string as a C-style string?

In C++, an ASCII string can still be handled in the C style, and it is fairly common to do so by grabbing its char * pointer, to which the C functions can be applied. However, applying the current C-style string functions to a Unicode string makes no sense, because the string could contain a single null byte, which terminates a C string.

A Unicode string is no longer a plain buffer of text; well, it is, but it is now more complicated than a stream of single-byte characters terminated by a null byte. This buffer can still be handled through its pointer, even in C, but that requires Unicode-compatible calls or a C library that can read and write those strings and perform operations on them.

This is made easier in C++ by a specialized class that represents a Unicode string. This class handles the complexity of the Unicode string buffer and provides an easy interface. It also decides whether each character of the Unicode string is 2 bytes or more; these are implementation details. Today it may use wchar_t (2 bytes), but tomorrow it may use 4 bytes per character to support more (lesser-known) languages. This is why it is always better to use TCHAR, which maps to the right size when the implementation changes, than a fixed size.

How do I index a Unicode string?

It is also worth noting, particularly for C-style string handling, that an index is used to traverse a string or to find a substring in it. In an ASCII string this index corresponds directly to the position of an item in the string, but it has no such meaning in a Unicode string and should be avoided.
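
To make the indexing problem concrete for a variable-width encoding such as UTF-8: the byte index and the character position drift apart as soon as a character needs more than one byte. Here is a sketch, assuming the input is valid UTF-8 (count_code_points is a hypothetical helper, not a standard function):

```cpp
#include <cstddef>
#include <string>

// Counts the code points in a UTF-8 string by counting every byte that is
// not a continuation byte (continuation bytes have the bit pattern 10xxxxxx).
// Assumes `utf8` is well-formed; no validation is performed.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the two characters 佐藤 encoded as UTF-8 (the bytes E4 BD 90 E8 97 A4), size() reports 6 while count_code_points reports 2, so index 1 lands in the middle of the first character.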

What happens to the string-terminating null byte?

Are Unicode strings still terminated by a null byte? Is a single null byte enough to terminate the string? This is an implementation question, but the null character is still one Unicode code point and, like every other code point, it must be the same size as any other (especially when no encoding is used). So the null character must be two bytes as well if the Unicode string implementation is based on wchar_t. All Unicode code points are represented with the same size, regardless of whether it is the null character or any other.
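
This can be checked directly: the empty wide literal L"" consists of nothing but its terminator, so its size is the size of one wchar_t (2 bytes with Visual C++, typically 4 on Linux). A small sketch, with both function names invented for illustration:

```cpp
#include <cstddef>
#include <cwchar>

// Bytes occupied by the terminator of a wide string: the empty literal L""
// holds only the terminating null character.
constexpr std::size_t wide_terminator_bytes() {
    return sizeof(L"");
}

// Bytes occupied by a null-terminated wide string, terminator included.
std::size_t wide_buffer_bytes(const wchar_t* s) {
    return (std::wcslen(s) + 1) * sizeof(wchar_t);
}
```

Note that std::wcslen is the wide counterpart of strlen; it looks for a whole zero wchar_t rather than a single zero byte.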

Does the Visual C++ debugger show Unicode text?

Yes. If the text buffer is of type LPWSTR or any other type that supports Unicode, Visual Studio 2005 and later display the international text in the debugger watch window (provided the fonts and language packs are installed, of course).

Summary:

C++ doesn’t use any encoding to store Unicode characters; it directly stores the Unicode code point of each character in a string. It must pick a character size large enough to hold the largest character of the desired languages (loosely speaking), and that character size is fixed and used for all characters in the string.

Right now, 2 bytes are sufficient to represent most languages that we care about, which is why 2 bytes are used to represent a code point. In the future, if a new friendly space colony is discovered that wants to communicate with us, we will have to assign new Unicode code points to their language and use a larger character size to store those strings.

You can do simple things with the generic wide character support in your OS of choice, but generally C++ doesn't have good built-in support for unicode, so you'll be better off in the long run looking into something like ICU.

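#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main()
{
    wchar_t name[256];

    // Use the environment's locale so wide I/O can convert characters.
    setlocale(LC_ALL, "");

    wprintf(L"Type a name: ");
    // %ls is the standard specifier for a wide string; the width of 255
    // leaves room for the terminating null character.
    if (wscanf(L"%255ls", name) != 1)
        return 1;

    wprintf(L"Typed name is: %ls\n", name);

    return 0;
}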