Signedness of char and Unicode in C++0x

Developer https://www.devze.com 2022-12-21 22:27 Source: web
From the C++0x working draft, the new char types (char16_t and char32_t) for handling Unicode will be unsigned (uint_least16_t and uint_least32_t will be the underlying types).

But as far as I can see (not very far, perhaps), a type char8_t (based on uint_least8_t) is not defined. Why?

And it's even more confusing when you see that a new u8 encoding prefix is introduced for UTF-8 string literals, yet it is based on our old friend, the (signed or unsigned) char. Why?

Update: There's a proposal to add a new type, char8_t:

char8_t: A type for UTF-8 characters and strings (Revision 1) http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0482r1.html


char will be the type used for UTF-8, because its definition has been changed to guarantee that it can hold UTF-8 code units:

For the purpose of enhancing support for Unicode in C++ compilers, the definition of the type char has been modified to be both at least the size necessary to store an eight-bit coding of UTF-8 and large enough to contain any member of the compiler's basic execution character set. It was previously defined as only the latter. There are three Unicode encodings that C++0x will support: UTF-8, UTF-16, and UTF-32. In addition to the previously noted changes to the definition of char, C++0x will add two new character types: char16_t and char32_t. These are designed to store UTF-16 and UTF-32 respectively.

Source: http://en.wikipedia.org/wiki/C%2B%2B0x

Most UTF-8 applications on PC/Mac already use char anyway.
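As a rough illustration of char serving as the UTF-8 storage type, here is a minimal sketch. The helper name count_code_points is made up for this example, and the code assumes the input is well-formed UTF-8. Each byte is routed through unsigned char before testing its bits, so the result does not depend on whether plain char is signed on a given platform.

```cpp
#include <climits>
#include <cstddef>
#include <string>

// char must be wide enough to hold one 8-bit UTF-8 code unit.
static_assert(CHAR_BIT >= 8, "char holds at least 8 bits");

// Hypothetical helper (not from the standard library): counts the code points
// in a UTF-8 string stored in plain char. Assumes well-formed input.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for (char c : utf8) {
        // Continuation bytes have the bit pattern 10xxxxxx. Every other byte
        // starts a new code point.
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For instance, the six-byte string "h\xC3\xA9llo" (the word "héllo" encoded as UTF-8) contains five code points.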


char16_t and char32_t are supposed to be usable for representing code points. Since there are no negative code points, it's sensible for these to be unsigned.

UTF-8 does not represent code points directly, so it doesn't matter whether u8's underlying type is signed or not.
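To make that point concrete, here is a small sketch that decodes a two-byte UTF-8 sequence from either signed or unsigned bytes. The helper decode_two_bytes is invented for this illustration and assumes a well-formed two-byte sequence. Only the bit pattern of each byte is used, so both byte types yield the same code point.

```cpp
#include <cstdint>

// Hypothetical helper: decodes one two-byte UTF-8 sequence (110xxxxx 10xxxxxx).
// Assumes the two bytes form a well-formed sequence; no validation is done.
template <typename Byte>
char32_t decode_two_bytes(Byte b0, Byte b1)
{
    // Reduce each byte to its 8-bit pattern first, so the result does not
    // depend on whether Byte is a signed type.
    const std::uint_least32_t u0 = static_cast<unsigned char>(b0);
    const std::uint_least32_t u1 = static_cast<unsigned char>(b1);
    // Five payload bits from the lead byte, six from the continuation byte.
    return static_cast<char32_t>(((u0 & 0x1Fu) << 6) | (u1 & 0x3Fu));
}
```

The bytes 0xC3 0xA9 encode U+00E9. Passed as signed char they are the values -61 and -87, yet the decoded code point is the same.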


The C++0x draft doesn't seem to indicate whether or not the new Unicode character types are signed or unsigned. However, as others have already mentioned, since there are no negative Unicode codepoints it would make more sense for char16_t and char32_t to be unsigned. (Then again, it would have made sense for char to be unsigned, yet we've been dealing with "negative" characters since the 70s.)

Also, since a UTF-16 code unit ranges from 0x0000 through 0xFFFF (ignoring surrogate pairs), you'd need the entire range of an unsigned 16-bit integer to properly represent all values. It would be awkward, to say the least, if code points 0x8000 through 0xFFFF were represented as negative numbers with a char16_t.
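The range argument can be written down as compile-time checks. This is only a sketch, assuming a compiler that implements the char16_t wording of the final C++11 standard. The helper in_upper_half is a made-up name used for illustration.

```cpp
#include <limits>

// With an unsigned char16_t, the smallest value is zero and the largest
// covers every 16-bit code unit.
static_assert(std::numeric_limits<char16_t>::min() == 0,
              "char16_t has no negative values");
static_assert(std::numeric_limits<char16_t>::max() >= 0xFFFF,
              "char16_t covers all 16-bit code units");

// Hypothetical helper: true for code units in 0x8000..0xFFFF, the half that
// a signed 16-bit type would have to represent as negative numbers.
constexpr bool in_upper_half(char16_t unit)
{
    return unit >= 0x8000;
}
```

Because the comparison is done on an unsigned type, a unit such as 0xFFFF compares greater than 0x8000, as one would expect from a code unit value.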

Anyway, until the C++0x committee says something definitive on the matter, you can always just check your implementation:

#include <type_traits>
#include <iostream>

int main()
{
    // Prints "true" if char16_t is a signed type on this implementation, "false" otherwise.
    std::cout << std::boolalpha << std::is_signed<char16_t>::value << std::endl;
}

This prints out false using GCC 4.4.5 on Linux. So on one platform, at least, the new Unicode types are definitely unsigned.
