Currently, I am developing an app for a customer in China. Customers in China mostly set their OS encoding to GB2312. I need to write a text file encoded in GB2312.
- I use std::ofstream file
- I compile my application in MBCS mode, not Unicode.
- I use the following code to convert a CString to a std::string, and then write it to the file using ofstream:
std::string Utils::ToString(CString& cString) {
    /* Will not work correctly if we are compiled in Unicode mode. */
    return (LPCTSTR)cString;
}
To my surprise, it just works. I thought I would at least need to use wstring, so I tried to investigate.
Here is the generated MBCS.txt:
Screenshot: http://sites.google.com/site/yanchengcheok/Home/stackoverflow0.PNG
- I try to print a single character, 脚 (its value is 0xBDC5).
- When I use a CString to carry this character, its length is 2.
- When I use Utils::ToString to convert it to a std::string, the returned string's length is 2.
- I write it to the file using std::ofstream.
My questions are:
1. When I examine MBCS.txt in a hex editor, the value is displayed as BD (LSB) and C5 (MSB). But I am using a little-endian machine. Shouldn't the hex editor show me C5 (LSB) and BD (MSB)? I checked Wikipedia, and GB2312 doesn't seem to specify an endianness.
2. It seems that std::string + CString works fine in my case. In what cases will the above approach not work, and when should I start to use wstring?
About 1. Endianness is a problem you meet when you serialize a unit in terms of smaller units (e.g. when you serialize 16-bit units as octets). I'm far from being a specialist in CJK encodings, but it seems to me that GB2312 is a coded character set which can be used with several encoding schemes. The encoding schemes cited on the Wikipedia page as being used for GB2312 (ISO 2022, EUC-CN and HZ) are all defined in terms of octets. So there is no endianness issue when they are serialized as octets.
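As a rough illustration of that point, here is a minimal sketch, assuming the file holds EUC-CN (the usual octet-based encoding of GB2312 on Windows). The helper names SerializedBytes and JiaoInEucCn are made up for this example, and an in-memory stream stands in for the std::ofstream. A std::string is just a sequence of octets, and a stream writes them in sequence order, so the CPU's endianness never comes into play.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: writes `text` to a byte stream (a stand-in for a
// std::ofstream opened in binary mode) and returns the octets in the
// order in which they were written.
std::vector<unsigned char> SerializedBytes(const std::string& text) {
    std::ostringstream out(std::ios::binary);
    out.write(text.data(), static_cast<std::streamsize>(text.size()));
    assert(out.good());
    const std::string written = out.str();
    return std::vector<unsigned char>(written.begin(), written.end());
}

// Hypothetical helper: the character from the question as two EUC-CN
// octets, lead byte 0xBD followed by trail byte 0xC5.
std::string JiaoInEucCn() {
    return std::string("\xBD\xC5", 2);
}
```

Calling SerializedBytes(JiaoInEucCn()) gives the two octets BD, C5 in that order on any machine, which matches what the hex editor shows. The "0xBDC5" notation is only a compact way of writing "BD then C5"; it is not a 16-bit integer that gets stored in memory.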
Contrast this with the Unicode encoding schemes: UTF-8 is defined in terms of octets and has no endianness issue when serialized as octets; UTF-16 is defined in terms of 16-bit units, so endianness must be specified when it is serialized as octets; UTF-32 is defined in terms of 32-bit units, so endianness must likewise be specified when it is serialized as octets.
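To make the contrast concrete, here is a small sketch, assuming the character from the question is U+811A in Unicode (that is my understanding of the mapping for 脚). The function names are invented for this example. The same 16-bit code unit yields two different octet sequences depending on the byte order chosen, while the UTF-8 form has only one possible order.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Serialize one UTF-16 code unit as two octets, least significant first.
std::array<unsigned char, 2> ToUtf16Le(std::uint16_t unit) {
    return {{static_cast<unsigned char>(unit & 0xFF),
             static_cast<unsigned char>(unit >> 8)}};
}

// Serialize one UTF-16 code unit as two octets, most significant first.
std::array<unsigned char, 2> ToUtf16Be(std::uint16_t unit) {
    return {{static_cast<unsigned char>(unit >> 8),
             static_cast<unsigned char>(unit & 0xFF)}};
}

// Encode a code point in U+0800..U+FFFF (surrogates excluded) as UTF-8.
// Only this three-octet range is handled, to keep the sketch short.
std::array<unsigned char, 3> ToUtf8ThreeByte(std::uint16_t cp) {
    assert(cp >= 0x0800 && (cp < 0xD800 || cp > 0xDFFF));
    return {{static_cast<unsigned char>(0xE0 | (cp >> 12)),
             static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F)),
             static_cast<unsigned char>(0x80 | (cp & 0x3F))}};
}
```

For 0x811A, ToUtf16Le gives 1A 81, ToUtf16Be gives 81 1A, and ToUtf8ThreeByte gives E8 84 9A. A reader of the UTF-16 file has to be told which of the two orders was used (by a BOM or by convention), whereas the UTF-8 octets, like the EUC-CN ones, can only appear in one order.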