I have a wstring declared as such:
// random wstring
std::wstring str = L"abcàdëefŸg€hhhhhhhµa";
The literal would be UTF-8 encoded, because my source file is.
[EDIT: According to Mark Ransom this is not necessarily the case, the compiler will decide what encoding to use - let us instead assume that I read this string from a file encoded in e.g. UTF-8]
I would very much like to write this to a file that reads (when the text editor is set to the correct encoding)
abcàdëefŸg€hhhhhhhµa
but ofstream is not very cooperative (it refuses to take wstring parameters), and wofstream supposedly needs to know locale and encoding settings. I just want to output this set of bytes. How does one normally do this?
EDIT: It must be cross-platform, and should not rely on the encoding being UTF-8. I just happen to have a set of bytes stored in a wstring, and want to output them. It could very well be UTF-16, or plain ASCII.
For std::wstring you need std::wofstream:
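#include <fstream>
// Standard C++ takes a narrow file name; the wide-string overload is an MSVC extension.
std::wofstream f("C:\\some file.txt");
f << str;
f.close();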
std::wstring is for something like UTF-16 or UTF-32, not UTF-8. For UTF-8, you probably just want to use std::string, and write out via std::cout. Just FWIW, C++0x will have Unicode literals, which should help clarify situations like this.
Why not write the file as binary? Just use ofstream with the std::ios::binary flag. The editor should then be able to interpret it. Don't forget the Unicode byte order mark (BOM), 0xFEFF, at the beginning. You might be better off writing with a library; try one of these:
http://www.codeproject.com/KB/files/EZUTF.aspx
http://www.gnu.org/software/libiconv/
http://utfcpp.sourceforge.net/
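As a rough illustration of the binary approach described above, here is a minimal sketch that dumps the code units of a wstring unchanged, preceded by the byte order mark. The function name write_raw_wide is made up for this example. It assumes the reader of the file expects the platform's native wchar_t width and byte order, so the result is typically UTF-16 on Windows and UTF-32 on Linux; it is not a portable file format.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper (not from the original answer): writes the wchar_t
// code units of `text` to `path` unchanged, preceded by U+FEFF so that a
// reader can detect the byte order. Returns false if opening or writing fails.
bool write_raw_wide(const std::string& path, const std::wstring& text)
{
    std::ofstream out(path.c_str(), std::ios::out | std::ios::binary);
    if (!out) return false;

    const wchar_t bom = 0xFEFF;  // byte order mark, stored in native byte order
    out.write(reinterpret_cast<const char*>(&bom), sizeof bom);

    // Assumption: the consumer understands sizeof(wchar_t)-byte code units.
    out.write(reinterpret_cast<const char*>(text.data()),
              static_cast<std::streamsize>(text.size() * sizeof(wchar_t)));
    return out.good();
}
```

Calling write_raw_wide("out.txt", str) produces a file of (str.size() + 1) * sizeof(wchar_t) bytes. If the file has to be read on another platform, converting to UTF-8 first is probably the safer choice.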
There is a (Windows-specific) solution that should work for you here. Basically, convert the wstring to the UTF-8 code page and then use ofstream.
#include <windows.h>
#include <fstream>
#include <string>

// Converts len UTF-16 code units starting at buffer into a UTF-8 string.
std::string to_utf8(const wchar_t* buffer, int len)
{
    // First call: ask how many bytes the UTF-8 result needs.
    int nChars = ::WideCharToMultiByte(
        CP_UTF8,
        0,
        buffer,
        len,
        NULL,
        0,
        NULL,
        NULL);
    if (nChars == 0) return "";

    std::string newbuffer;
    newbuffer.resize(nChars);
    // Second call: convert directly into the string's own storage.
    ::WideCharToMultiByte(
        CP_UTF8,
        0,
        buffer,
        len,
        &newbuffer[0],
        nChars,
        NULL,
        NULL);
    return newbuffer;
}

std::string to_utf8(const std::wstring& str)
{
    return to_utf8(str.c_str(), (int)str.size());
}

int main()
{
    std::ofstream testFile;
    testFile.open("demo.xml", std::ios::out | std::ios::binary);
    std::wstring text =
        L"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        L"<root description=\"this is a naïve example\">\n</root>";
    std::string outtext = to_utf8(text);
    testFile << outtext;
    testFile.close();
    return 0;
}
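For platforms without WideCharToMultiByte, the same conversion can be written by hand. The following is only a sketch: the names append_utf8 and to_utf8_portable are invented for this example, and it assumes the wstring holds either UTF-16 code units (surrogate pairs are combined) or UTF-32 code points. It does not reject invalid input such as unpaired surrogates.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: appends the UTF-8 encoding of one code point to out.
static void append_utf8(std::string& out, unsigned long cp)
{
    if (cp < 0x80) {                      // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {              // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {            // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                              // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical portable counterpart of to_utf8.
// Assumption: `in` contains UTF-16 code units or UTF-32 code points.
std::string to_utf8_portable(const std::wstring& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned long cp = static_cast<unsigned long>(in[i]);
        // Combine a UTF-16 surrogate pair into a single code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            const unsigned long low = static_cast<unsigned long>(in[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;
            }
        }
        append_utf8(out, cp);
    }
    return out;
}
```

The result can be written with a plain std::ofstream opened in binary mode, exactly as in main() above. A maintained library such as ICU or utfcpp is likely the better option when input validation matters.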
C++ has means to convert wide characters to a localized multibyte encoding on output or file write. Use a codecvt facet for that purpose.
You may use the standard std::codecvt_byname, or a non-standard codecvt_facet implementation.
#include <iostream>
#include <locale>
using namespace std;
int main()
{
    // codecvt_byname throws std::runtime_error if the named locale is not installed.
    locale utf8locale(locale(), new codecvt_byname<wchar_t, char, mbstate_t>("en_US.UTF-8"));
    wcout.imbue(utf8locale);
    wcout << L"Hello, wide to multibyte world!" << endl;
}
Beware that on some platforms codecvt_byname can only perform conversions for locales that are installed on the system. I therefore recommend searching Stack Overflow for "utf8 codecvt" and choosing from the many custom codecvt implementations referenced there.
EDIT: As the OP states that the string is already encoded, all he needs to do is remove the L and "w" prefixes from every token of his code.
Note that wide streams output only char * variables, so maybe you should try using the c_str() member function to convert a std::wstring and then output it to the file. Then it should probably work?
You should not use a UTF-8 encoded source file if you want to write portable code. Sorry.
std::wstring str = L"abcàdëefŸg€hhhhhhhµa";
(I am not sure whether this actually violates the standard, but I think it does. Even if it does not, you should avoid it to be safe.)
Yes, purely using std::ostream will not work. There are many ways to convert a wstring to UTF-8. My favorite is using the International Components for Unicode. It's a big lib, but it's great. You get a lot of extras and things you might need in the future.
I had the same problem some time ago, and wrote down the solution I found on my blog. You might want to check it out to see if it helps, especially the function wstring_to_utf8.
http://pileborg.org/b2e/blog5.php/2010/06/13/unicode-utf-8-and-wchar_t
From my experience of working with different character encodings, I would recommend that you only deal with UTF-8 at load and save time. You're in for a world of pain if you try to store the internal representation in UTF-8, since a single character can take anything from 1 to 4 bytes. So simple operations like strlen require looking at every byte to determine the length, rather than reading it from the size of the allocated buffer (although you can optimize by looking only at the first byte of each character sequence, e.g. 00..7f is a single-byte char, c2..df indicates a 2-byte char, etc.).
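To make the lead-byte optimization concrete, here is a small sketch of counting characters in a UTF-8 string by inspecting only the first byte of each sequence. The function name utf8_length is made up for this example, and it assumes the input is well-formed UTF-8; stray or invalid bytes are simply counted as one character each.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts the code points in a UTF-8 string by looking
// only at the lead byte of each sequence. Assumes well-formed input.
std::size_t utf8_length(const std::string& s)
{
    std::size_t count = 0;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t step = 1;                               // 00..7f, or a stray byte
        if (lead >= 0xC2 && lead <= 0xDF)      step = 2;    // 2-byte sequence
        else if (lead >= 0xE0 && lead <= 0xEF) step = 3;    // 3-byte sequence
        else if (lead >= 0xF0 && lead <= 0xF4) step = 4;    // 4-byte sequence
        i += step;
        ++count;
    }
    return count;
}
```

This is still linear in the number of characters, whereas the size of a UTF-16 or UTF-32 buffer gives the number of code units immediately.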
People quite often say 'Unicode strings' when they mean UTF-16, and on Windows a wchar_t is a fixed 2 bytes. On Windows I think wchar_t is, at least with older compiler settings, simply:
typedef unsigned short wchar_t;
The full 4-byte UTF-32 representation is rarely required and very wasteful. Here is what the Unicode Standard (5.0) has to say on it:
"On average more than 99% of all UTF-16 is expressed using single code units... UTF-16 provides the right mix of compact size with the ability to handle the occasional character outside the BMP"
In short, use wchar_t as your internal representation and do conversions when loading and saving (and don't worry about full Unicode unless you know you need it).
With regard to performing the actual conversion have a look at the ICU project:
http://site.icu-project.org/