What is the correct way to read Unicode files line by line in C++?
I am trying to read a file saved as Unicode (LE) by Windows Notepad.
Suppose the file contains simply the characters A and B on separate lines.
In reading the file byte by byte, I see the following byte sequence (hex) :
FF FE 41 00 0D 00 0A 00 42 00 0D 00 0A 00
So: 2-byte BOM, 2-byte 'A', 2-byte CR, 2-byte LF, 2-byte 'B', 2-byte CR, 2-byte LF.
I tried reading the text file using the following code:
std::wifstream file("test.txt");
file.seekg(2); // skip BOM
std::wstring A_line;
std::wstring B_line;
getline(file,A_line); // I get "A"
getline(file,B_line); // I get "\0B"
I get the same results using the >> operator instead of getline:
file >> A_line;
file >> B_line;
It appears that either the CR is being consumed as only a single byte, or CR NUL LF is being consumed but not the trailing high-order NUL. I would expect wifstream in text mode to read the 2-byte CR and the 2-byte LF.
What am I doing wrong? It does not seem right that one should have to read a text file byte by byte in binary mode just to parse the newlines.
std::wifstream exposes the wide character set to your program, which is typically UCS-2 on Windows and UTF-32 on Unix, but it assumes that the input file still uses narrow characters. If you want it to use wide characters on disk as well, you need a std::codecvt<wchar_t, wchar_t> facet.
You should just be able to find your compiler's implementation of std::codecvt<char, char> (which is also a non-converting code conversion facet) and change the chars to wchar_ts.