Nasty unicode and C++: Easy way to read ASCII/UTF-8/UTF-16 BE/LE text file

Developer https://www.devze.com 2022-12-16 13:26 Source: web
Sorry if this question is naive or has been asked thousands of times, but I spent a few hours googling it and could not find an answer.

I want to read in a text file that can be any of these: ASCII, UTF-8, UTF-16 BE, or UTF-16 LE. I assume that if the file is Unicode, then a BOM is always present.

Is there any automatic way (STL, Boost, or something else) to use a file stream or anything similar to read the file line by line, without checking BOMs myself, and always get UTF-8 to put into a std::string?

In this project I am using Windows only. It would also be good to know how to solve it for other platforms.

Thanks in advance!


libiconv
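For illustration, here is a minimal sketch of how iconv could be used once the source encoding is already known. The helper name `to_utf8` is hypothetical, and the encoding names accepted by `iconv_open` (such as "UTF-16LE") depend on the iconv implementation installed. Some older libiconv builds declare the input pointer as `const char**`, so the cast may need adjusting.

```cpp
#include <iconv.h>

#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert `input`, encoded as `from_encoding`,
// to a UTF-8 std::string using the POSIX iconv interface.
std::string to_utf8(const std::string& input, const char* from_encoding) {
    iconv_t cd = iconv_open("UTF-8", from_encoding);
    if (cd == (iconv_t)-1) {
        throw std::runtime_error("encoding not supported by this iconv");
    }

    std::string out;
    char buf[4096];
    // iconv does not modify the input bytes; the cast only satisfies its signature.
    char* in_ptr = const_cast<char*>(input.data());
    std::size_t in_left = input.size();

    while (in_left > 0) {
        char* out_ptr = buf;
        std::size_t out_left = sizeof buf;
        std::size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
        out.append(buf, sizeof buf - out_left);
        // E2BIG only means the output buffer filled up, so keep looping.
        if (rc == (std::size_t)-1 && errno != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("invalid or truncated input");
        }
    }

    iconv_close(cd);
    return out;
}
```

Note that this only performs the conversion. The caller still has to decide which encoding name to pass in, for example by looking at the BOM.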


BOMs are often not present in UTF-8 files. As a consequence, you can't know if a file is ASCII or UTF-8 until after you have read the data and found a byte which isn't ASCII.
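As a rough sketch of what this means in practice, the check below looks for the three BOMs relevant to the question and, separately, tests whether the data is pure ASCII. The names `Bom`, `detect_bom`, and `is_ascii` are made up for this example, and UTF-32 (whose little-endian BOM also starts with FF FE) is deliberately ignored.

```cpp
#include <cstddef>
#include <string>

enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Inspect the first bytes of the raw file contents for a byte order mark.
// Assumption: UTF-32 is out of scope, so FF FE is always treated as UTF-16 LE.
Bom detect_bom(const std::string& bytes) {
    auto b = [&](std::size_t i) { return static_cast<unsigned char>(bytes[i]); };
    if (bytes.size() >= 3 && b(0) == 0xEF && b(1) == 0xBB && b(2) == 0xBF) return Bom::Utf8;
    if (bytes.size() >= 2 && b(0) == 0xFF && b(1) == 0xFE) return Bom::Utf16LE;
    if (bytes.size() >= 2 && b(0) == 0xFE && b(1) == 0xFF) return Bom::Utf16BE;
    return Bom::None;
}

// True if every byte is below 0x80, i.e. the data is plain ASCII.
// This can only be answered after scanning the whole buffer.
bool is_ascii(const std::string& bytes) {
    for (unsigned char c : bytes) {
        if (c >= 0x80) return false;
    }
    return true;
}
```

When `detect_bom` returns `Bom::None` and `is_ascii` returns false, the file is in some 8-bit or multi-byte encoding that still has to be identified by other means.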

Furthermore, as you are on Windows, do you intend to handle ISO-8859-1 and Windows-1252 as well? The latter is often the default for files from things like Notepad and WordPad. In these cases, things are even worse: one can only distinguish heuristically between such encodings, other encodings, and UTF-8.
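One common heuristic, sketched here as an assumption rather than a complete solution, is to check whether the bytes form well-formed UTF-8 and to fall back to a legacy code page such as Windows-1252 only if they do not. The function name `is_valid_utf8` is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Returns true if `bytes` is a well-formed UTF-8 sequence.
// Text in Windows-1252 or ISO-8859-1 that contains non-ASCII characters
// rarely passes this check, which is what makes the heuristic useful.
bool is_valid_utf8(const std::string& bytes) {
    // Smallest code point that legitimately needs a sequence of each length.
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::size_t i = 0;
    const std::size_t n = bytes.size();
    while (i < n) {
        unsigned char c = static_cast<unsigned char>(bytes[i]);
        if (c < 0x80) { ++i; continue; }  // single-byte ASCII

        std::size_t len;
        unsigned long cp;
        if ((c & 0xE0) == 0xC0)      { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return false;            // stray continuation or invalid lead byte

        if (i + len > n) return false;  // sequence cut off at end of data
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char cc = static_cast<unsigned char>(bytes[i + k]);
            if ((cc & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (cc & 0x3F);
        }
        // Reject overlong forms, UTF-16 surrogates, and values past U+10FFFF.
        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return false;
        }
        i += len;
    }
    return true;
}
```

A result of true does not prove the file is UTF-8, since pure ASCII passes too and some legacy-encoded byte sequences happen to be valid UTF-8. It remains a guess.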

The ICU library has a character set detection system that you can use to guess the likely character encoding of a file. I do not believe that iconv has such a function.

ICU is widely available and is typically already installed on Mac and Linux but, alas, not on Windows. A similar detection routine might be available in the Win32 API as well.
