
Determine input encoding by examining the input bytes

开发者 https://www.devze.com 2022-12-16 19:34 (source: web)
I'm getting console input from the user and want to encode it to UTF-8. My understanding is C++ does not have a standard encoding for input streams, and that it instead depends on the compiler, the runtime environment, localization, and what not.

How can I determine the input encoding by examining the bytes of the input?


In general, you can't. If I shoot a stream of randomly generated bytes at your app, how can it determine their "encoding"? You simply have to specify that your application accepts certain encodings, or make the assumption that what the OS hands you will be suitably encoded.


Generally, checking whether input is UTF is a matter of heuristics: there's no definitive algorithm that will give you a yes/no answer. The more complex the heuristic, the fewer false positives/negatives you will get, but there is no "sure" way.

For an example of such heuristics you can check out this library: http://utfcpp.sourceforge.net/

// needs <fstream>, <iterator>, and "utf8.h" from utfcpp
bool valid_utf8_file(const char* file_name)
{
    std::ifstream ifs(file_name, std::ios::binary); // binary: don't translate bytes
    if (!ifs)
        return false; // even better, throw here

    std::istreambuf_iterator<char> it(ifs.rdbuf());
    std::istreambuf_iterator<char> eos;

    return utf8::is_valid(it, eos);
}

You can either use it, or check its sources to see how they have done it.
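To see roughly what such a validity check does under the hood, here is a minimal hand-rolled sketch (not utfcpp's actual implementation) that verifies the lead/continuation byte structure of UTF-8:

```cpp
#include <cstddef>
#include <string>

// Minimal hand-rolled UTF-8 structure check. Illustrative only: unlike a
// full validator such as utfcpp's utf8::is_valid, it does not reject
// overlong encodings or surrogate code points.
bool looks_like_utf8(const std::string& s)
{
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t extra;
        if      (b < 0x80)           extra = 0; // ASCII
        else if ((b & 0xE0) == 0xC0) extra = 1; // 2-byte sequence
        else if ((b & 0xF0) == 0xE0) extra = 2; // 3-byte sequence
        else if ((b & 0xF8) == 0xF0) extra = 3; // 4-byte sequence
        else return false;                      // invalid lead byte
        if (i + extra >= s.size())
            return false;                       // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k)
            if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80)
                return false;                   // bad continuation byte
        i += extra + 1;
    }
    return true;
}
```

Note that plain ASCII passes this check too, which is correct: ASCII is valid UTF-8, and that ambiguity is exactly why detection is heuristic.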


Use the built-in operating system means. Those vary from one OS to another. On Windows, it's always better to use the WideChar APIs and not think about encoding at all.
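Once you have wide-character input (on Windows, e.g. from ReadConsoleW), converting it to UTF-8 is mechanical. A portable sketch using the standard codecvt facility (deprecated since C++17 but still widely available):

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Convert a wide string to UTF-8. On Linux wchar_t holds UTF-32; on
// Windows it is 16-bit, and for full UTF-16 (surrogate pairs) you would
// use std::codecvt_utf8_utf16 or WideCharToMultiByte(CP_UTF8, ...) instead.
std::string to_utf8(const std::wstring& ws)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.to_bytes(ws);
}
```

The point is that once input arrives as wide characters its encoding is known, so no byte-sniffing is needed at all.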

And if your input comes from a file, as opposed to a real console, then all bets are off.


Jared Oberhaus answered this well on a related question specific to Java.

Basically, there are a few steps you can take to make a reasonable guess, but ultimately it's just guesswork without an explicit indication (hence the (in)famous BOM marker in UTF-8 files).
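The one non-heuristic step is checking for a BOM at the start of the data. A sketch (illustrative function name) of the standard BOM signatures:

```cpp
#include <string>

// Identify an encoding *only* when the data starts with a BOM. Returns
// an empty string when no BOM is present, which is the common case for
// UTF-8, so absence of a BOM proves nothing.
std::string bom_encoding(const std::string& data)
{
    // UTF-32 must be tested before UTF-16: FF FE 00 00 starts with FF FE.
    if (data.size() >= 4 && data.compare(0, 4, std::string("\xFF\xFE\x00\x00", 4)) == 0)
        return "UTF-32LE";
    if (data.size() >= 4 && data.compare(0, 4, std::string("\x00\x00\xFE\xFF", 4)) == 0)
        return "UTF-32BE";
    if (data.size() >= 3 && data.compare(0, 3, "\xEF\xBB\xBF") == 0)
        return "UTF-8";
    if (data.size() >= 2 && data.compare(0, 2, "\xFF\xFE") == 0)
        return "UTF-16LE";
    if (data.size() >= 2 && data.compare(0, 2, "\xFE\xFF") == 0)
        return "UTF-16BE";
    return ""; // no BOM: the guesswork begins
}
```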


As has already been said in response to the question John Weldon has pointed to, there are a number of libraries that do character-encoding recognition. You could also take a look at the source of the Unix file command and see what tests it uses to determine file encoding. From the man page of file:

ASCII, ISO-8859-x, non-ISO 8-bit extended-ASCII character sets (such as those used on Macintosh and IBM PC systems), UTF-8-encoded Unicode, UTF-16-encoded Unicode, and EBCDIC character sets can be distinguished by the different ranges and sequences of bytes that constitute printable text in each set.
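The byte-range tests the man page describes can be sketched as a crude classifier (purely illustrative; real detectors like file(1) apply many more tests):

```cpp
#include <cstddef>
#include <string>

// Crude heuristic in the spirit of file(1): classify a buffer by the
// byte ranges and patterns it contains. The trailing "?" marks guesses.
std::string guess_encoding(const std::string& data)
{
    bool all_ascii = true;
    std::size_t nul_even = 0, nul_odd = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(data[i]);
        if (b == 0)
            (i % 2 == 0 ? nul_even : nul_odd)++;
        if (b >= 0x80)
            all_ascii = false;
    }
    // Many NULs in one column of byte pairs suggests UTF-16 holding
    // mostly ASCII-range characters.
    if (data.size() >= 4 && nul_odd  > data.size() / 4) return "UTF-16LE?";
    if (data.size() >= 4 && nul_even > data.size() / 4) return "UTF-16BE?";
    if (all_ascii) return "ASCII";
    return "8-bit (UTF-8 or extended ASCII)";
}
```

A real detector would then run a UTF-8 validity check on the 8-bit case and fall back to extended-ASCII / ISO-8859-x frequency statistics.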

PCRE provides a function to test whether a given string is entirely valid UTF-8.

