How can I identify different encodings of files without the use of a BOM, when the file begins with non-ASCII characters?

https://www.devze.com 2023-02-25 05:30 Source: web
I have a problem identifying the encoding of a file without a BOM, particularly when the file begins with non-ASCII characters.

I found the following two topics about identifying encodings for files:

  • How can I identify different encodings without the use of a BOM?

  • Java: Readers and Encodings

I created the following class to detect the encoding of a file (e.g. UTF-8, UTF-16, UTF-32, UTF-16 with no BOM):

public class UnicodeReader extends Reader {
    private static final int BOM_SIZE = 4;
    private final InputStreamReader reader;

    /**
     * Construct a UnicodeReader.
     * @param in Input stream.
     * @param defaultEncoding Default encoding to be used if a BOM is not found,
     * or <code>null</code> to use the system default encoding.
     * @throws IOException If an I/O error occurs.
     */
    public UnicodeReader(InputStream in, String defaultEncoding) throws IOException {
        byte[] bom = new byte[BOM_SIZE];
        String encoding;
        int unread;
        PushbackInputStream pushbackStream = new PushbackInputStream(in, BOM_SIZE);
        int n = pushbackStream.read(bom, 0, bom.length);

        // Read ahead four bytes and check for BOM marks. The UTF-32 checks
        // must come before the UTF-16 checks, because the UTF-32LE BOM
        // (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
        if ((bom[0] == (byte) 0xEF) && (bom[1] == (byte) 0xBB) && (bom[2] == (byte) 0xBF)) {
            encoding = "UTF-8";
            unread = n - 3;
        } else if ((bom[0] == (byte) 0x00) && (bom[1] == (byte) 0x00)
                && (bom[2] == (byte) 0xFE) && (bom[3] == (byte) 0xFF)) {
            encoding = "UTF-32BE";
            unread = n - 4;
        } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)
                && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)) {
            encoding = "UTF-32LE";
            unread = n - 4;
        } else if ((bom[0] == (byte) 0xFE) && (bom[1] == (byte) 0xFF)) {
            encoding = "UTF-16BE";
            unread = n - 2;
        } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) {
            encoding = "UTF-16LE";
            unread = n - 2;
        } else {
            // No BOM detected, but the stream could still be UTF-16: text in
            // the ASCII range encoded as UTF-16 has a 0x00 byte in every
            // 16-bit code unit. Only inspect the bytes actually read.
            int zeroBytes = 0;
            for (int i = 0; i < n; i++) {
                if (bom[i] == (byte) 0x00) {
                    zeroBytes++;
                }
            }

            if (zeroBytes >= 2) {
                // In UTF-16BE the zero byte comes first; in UTF-16LE, second.
                if (bom[0] == (byte) 0x00) {
                    encoding = "UTF-16BE";
                } else {
                    encoding = "UTF-16LE";
                }
            } else {
                encoding = defaultEncoding;
            }
            unread = n;
        }

        // Push back everything that was read but is not part of a BOM.
        if (unread > 0) {
            pushbackStream.unread(bom, n - unread, unread);
        }

        // Use the detected encoding, or the platform default if none was given.
        if (encoding == null) {
            reader = new InputStreamReader(pushbackStream);
        } else {
            reader = new InputStreamReader(pushbackStream, encoding);
        }
    }

    public String getEncoding() {
        return reader.getEncoding();
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        return reader.read(cbuf, off, len);
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }
}

The above code works correctly in all cases except when the file has no BOM and begins with non-ASCII characters. In that case, the logic that checks whether the file might still be BOM-less UTF-16 does not work, and the encoding falls back to the default.
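To make the failure concrete, here is a small stand-alone sketch (class and method names are mine, not from the code above) showing that the zero-byte signal the heuristic relies on disappears as soon as the UTF-16 text starts with non-ASCII characters:

```java
import java.nio.charset.StandardCharsets;

public class HeuristicFailureDemo {
    // Counts how many of the first four bytes are zero -- the same
    // signal the UnicodeReader heuristic relies on.
    public static int zeroBytesInPrefix(byte[] data) {
        int count = 0;
        for (int i = 0; i < Math.min(4, data.length); i++) {
            if (data[i] == 0x00) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // ASCII text in UTF-16LE: every other byte is 0x00,
        // so the heuristic fires.
        byte[] ascii = "ab".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(zeroBytesInPrefix(ascii));  // 2

        // CJK text in UTF-16LE: code units like U+4E2D contain no zero
        // byte, so the heuristic sees nothing and falls through to the
        // default encoding.
        byte[] cjk = "中文".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(zeroBytesInPrefix(cjk));    // 0
    }
}
```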

Is there a way to detect the encoding of a file that has no BOM and begins with non-ASCII characters, especially a UTF-16 file with no BOM?

Thanks, any idea would be appreciated.


Generally speaking, there is no way to know encoding for sure if it is not provided.

You may guess UTF-8 from its characteristic byte patterns (a lead byte with the high bits set, followed by continuation bytes of the form 10xxxxxx), but it is still a guess.
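That pattern check can be sketched with the JDK's strict decoder; a clean decode is evidence for UTF-8, not proof (the class and method names below are illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Returns true if the bytes form well-formed UTF-8. A positive
    // result is still only a guess, since other encodings can also
    // produce byte sequences that happen to be valid UTF-8.
    public static boolean looksLikeUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(looksLikeUtf8("héllo".getBytes(StandardCharsets.UTF_8)));    // true
        System.out.println(looksLikeUtf8("héllo".getBytes(StandardCharsets.UTF_16LE))); // false
    }
}
```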

UTF-16 is a hard one; the same stream can often be parsed successfully as both BE and LE. Either way it will produce characters, though the resulting text may be meaningless.
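A minimal sketch of that ambiguity: the same bytes yield characters under both byte orders, and only one interpretation is meaningful:

```java
import java.nio.charset.StandardCharsets;

public class Utf16Ambiguity {
    public static void main(String[] args) {
        // Encode with no BOM; Java's UTF_16BE/UTF_16LE charsets
        // never write one.
        byte[] bytes = "BOM-less".getBytes(StandardCharsets.UTF_16BE);

        // Both interpretations decode into characters.
        String asBE = new String(bytes, StandardCharsets.UTF_16BE);
        String asLE = new String(bytes, StandardCharsets.UTF_16LE);

        System.out.println(asBE); // "BOM-less"
        // With the byte order swapped, each pair becomes an unrelated
        // (mostly CJK-range) character, e.g. U+4200 -- valid but meaningless.
        System.out.println(asLE);
    }
}
```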

Some code out there uses statistical analysis to guess the encoding from symbol frequencies, but that requires assumptions about the text (e.g. "this is a Mongolian text") and frequency tables that may not match the text. At the end of the day this remains just a guess, and cannot help in 100% of cases.


The best approach is not to try to implement this yourself. Instead, use an existing library; see Java : How to determine the correct charset encoding of a stream. For instance:

  • http://code.google.com/p/juniversalchardet/
  • http://jchardet.sourceforge.net/
  • http://site.icu-project.org/
  • http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding
  • http://docs.codehaus.org/display/GUESSENC/Home

It should be noted that the best that can be done is to guess the most likely encoding for the file. In the general case, it is impossible to be 100% sure that you've figured out the correct encoding; that is, the encoding that was used when the file was created.


I would say these third-party libraries also cannot identify the encoding of the file I encountered [...] they could be improved to meet my requirement.

Alternatively, you could recognize that your requirement is exceedingly hard to meet ... and change it; e.g.

  • restrict yourself to a certain set of encodings,
  • insist that the person who provides / uploads the file correctly state what its encoding (or primary language) is, and/or
  • accept that your system is going to get it wrong a certain percent of the time, and provide the means whereby someone can correct incorrectly stated / guessed encodings.

Face the facts: this is a THEORETICALLY UNSOLVABLE problem.


If you are certain that it is a valid Unicode stream, it must be UTF-8 if it has no BOM (since a BOM is neither required nor recommended), and if it does have one, then you know what it is.

If it is just some random encoding, there is no way to know for certain. The best you can hope for is to be wrong only sometimes, since it is impossible to guess correctly in all cases.

If you can limit the possibilities to a very small subset, it is possible to improve the odds of your guess being right.
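One way to exploit a restricted candidate set is trial decoding: try each candidate in priority order with a strict decoder and accept the first one that decodes cleanly. This is a sketch under stated assumptions (the names are mine); note that ISO-8859-1 decodes any byte sequence, so catch-all encodings must come last:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Optional;

public class CharsetGuesser {
    // Tries each candidate in priority order and returns the first
    // charset under which the bytes decode without error. Several
    // candidates may succeed, so the order encodes your prior belief.
    public static Optional<Charset> firstCleanDecode(byte[] data, List<Charset> candidates) {
        for (Charset cs : candidates) {
            try {
                cs.newDecoder()
                  .onMalformedInput(CodingErrorAction.REPORT)
                  .onUnmappableCharacter(CodingErrorAction.REPORT)
                  .decode(ByteBuffer.wrap(data));
                return Optional.of(cs);
            } catch (CharacterCodingException e) {
                // Malformed under this candidate; try the next one.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        byte[] data = "中文".getBytes(StandardCharsets.UTF_8);
        List<Charset> candidates =
                List.of(StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(firstCleanDecode(data, candidates).get()); // UTF-8
    }
}
```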

The only reliable way is to require the provider to tell you what they are providing. If you want complete reliability, that is your only choice. If you do not require reliability, then you guess, accepting that the guess will sometimes be wrong.

I have the feeling that you must be a Windows person, since the rest of us seldom have cause for BOMs in the first place. I know that I regularly deal with terabytes of text (on Macs, Linux, Solaris, and BSD systems), more than 99% of it UTF-8, and only twice have I come across BOM-laden text. I have heard Windows people get stuck with it all the time, though. If true, this may or may not make your choices easier.
