Inconsistencies in ASCII conversion of UTF-8 in Properties files

Developer https://www.devze.com 2022-12-09 18:04 Source: Internet
Does anyone know why native2ascii generates lower-case hex codes, while Properties.store() produces upper-case hex?

Example:

保存 is encoded as \u4FDD\u5B58 when using Properties.store(), but is encoded as \u4fdd\u5b58 when using native2ascii

Is there any way to control this?


I don't know why, but I do know that it doesn't matter to Java (it may matter a great deal to you). Unicode escapes may use upper-case or lower-case hex digits, and even mixed case is valid, so Java reads both forms identically.
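To illustrate that the case of the hex digits makes no difference when the file is read back, here is a minimal sketch that loads the same entry in upper, lower, and mixed case. The class name EscapeCaseDemo, the helper loadValue, and the key "save" are made up for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class EscapeCaseDemo {
    // Hypothetical helper: parses one properties line and returns the value of "save".
    static String loadValue(String line) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(line));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("save");
    }

    public static void main(String[] args) {
        // "\\u" in a Java string literal is a backslash followed by 'u',
        // so the loader receives the escape text rather than the decoded character.
        String upper = loadValue("save=\\u4FDD\\u5B58");
        String lower = loadValue("save=\\u4fdd\\u5b58");
        String mixed = loadValue("save=\\u4Fdd\\u5b58");
        System.out.println(upper.equals(lower) && lower.equals(mixed)); // prints true
    }
}
```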

The reason they're different is probably something as simple as they were written by two different people.

Is there any way to control it? Not easily from what I can see. It doesn't appear that native2ascii has any options to control that output (it allows options to control the JVM but not to that level).

Properties.store() uses an OutputStream (and Properties.load() uses an InputStream) which you could probably subclass to filter the Unicode escapes but that seems an awful lot of work for (what looks like) dubious benefit.
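For what it's worth, the subclassing idea could look roughly like the sketch below. It wraps the stream handed to Properties.store() and lower-cases the four hex digits that follow an unescaped backslash-u. LowerCaseEscapeStream and LowerCaseEscapeDemo are names invented for this example, and it has only been thought through for the output that store() itself produces.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical filter: lower-cases the hex digits of Unicode escapes as they pass through.
class LowerCaseEscapeStream extends FilterOutputStream {
    private boolean afterBackslash = false; // previous byte was an unpaired backslash
    private int hexLeft = 0;                // hex digits of the current escape still to come

    LowerCaseEscapeStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(int b) throws IOException {
        if (hexLeft > 0) {
            hexLeft--;
            if (b >= 'A' && b <= 'F') {
                b = b - 'A' + 'a';
            }
        } else if (afterBackslash) {
            // This byte is the escaped character; only 'u' starts a Unicode escape.
            afterBackslash = false;
            if (b == 'u') {
                hexLeft = 4;
            }
        } else if (b == '\\') {
            afterBackslash = true;
        }
        out.write(b);
    }
}

class LowerCaseEscapeDemo {
    // Stores the properties through the filter and returns the resulting text.
    static String storeLowerCase(Properties props) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (OutputStream out = new LowerCaseEscapeStream(bytes)) {
            props.store(out, null);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(bytes.toByteArray(), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("save", "\u4FDD\u5B58");
        System.out.print(storeLowerCase(props));
    }
}
```

Because the filter tracks whether a backslash is itself escaped, a literal backslash followed by the letter u in a value is left alone.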

Perhaps if you could tell us why you need this, there may be another way.

Update 1:

One thing that you could do is to pass the native2ascii output through a filter which turns the Unicode escape sequences into uppercase. The following code, ucunicode.c, should be able to do this, although I've only given it cursory testing. Simply execute:

native2ascii inputFile | ucunicode

and you should see the likes of \u00EF\u00BB\u00BF instead of \u00ef\u00bb\u00bf.

#include <stdio.h>

int main (void) {
    int count = 0;     // used for converting four hex digits after "\u".
    int chminus2 = -1; // character from two passes ago.
    int chminus1 = -1; // character from one pass ago.
    int ch;            // character for this pass.

    // Standard filter loop.

    while ((ch = getc (stdin)) != EOF) {
        if (count > 0) {
            // If processing Unicode escape sequence, uppercase letters.
            // Decrement only while active so count never drops below zero.

            count--;
            putchar (((ch >= 'a') && (ch <= 'f')) ? ch - 'a' + 'A' : ch);
        } else {
            // Normal processing, detect escape sequence and flag it.

            if ((chminus2 != '\\') && (chminus1 == '\\') && (ch == 'u')) {
                count = 4;
            }

            // In any case, output the character.

            putchar (ch);
        }

        // Shift characters "left".

        chminus2 = chminus1;
        chminus1 = ch;
    }
    return 0;

}

There may be edge cases that this doesn't handle well. I'm pretty certain it will handle all the output native2ascii normally produces, but it may break on invalid input like \u1\\u0000. Since input like that would mean your native2ascii is broken, you'll need to debug it yourself. This is a good start, however.

Update 2:

Or, as a last-ditch solution, the OpenJDK project has the actual source files for native2ascii in jdk\src\share\classes\sun\tools\native2ascii\ (and just about everything else that's not encumbered by copyright) which you could bring down and compile yourself (GPL2 applies). The files are Main.java, A2NFilter.java and N2AFilter.java (and a couple of resource files). You'd simply have to change N2AFilter.java to call:

String hex = Integer.toHexString(buf[i]).toUpperCase();

instead of just:

String hex = Integer.toHexString(buf[i]);

In fact, by examining that source code, you can see that Properties.store() (in jdk/src/share/classes/java/util/Properties.java) uses the following functions to create its Unicode escapes:

private static final char[] hexDigit = {
    '0','1','2','3','4','5','6','7','8','9','A','B','C','D','E','F'
};
private static char toHex (int nibble) {
    return hexDigit[(nibble & 0xF)];
}

This explains why it generates upper case while native2ascii produces lower case.
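The difference between the two approaches can be reproduced in a few lines. This is only a sketch of the two styles, not the JDK code itself: HexCaseDemo, escapeUpper, and escapeLower are names made up here, and the zero padding in escapeLower is added for illustration.

```java
class HexCaseDemo {
    private static final char[] HEX_DIGIT = "0123456789ABCDEF".toCharArray();

    // Table-lookup style, as in the Properties snippet quoted above: always upper case.
    static String escapeUpper(char c) {
        StringBuilder sb = new StringBuilder("\\u");
        for (int shift = 12; shift >= 0; shift -= 4) {
            sb.append(HEX_DIGIT[(c >> shift) & 0xF]);
        }
        return sb.toString();
    }

    // Integer.toHexString style: the digits come out in lower case.
    static String escapeLower(char c) {
        String hex = Integer.toHexString(c);
        StringBuilder sb = new StringBuilder("\\u");
        for (int i = hex.length(); i < 4; i++) {
            sb.append('0'); // pad to four digits
        }
        return sb.append(hex).toString();
    }

    public static void main(String[] args) {
        char c = (char) 0x4FDD;
        System.out.println(escapeUpper(c)); // upper-case hex digits
        System.out.println(escapeLower(c)); // lower-case hex digits
    }
}
```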


I just hit on this, and I have a reason why it is annoying: I converted a (Unicode) .properties file to another (homebrewed XML) format, with an export function back to .properties. This function uses Properties.store(), while the original Unicode .properties file was converted by ant via native2ascii. I wanted to check that both produce a similar result, so I applied sort to each file and diff to the results. Most of the differing lines are in fact due to these case differences in the \u escapes. (I think I'll use a quick sed script to convert the case of one of the files.)


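So, here is our sed script (for GNU sed with -E, since \L, \U and \E are GNU extensions): s/\\u([0-9A-Fa-f]{4})/\\u\L\1\E/g changes the escapes to lowercase, and s/\\u([0-9A-Fa-f]{4})/\\u\U\1\E/g changes them to uppercase. The character class has to cover both cases, otherwise the uppercase variant never matches the lowercase digits it is supposed to convert. I had to change a bit more, since Properties.store() also escapes more characters, such as ! to \!, = to \=, and : to \:.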
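If the goal is only to check that two files carry the same data, another option is to skip the textual diff and compare the loaded values instead, which sidesteps both the hex case and the extra escapes. A rough sketch follows; PropertiesDiff and its methods are hypothetical names, and it assumes both files are readable by Properties.load().

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Objects;
import java.util.Properties;
import java.util.SortedSet;
import java.util.TreeSet;

class PropertiesDiff {
    // Parses properties from in-memory text (handy for quick experiments).
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Loads an ASCII-escaped .properties file from disk.
    static Properties loadFile(String path) throws IOException {
        Properties props = new Properties();
        try (InputStream in = new FileInputStream(path)) {
            props.load(in);
        }
        return props;
    }

    // Returns, in sorted order, the keys that are missing on one side or have different values.
    static SortedSet<String> differingKeys(Properties a, Properties b) {
        SortedSet<String> keys = new TreeSet<>();
        keys.addAll(a.stringPropertyNames());
        keys.addAll(b.stringPropertyNames());
        keys.removeIf(k -> Objects.equals(a.getProperty(k), b.getProperty(k)));
        return keys;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: PropertiesDiff first.properties second.properties");
            return;
        }
        System.out.println(differingKeys(loadFile(args[0]), loadFile(args[1])));
    }
}
```

An empty set means the two files decode to the same keys and values, whatever the case of their escapes.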
