Convert ISO8859 String to UTF8? ÄÖÜ => ÃÃ why?

Developer https://www.devze.com 2023-03-09 11:16 Source: Internet
What's the problem with this code? I created an ISO-8859 String, so most of the ÄÖÜ characters come out as cryptic output. That's fine. But how do I convert them back to normal characters (UTF-8 or something)?

    String s = new String("Üü?öäABC".getBytes(), "ISO-8859-15");

    System.out.println(s);
    //ÃÃŒ?öÀABC => ok(?)
    System.out.println(new String(s.getBytes(), "ISO-8859-15"));
    //ÃÂÃÅ?öÃâ¬ABC => ok(?)
    System.out.println(new String(s.getBytes(), "UTF-8"));
    //ÃÃŒ?öÀABC => huh?


A construct such as new String("Üü?öäABC".getBytes(), "ISO-8859-15"); is almost always an error.

What you're doing here is taking a String object, getting the corresponding byte[] in the platform default encoding and re-interpreting it as ISO-8859-15 to convert it back to a String.

If the platform default encoding happens to be ISO-8859-15 (or near enough to make no difference for this particular String, for example ISO-8859-1), then it is a no-op (i.e. it has no real effect).

In all other cases it will most likely destroy the String.
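A minimal sketch of that re-interpretation, assuming the platform default is UTF-8 (made explicit here so the result does not depend on the machine). The class and method names are hypothetical, and the non-ASCII characters are written as Unicode escapes so the source-file encoding cannot interfere:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encodes text with one charset and decodes the bytes with another.
    // new String(s.getBytes(), "ISO-8859-15") does exactly this, with the
    // platform default charset playing the role of encodeWith.
    public static String reinterpret(String text, Charset encodeWith, Charset decodeWith) {
        return new String(text.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        Charset latin9 = Charset.forName("ISO-8859-15");
        String original = "\u00FC\u00F6ABC"; // "üöABC"

        // Same charset in both directions: a no-op.
        System.out.println(reinterpret(original, latin9, latin9).equals(original)); // true

        // UTF-8 bytes read as ISO-8859-15: each umlaut turns into two characters.
        String garbled = reinterpret(original, StandardCharsets.UTF_8, latin9);
        System.out.println(garbled.length()); // 7, the original has 5
    }
}
```

The two bytes 0xC3 0xBC that UTF-8 uses for "ü" are read as the two ISO-8859-15 characters "Ã" and "Œ", which is the pattern visible in the question's output.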

If you try to "fix" a String, then you're probably too late: if you have to use a specific encoding to read data, then you should use it at the point where binary data is converted to String data. For example if you read from an InputStream, you need to pass the correct encoding to the constructor of the InputStreamReader.
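As a rough illustration of decoding at the point of reading, the sketch below passes the charset to the InputStreamReader constructor. The class name, the helper method, and the in-memory stand-in for a file are all assumptions made for the example:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

public class ReadWithCharset {
    // Decodes the stream with the given charset at the moment bytes become chars.
    public static String readFirstLine(InputStream in, Charset charset) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a file saved as ISO-8859-15: 0xDC is 'Ü', 0xFC is 'ü'.
        byte[] fileBytes = { (byte) 0xDC, (byte) 0xFC, 0x41, 0x42, 0x43, 0x0A };
        String line = readFirstLine(new ByteArrayInputStream(fileBytes),
                Charset.forName("ISO-8859-15"));
        System.out.println(line.equals("\u00DC\u00FCABC")); // true
    }
}
```

For a real file, the same reader could wrap a FileInputStream instead of the byte array.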

Trying to fix the problem "after the fact" will be

  1. harder to do and
  2. often not even possible (because decoding a byte[] with the wrong encoding can be a destructive operation).
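To make the "destructive" part concrete, here is a small sketch (class and method names are hypothetical) that decodes ISO-8859-1 bytes with the wrong charset and then tries to get the bytes back. With UTF-8 the invalid bytes are replaced during decoding, so the original bytes cannot be recovered; ISO-8859-1 happens to map every byte to a character, so that particular round trip survives:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LossyDecodeDemo {
    // Decodes raw bytes with the given charset and encodes the result again.
    public static byte[] decodeThenEncode(byte[] raw, Charset charset) {
        return new String(raw, charset).getBytes(charset);
    }

    public static void main(String[] args) {
        byte[] raw = { (byte) 0xDC, (byte) 0xFC }; // "Üü" as ISO-8859-1 / ISO-8859-15 bytes

        // 0xDC 0xFC is not valid UTF-8, so decoding substitutes replacement characters.
        byte[] viaUtf8 = decodeThenEncode(raw, StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(raw, viaUtf8)); // false

        // ISO-8859-1 assigns a character to every byte value, so nothing is lost.
        byte[] viaLatin1 = decodeThenEncode(raw, StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(raw, viaLatin1)); // true
    }
}
```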


I hope this will solve your problem.

String readable = "äöüÄÖÜßáéíóúÁÉÍÓÚàèìòùÀÈÌÒÙñÑ";

try {
    String unreadable = new String(readable.getBytes("UTF-8"), "ISO-8859-15");
    // unreadable -> äöüÃÃÃÃáéíóúÃÃÃÃÃàèìòùÃÃÃÃÃñÃ
} catch (UnsupportedEncodingException e) {
    // handle error
}

And:

String unreadable = "äöüÃÃÃÃáéíóúÃÃÃÃÃàèìòùÃÃÃÃÃñÃ";

try {
    String readable = new String(unreadable.getBytes("ISO-8859-15"), "UTF-8");
    // readable -> äöüÄÖÜßáéíóúÁÉÍÓÚàèìòùÀÈÌÒÙñÑ
} catch (UnsupportedEncodingException e) {
    // ...
}


String s = new String("Üü?öäABC".getBytes(), "ISO-8859-15"); //bug

All this code does is corrupt data. It transcodes UTF-16 data to the system encoding (whatever that is), then takes those bytes, pretends they're valid ISO-8859-15, and transcodes them to UTF-16.

Then how to convert an input String like "ÃÃŒ?öÀABC" to normal? (if I know that the string is from an ISO8859 file).

The correct way to perform this operation would be like this:

byte[] iso859_15 = { (byte) 0xc3, (byte) 0xc3, (byte) 0xbc, 0x3f,
        (byte) 0xc3, (byte) 0xb6, (byte) 0xc3, (byte) 0xa4, 0x41, 0x42,
        0x43 };
String utf16 = new String(iso859_15, Charset.forName("ISO-8859-15"));

Strings in Java are always UTF-16. All other encodings must be represented using the byte type.

Now, if you use System.out to output the resultant string, that might not appear correctly, but that is a different transcoding issue. For example, the Windows console default encoding doesn't match the system encoding. The encoding used by System.out must match the encoding of the device receiving the data. You should also take care to ensure that you are reading your source files with the same encoding your editor is using.
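One way to control the output side is to wrap the stream in a PrintStream with an explicit encoding. The sketch below is only an illustration: the class and helper names are made up, and "UTF-8" for the terminal is an assumption that has to match the device actually receiving the data:

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ConsoleEncodingDemo {
    // Returns the bytes a PrintStream emits for the text under the named encoding.
    public static byte[] printedBytes(String text, String encoding) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            PrintStream out = new PrintStream(sink, true, encoding);
            out.print(text);
            out.flush();
            return sink.toByteArray();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + encoding, e);
        }
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String text = "\u00DC"; // "Ü"
        System.out.println(printedBytes(text, "UTF-8").length);       // 2
        System.out.println(printedBytes(text, "ISO-8859-15").length); // 1

        // Assumption: the terminal expects UTF-8. Use its real encoding instead.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(text);
    }
}
```

The same String produces different bytes depending on the stream's encoding, which is why a correct String can still look wrong on a console that expects something else.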

To understand how treatment of character data varies between languages, read this.


Here is an easy way with String output (I created a method to do this):

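// The method name was missing in the original post; "convert" is a placeholder.
public static String convert(String input) {
    String output = "";
    try {
        /* From ISO-8859-1 to UTF-8: repairs UTF-8 text that was wrongly decoded as ISO-8859-1 */
        output = new String(input.getBytes("ISO-8859-1"), "UTF-8");
        /* From UTF-8 to ISO-8859-1: the reverse direction. Use it instead of the line
           above, not in addition to it, otherwise it overwrites the first result. */
        // output = new String(input.getBytes("UTF-8"), "ISO-8859-1");
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }
    return output;
}

// Example
String input = "MÃºsica";
String output = convert(input); // "Música"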

it works!! :)


Java Strings are always stored internally as UTF-16 arrays (and as UTF-8 in the class file after compilation), so you can't simply interpret a string as if it were a byte array. If you want to create a byte array from a string in a certain encoding, you must first convert it into that encoding:

byte[] b = "Üü?öäABC".getBytes("ISO-8859-15");

System.out.println(new String(b, "ISO-8859-15")); // will be ok
System.out.println(new String(b, "UTF-8")); // will look garbled
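To see why the second line looks garbled, it can help to print the actual bytes each encoding produces. This is a small sketch with hypothetical names; the sample text is written with a Unicode escape so the source-file encoding does not matter:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingBytes {
    // Formats the encoded bytes of the text as space-separated hex pairs.
    public static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u00DCA"; // "ÜA"
        System.out.println(hex(s, Charset.forName("ISO-8859-15"))); // DC 41
        System.out.println(hex(s, StandardCharsets.UTF_8));         // C3 9C 41
        System.out.println(hex(s, StandardCharsets.UTF_16BE));      // 00 DC 00 41
    }
}
```

The single ISO-8859-15 byte DC is not a valid UTF-8 sequence on its own, so decoding it as UTF-8 cannot yield "Ü".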


This solution works for me; I hope it helps you.

String s1 = "l'Ã©pargne"; // UTF-8 bytes of "l'épargne" misread as ISO-8859-1
String s2 = new String(s1.getBytes("iso-8859-1"), "utf8");


I'd like to provide the extended set of characters in order to validate converted strings from ISO-8859-1 into utf-8.

@Test
public void testEnc() throws UnsupportedEncodingException {
    String isoString = "Ã¤Ã¶"; // UTF-8 bytes of "äö" misread as ISO-8859-1
    String utfString = new String(isoString.getBytes("ISO-8859-1"), "utf-8");
    boolean validConvertion = containsSpecialCharacter(utfString);
    assertTrue(validConvertion);
}

public boolean containsSpecialCharacter(String str) {
    String[] readable = new String[] { "Ï", "Ð", "Ñ", "Ò", "Ó", "Ô", "Õ", "Ö", "×", "Ø", "Ù", "Ú", "Û", "Ü", "Ý", "Þ", "ß",
            "à", "á", "â", "ã", "ä", "å", "æ", "ç", "è", "é", "ê", "ë", "ì", "í", "î", "ï", "ð", "ñ", "ò", "ó", "ô", "õ", "ö",
            "÷", "ø", "ù", "ú", "û", "ü", "ý", "þ", "ÿ" };
    for (String st : readable) {
        if (str.contains(st)) {
            return true;
        }
    }
    return false;
}