Regex and ISO-8859-1 charset in Java

Developer https://www.devze.com 2023-01-10 20:21 Source: Internet

I have some text encoded in ISO-8859-1, from which I extract some data using regex.

The problem is that the strings I get from the Matcher object are garbled: characters like "ÅÄÖ" come out scrambled.

How do I stop the regex library from scrambling my chars?

Edit: Here's some code:

private HttpResponse sendGetRequest(String url) throws ClientProtocolException, IOException
{
    HttpGet get = new HttpGet(url);
    return hclient.execute(get);
}
private static String getResponseBody(HttpResponse response) throws IllegalStateException, IOException
{
    InputStream input = response.getEntity().getContent();
    StringBuilder builder = new StringBuilder();
    int read;
    byte[] tmp = new byte[1024];

    while ((read = input.read(tmp))!=-1)
    {
        builder.append(new String(tmp), 0,read-1);
    }

    return builder.toString();
}
HttpResponse response = sendGetRequest(url);
String html = getResponseBody(response);
Matcher matcher = forum_pattern.matcher(html);
while(matcher.find()) // do stuff

This is probably the immediate cause of your problem, and it's definitely an error:

builder.append(new String(tmp), 0, read-1);

When you call one of the new String(byte[]) constructors that doesn't take a Charset, it uses the platform default encoding. Apparently, the default encoding on your platform is not ISO-8859-1. You should be able to get the charset name from the response headers so you can supply it to the constructor.
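As a minimal sketch of the difference an explicit charset makes, the following decodes the ISO-8859-1 bytes of the three letters from the question twice: once with the correct charset, and once with UTF-8 standing in for a mismatched platform default. The class and method names are hypothetical, and the bytes are hard-coded rather than read from a real HTTP response.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {
    // Decode the first `length` bytes with a charset chosen by the caller,
    // never with the platform default.
    public static String decode(byte[] data, int length, Charset charset) {
        return new String(data, 0, length, charset);
    }

    public static void main(String[] args) {
        // The three letters from the question in ISO-8859-1: one byte each.
        byte[] latin1 = {(byte) 0xC5, (byte) 0xC4, (byte) 0xD6};

        String right = decode(latin1, latin1.length, StandardCharsets.ISO_8859_1);
        // UTF-8 is only an assumed stand-in for "some other default encoding".
        String wrong = decode(latin1, latin1.length, StandardCharsets.UTF_8);

        System.out.println("explicit ISO-8859-1 is correct: "
                + right.equals("\u00C5\u00C4\u00D6"));
        System.out.println("mismatched charset is correct:  "
                + wrong.equals("\u00C5\u00C4\u00D6"));
    }
}
```

The same bytes produce different strings depending only on the charset argument, which is why the name taken from the response headers needs to reach the decoding step.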

But you shouldn't be using a String constructor for this anyway; the proper way is to use an InputStreamReader. If the encoding were one of the multi-byte ones like UTF-8, you could easily corrupt the data because a chunk of bytes happened to end in the middle of a character.
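A rough sketch of the reader-based approach, next to the chunk-by-chunk decoding it replaces. The helper names readAll and decodeInChunks are made up for illustration, and a ByteArrayInputStream stands in for the stream you would get from response.getEntity().getContent(). The 3-byte chunk size is deliberately tiny so that a chunk boundary falls inside a UTF-8 character.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReaderDecodeDemo {
    // Read the whole stream as text. The reader keeps any incomplete
    // multi-byte sequence and finishes it with the next bytes it receives.
    public static String readAll(InputStream input, Charset charset) throws IOException {
        StringBuilder builder = new StringBuilder();
        char[] buffer = new char[1024];
        try (Reader reader = new InputStreamReader(input, charset)) {
            int read;
            while ((read = reader.read(buffer)) != -1) {
                builder.append(buffer, 0, read);
            }
        }
        return builder.toString();
    }

    // The fragile approach: decode each fixed-size byte chunk in isolation.
    public static String decodeInChunks(byte[] data, int chunkSize, Charset charset) {
        StringBuilder builder = new StringBuilder();
        for (int offset = 0; offset < data.length; offset += chunkSize) {
            int length = Math.min(chunkSize, data.length - offset);
            builder.append(new String(data, offset, length, charset));
        }
        return builder.toString();
    }

    public static void main(String[] args) throws IOException {
        String text = "\u00C5\u00C4\u00D6";
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8); // two bytes per letter

        String viaReader = readAll(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);
        String viaChunks = decodeInChunks(utf8, 3, StandardCharsets.UTF_8);

        System.out.println("reader preserved text: " + viaReader.equals(text));
        System.out.println("chunks preserved text: " + viaChunks.equals(text));
    }
}
```

With a single-byte encoding such as ISO-8859-1 the chunked version happens to work, which is how this kind of bug can stay hidden until the encoding changes.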

In any case, never, ever use a new String(byte[]) constructor or a String.getBytes() method that doesn't accept a Charset parameter. Those methods should be deprecated, and should emit ferocious warnings when anyone uses them.
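The encoding direction follows the same rule. A small sketch, with hypothetical class and method names, showing that the charset passed to getBytes determines the bytes you get and that the same charset is needed to read them back:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitGetBytesDemo {
    // Encode and decode with one explicitly named charset.
    public static String roundTrip(String text, Charset charset) {
        byte[] encoded = text.getBytes(charset);
        return new String(encoded, charset);
    }

    public static void main(String[] args) {
        String text = "\u00C5\u00C4\u00D6";

        // Same string, different byte counts depending on the charset.
        int latin1Length = text.getBytes(StandardCharsets.ISO_8859_1).length;
        int utf8Length = text.getBytes(StandardCharsets.UTF_8).length;
        System.out.println("ISO-8859-1 bytes: " + latin1Length);
        System.out.println("UTF-8 bytes:      " + utf8Length);

        System.out.println("round trip intact: "
                + roundTrip(text, StandardCharsets.ISO_8859_1).equals(text));
    }
}
```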