I have some text encoded in ISO-8859-1, from which I then extract some data using regex.
The problem is that the strings I get from the matcher object are in the wrong encoding, scrambling chars like "ÅÄÖ".
How do I stop the regex library from scrambling my chars?
Edit: Here's some code:
private HttpResponse sendGetRequest(String url) throws ClientProtocolException, IOException
{
    HttpGet get = new HttpGet(url);
    return hclient.execute(get);
}

private static String getResponseBody(HttpResponse response) throws IllegalStateException, IOException
{
    InputStream input = response.getEntity().getContent();
    StringBuilder builder = new StringBuilder();
    int read;
    byte[] tmp = new byte[1024];
    while ((read = input.read(tmp))!=-1)
    {
        builder.append(new String(tmp), 0,read-1);
    }
    return builder.toString();
}

HttpResponse response = sendGetRequest(url);
String html = getResponseBody(response);
Matcher matcher = forum_pattern.matcher(html);
while(matcher.find()) // do stuff
This is probably the immediate cause of your problem, and it's definitely an error:
builder.append(new String(tmp), 0, read-1);
When you call one of the new String(byte[]) constructors that doesn't take a Charset, it uses the platform default encoding. Apparently, the default encoding on your platform is not ISO-8859-1. You should be able to get the charset name from the response headers so you can supply it to the constructor. Separately, the end index read-1 drops the last character of every chunk, because append(CharSequence, int, int) treats its end argument as exclusive.
But you shouldn't be using a String constructor for this anyway; the proper way is to use an InputStreamReader. If the encoding were one of the multi-byte ones like UTF-8, you could easily corrupt the data because a chunk of bytes happened to end in the middle of a character.
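As a rough illustration of the InputStreamReader approach, here is a minimal sketch. It assumes the charset is already known (ISO-8859-1 in the question; in practice it would come from the response headers). The class name ResponseDecoder and the method readAll are made up for this example and are not part of the original code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named only for this sketch.
class ResponseDecoder {

    // Reads the whole stream and decodes it with the given charset.
    // The reader does the byte-to-char conversion, so a multi-byte
    // character split across two reads is not corrupted.
    public static String readAll(InputStream input, Charset charset) throws IOException {
        StringBuilder builder = new StringBuilder();
        try (Reader reader = new InputStreamReader(input, charset)) {
            char[] buffer = new char[1024];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                // Append exactly the chars that were read in this pass.
                builder.append(buffer, 0, read);
            }
        }
        return builder.toString();
    }

    public static void main(String[] args) throws IOException {
        // The bytes for "ÅÄÖ" in ISO-8859-1, standing in for a response body.
        byte[] latin1 = {(byte) 0xC5, (byte) 0xC4, (byte) 0xD6};
        String decoded = readAll(new ByteArrayInputStream(latin1), StandardCharsets.ISO_8859_1);
        System.out.println(decoded.equals("\u00C5\u00C4\u00D6")); // prints "true"
    }
}
```

In the question's getResponseBody, the stream from response.getEntity().getContent() would take the place of the ByteArrayInputStream used here.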
In any case, never, ever use a new String(byte[]) constructor or a String.getBytes() method that doesn't accept a Charset parameter. Those methods should be deprecated, and should emit ferocious warnings when anyone uses them.
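To make the difference concrete, here is a small sketch of the Charset-taking overloads. The class and method names (ExplicitCharsetDemo, decodeLatin1, encodeLatin1) are invented for illustration; only the String constructor, getBytes, and StandardCharsets are real JDK API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, named only for this sketch.
class ExplicitCharsetDemo {

    // Decodes bytes as ISO-8859-1 regardless of the platform default.
    public static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Encodes text as ISO-8859-1 regardless of the platform default.
    public static byte[] encodeLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "ÅÄÖ" as ISO-8859-1 bytes.
        byte[] raw = {(byte) 0xC5, (byte) 0xC4, (byte) 0xD6};

        String text = decodeLatin1(raw);
        System.out.println(text.length());                          // prints "3"
        System.out.println(Arrays.equals(raw, encodeLatin1(text))); // prints "true"

        // The same bytes are not valid UTF-8, so decoding them as UTF-8
        // produces replacement characters instead of the original text.
        String wrong = new String(raw, StandardCharsets.UTF_8);
        System.out.println(text.equals(wrong));                     // prints "false"
    }
}
```

The last comparison is roughly what happens in the question when the platform default encoding differs from the encoding of the page.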
It's html from a website.
Use an HTML parser, and this problem and all potential future problems will disappear.
I can recommend picking Jsoup for the job.
See also:
- Regular Expressions - Now you have two problems
- Parsing HTML - The Cthulhu way
- Pros and cons of HTML parsers in Java