I am having weird character encoding issues with a JSON array that is grabbed from a web page. The server is sending back this header:
Content-Type: text/javascript; charset=UTF-8
I can also view the JSON output in Firefox or any other browser, and the Unicode characters display properly. The response sometimes contains words from another language with accented characters and the like. However, I get strange question marks when I download it and store it in a String in Java. Here is my code:
HttpParams params = new BasicHttpParams();
HttpProtocolParams.setVersion(params, HttpVersion.HTTP_1_1);
HttpProtocolParams.setContentCharset(params, "utf-8");
params.setBooleanParameter("http.protocol.expect-continue", false);
HttpClient httpclient = new DefaultHttpClient(params);
HttpGet httpget = new HttpGet("http://www.example.com/json_array.php");
HttpResponse response;
try {
response = httpclient.execute(httpget);
if(response.getStatusLine().getStatusCode() == 200){
// Connection was established. Get the content.
HttpEntity entity = response.getEntity();
// If the response does not enclose an entity, there is no need
// to worry about connection release
if (entity != null) {
// A Simple JSON Response Read
InputStream instream = entity.getContent();
String jsonText = convertStreamToString(instream);
Toast.makeText(getApplicationContext(), "Response: "+jsonText, Toast.LENGTH_LONG).show();
}
}
} catch (MalformedURLException e) {
Toast.makeText(getApplicationContext(), "ERROR: Malformed URL - "+e.getMessage(), Toast.LENGTH_LONG).show();
e.printStackTrace();
} catch (IOException e) {
Toast.makeText(getApplicationContext(), "ERROR: IO Exception - "+e.getMessage(), Toast.LENGTH_LONG).show();
e.printStackTrace();
} catch (JSONException e) {
Toast.makeText(getApplicationContext(), "ERROR: JSON - "+e.getMessage(), Toast.LENGTH_LONG).show();
e.printStackTrace();
}
private static String convertStreamToString(InputStream is) {
/*
* To convert the InputStream to a String we use the BufferedReader.readLine()
* method. We iterate until the BufferedReader returns null, which means
* there is no more data to read. Each line is appended to a StringBuilder
* and the result is returned as a String.
*/
StringBuilder sb = new StringBuilder();
try {
// Created inside the try block so that reader is always initialized before use.
BufferedReader reader = new BufferedReader(new InputStreamReader(is, "UTF-8"));
String line;
while ((line = reader.readLine()) != null) {
sb.append(line).append('\n');
}
} catch (IOException e) {
// Also covers UnsupportedEncodingException, which is a subclass of IOException.
e.printStackTrace();
} finally {
try {
is.close();
} catch (IOException e) {
e.printStackTrace();
}
}
return sb.toString();
}
As you can see, I am specifying UTF-8 on the InputStreamReader but every time I view the returned JSON text via Toast it has strange question marks. I am thinking that I need to send the InputStream to a byte[] instead?
Thanks in advance for any help.
Try this:
if (entity != null) {
// A Simple JSON Response Read
// InputStream instream = entity.getContent();
// String jsonText = convertStreamToString(instream);
String jsonText = EntityUtils.toString(entity, HTTP.UTF_8);
// ... toast code here
}
@Arhimed's answer is the solution. But I cannot see anything obviously wrong with your convertStreamToString code.
My guesses are:
- The server is putting a UTF Byte Order Mark (BOM) at the start of the stream. The standard Java UTF-8 character decoder does not remove the BOM, so the chances are that it would end up in the resulting String. (However, the code for EntityUtils doesn't seem to do anything with BOMs either.)
- Your convertStreamToString is reading the character stream a line at a time and reassembling it using a hard-wired '\n' as the end-of-line marker. If you are going to write that to an external file or application, you should probably be using a platform-specific end-of-line marker.
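To check the first guess, the decoded text can be inspected for a leading U+FEFF character. The following is only a minimal sketch: the class name BomStripper and the method stripBom are hypothetical helpers, not part of the original code or of any library.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (an assumption, not from the original code): removes a leading BOM. */
final class BomStripper {
    // A UTF-8 BOM (bytes EF BB BF) decodes to this single character.
    private static final char BOM = '\uFEFF';

    private BomStripper() {}

    /** Returns the text without a leading BOM character, or the text unchanged if there is none. */
    static String stripBom(String text) {
        if (text != null && !text.isEmpty() && text.charAt(0) == BOM) {
            return text.substring(1);
        }
        return text;
    }

    /** Decodes the bytes as UTF-8 and then drops a leading BOM, if present. */
    static String decodeUtf8WithoutBom(byte[] bytes) {
        return stripBom(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

In the question's code this would be applied to the result of convertStreamToString before the text is parsed or displayed.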
The problem is simply that your convertStreamToString does not honor the encoding set in the HttpResponse. If you look inside EntityUtils.toString(entity, HTTP.UTF_8), you will see that EntityUtils first checks whether an encoding is set in the HttpResponse and, if there is one, uses that encoding.
It only falls back to the encoding passed as the parameter (in this case HTTP.UTF_8) if no encoding is set in the entity.
So you can say that HTTP.UTF_8 is passed as the parameter but never gets used, because it is the wrong encoding. Here is an update to your code that uses the helper method from EntityUtils.
HttpEntity entity = response.getEntity();
String charset = getContentCharSet(entity);
InputStream instream = entity.getContent();
String jsonText = convertStreamToString(instream,charset);
private static String getContentCharSet(final HttpEntity entity) throws ParseException {
if (entity == null) {
throw new IllegalArgumentException("HTTP entity may not be null");
}
String charset = null;
if (entity.getContentType() != null) {
HeaderElement values[] = entity.getContentType().getElements();
if (values.length > 0) {
NameValuePair param = values[0].getParameterByName("charset");
if (param != null) {
charset = param.getValue();
}
}
}
return TextUtils.isEmpty(charset) ? HTTP.UTF_8 : charset;
}
private static String convertStreamToString(InputStream is, String encoding) {
/*
* To convert the InputStream to a String we use the
* BufferedReader.readLine() method. We iterate until the BufferedReader
* returns null, which means there is no more data to read. Each line is
* appended to a StringBuilder and the result is returned as a String.
*/
StringBuilder sb = new StringBuilder();
try {
// Created inside the try block so that reader is always initialized before use.
BufferedReader reader = new BufferedReader(new InputStreamReader(is, encoding));
String line;
while ((line = reader.readLine()) != null) {
sb.append(line).append('\n');
}
} catch (IOException e) {
// Also covers UnsupportedEncodingException, which is a subclass of IOException.
e.printStackTrace();
} finally {
try {
is.close();
} catch (IOException e) {
e.printStackTrace();
}
}
return sb.toString();
}
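To illustrate why the charset matters, here is a small sketch of one plausible way the question marks can arise: bytes written in one encoding are decoded with another. The class WrongCharsetDemo and its method are hypothetical names used only for this illustration, and ISO-8859-1 is an assumed example of a mismatched encoding, not something the original server is known to send.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical demo class (an assumption for illustration only). */
final class WrongCharsetDemo {
    private WrongCharsetDemo() {}

    /**
     * Encodes the text as ISO-8859-1 and then decodes those bytes as UTF-8.
     * Accented characters become bytes that are not valid UTF-8, so the decoder
     * substitutes the replacement character U+FFFD, which many fonts render as
     * a question mark.
     */
    static String decodeLatin1BytesAsUtf8(String text) {
        byte[] latin1Bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1Bytes, StandardCharsets.UTF_8);
    }
}
```

Plain ASCII text survives this round trip unchanged, which is why only the accented words look broken.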
Arhimed's answer is correct. However, the same result can be achieved simply by sending an additional header in the HTTP request:
Accept-Charset: utf-8
No need to remove anything or use any other library.
For example,
GET / HTTP/1.1
Host: www.website.com
Connection: close
Accept: text/html
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.10 Safari/537.36
DNT: 1
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8
Accept-Charset: utf-8
Most probably your request doesn't have any Accept-Charset header.
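A minimal sketch of setting that header from Java follows. It uses java.net.HttpURLConnection rather than the HttpClient from the question so that it stays self-contained; the class AcceptCharsetRequest and the method prepare are hypothetical names. Whether the server actually honors Accept-Charset depends on the server, so treat this as something to try rather than a guaranteed fix.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

/** Hypothetical helper (an assumption for illustration only). */
final class AcceptCharsetRequest {
    private AcceptCharsetRequest() {}

    /**
     * Creates a connection object for the URL with the Accept-Charset header set.
     * openConnection() does not contact the server; the request is only sent
     * once the caller connects or reads the response.
     */
    static HttpURLConnection prepare(String url) throws IOException {
        HttpURLConnection connection =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        connection.setRequestProperty("Accept-Charset", "utf-8");
        return connection;
    }
}
```

With the HttpGet object from the question, the equivalent call would be httpget.setHeader("Accept-Charset", "utf-8").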
Extract the charset from the response content type field. You can use the following method to do this:
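private static String extractCharsetFromContentType(String contentType) {
if (TextUtils.isEmpty(contentType)) return null;
// Match the charset parameter case-insensitively, allowing an optional opening quote;
// the value ends at whitespace, ';', ',' or a closing quote.
Pattern p = Pattern.compile("charset=\"?([^\\s;,\"]+)", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(contentType);
// group(1) is only read after a successful find(), so no try/catch is needed.
return m.find() ? m.group(1) : null;
}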
Then use the extracted charset to create the InputStreamReader:
String charsetName = extractCharsetFromContentType(connection.getContentType());
InputStreamReader inReader = (TextUtils.isEmpty(charsetName) ? new InputStreamReader(inputStream) :
new InputStreamReader(inputStream, charsetName));
BufferedReader reader = new BufferedReader(inReader);
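Putting the pieces together, the following is a self-contained sketch that reads a whole stream using a charset name such as the one extracted above. The class CharsetAwareReader and its methods are hypothetical, and falling back to UTF-8 when the name is missing or unknown is an assumption (the snippet above falls back to the platform default instead).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

/** Hypothetical helper (an assumption for illustration only). */
final class CharsetAwareReader {
    private CharsetAwareReader() {}

    /** Resolves a charset name, falling back to UTF-8 if it is missing or not supported. */
    static Charset resolveCharset(String charsetName) {
        if (charsetName == null || charsetName.trim().isEmpty()) {
            return StandardCharsets.UTF_8;
        }
        try {
            return Charset.forName(charsetName.trim());
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return StandardCharsets.UTF_8;
        }
    }

    /** Reads the entire stream as text in the given charset and closes the stream. */
    static String readAll(InputStream in, String charsetName) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, resolveCharset(charsetName)))) {
            char[] buffer = new char[4096];
            int count;
            // Reading into a char buffer keeps the original line endings intact.
            while ((count = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, count);
            }
        }
        return sb.toString();
    }
}
```

A call would look like CharsetAwareReader.readAll(inputStream, charsetName), with charsetName taken from extractCharsetFromContentType.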