Cannot parse and display non-utf8 characters read from an http request

Source: https://www.devze.com, 2022-12-11 19:37 (collected from the web)
I'm using Java to parse this request

http://ajax.googleapis.com/ajax/services/search/web?start=0&rsz=large&v=1.0&q=rz+img+news+recordid+border

which returns this JSON (truncated for the sake of brevity):

{"responseData":{"results":
<...>
"visibleUrl":"www.coolcook.net",
"cacheUrl":"http://www.google.com/search?q\u003dcache:p4Ke5q6zpnUJ:www.coolcook.net",
"title":"مطبخ مطايب - كباب الدجاج والخضار بصلصة الروب",
"titleNoFormatting":"مطبخ مطايب - كباب الدجاج والخضار بصلصة الروب","\u003drz+img+news+recordid+border"}}, 
<...>
"responseDetails": null, "responseStatus": 200}

My problem lies in the Arabic characters returned (which could be any non-ASCII characters, for that matter). I tried to convert them back to Unicode using something like:

JSONArray ja = json.getJSONObject("responseData").getJSONArray("results");
JSONObject j = ja.getJSONObject(i);
str = j.getString("titleNoFormatting");
logger.log("before: " + str); // this is just my version of println
enc_str = new String (str.getBytes(), "UTF8");
logger.log("after: " + enc_str);

However, both the 'before' and 'after' results are the same: a string of ????s, regardless of whether I write them to the server log file or to an HTML page. Is there another way to get the Arabic characters back and output them in a web page?

Does JSON offer anything for this sort of problem, perhaps a way to read the non-ASCII characters straight from the JSONObject?


The issue is most likely caused by an incorrect character encoding at the point where you read the HTTP response from Google. Can you post the code that actually fetches the URL and parses it into the JSON object?

As an example run the following:

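import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.StringWriter;
import java.net.HttpURLConnection;
import java.net.URL;

public class Test1 {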
  public static void main(String [] args) throws Exception {

    // just testing that the console can output the correct chars
    System.out.println("\"title\":\"مطبخ مطايب - كباب الدجاج والخضار بصلصة الروب");

    URL url = new URL("http://ajax.googleapis.com/ajax/services/search/web?start=0&rsz=large&v=1.0&q=rz+img+news+recordid+border");
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    InputStream is  = connection.getInputStream();

    // the important bit is here..........................\/\/\/
    InputStreamReader reader = new InputStreamReader(is, "utf-8");


    StringWriter sw = new StringWriter();

    char [] buffer = new char[1024 * 8];
    int count ;

    while( (count = reader.read(buffer)) != -1){
      sw.write(buffer, 0, count);
    }

    System.out.println(sw.toString());
  }
}

This uses the rather ugly standard URL.openConnection() that has been around since the dawn of time. If you are using something like Apache HttpClient, this becomes much easier.
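Rather than hard-coding "utf-8", you could also take the charset from the response's Content-Type header, which URLConnection exposes through getContentType(). A minimal sketch, assuming a hypothetical helper charsetOf (my own name, not part of any library) that falls back to UTF-8 when the header names no usable charset:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class ContentTypeCharset {

    // Hypothetical helper: extracts the charset parameter from a Content-Type
    // value such as "text/javascript; charset=utf-8".
    // Assumption: UTF-8 is a sensible fallback when nothing usable is declared.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Unknown or malformed charset name: use the fallback below.
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/javascript; charset=utf-8"));
        System.out.println(charsetOf("text/html"));
    }
}
```

In the example above it would be used as new InputStreamReader(is, ContentTypeCharset.charsetOf(connection.getContentType())).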

For some background reading on encodings, and an explanation of why new String(str.getBytes(), "UTF8"); will never work, read Joel's article on Unicode.
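To see why that round trip cannot help, here is a small sketch. It assumes US-ASCII as a stand-in for a platform default encoding that cannot represent Arabic (your actual default may differ). Encoding replaces every unmappable character with '?', and no later decoding can bring the original back. If the default can represent the text, the round trip simply returns the same string, so it is a no-op at best.

```java
import java.nio.charset.StandardCharsets;

public class RoundTripDemo {

    // Encodes with a charset that cannot represent Arabic, then decodes as UTF-8.
    // The Arabic characters are already lost after getBytes().
    static String lossyRoundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII); // unmappable chars become '?'
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Encodes and decodes with the same capable charset: returns an equal string.
    static String utf8RoundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u0645\u0637\u0628\u062e"; // first word of the Arabic title
        System.out.println(lossyRoundTrip(original).equals("????")); // true
        System.out.println(utf8RoundTrip(original).equals(original)); // true
    }
}
```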


I think the JSON.org Java JSON package cannot handle UTF-8, whether the character is passed in directly or as a \uXXXX escape. I tried both, as follows:

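import junit.framework.TestCase; // assuming JUnit 3, which matches the "extends TestCase" style below
import org.json.JSONException;
import org.json.JSONObject;
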
public class JsonTest extends TestCase {
    public void testParseText() {
        try {
            JSONObject json1 = new JSONObject("{\"a\":\"\u05dd\"}"); // Hebrew character U+05DD passed in directly (Java resolves the escape)
            JSONObject json2 = new JSONObject("{\"a\":\"\\u05dd\"}"); // the same character as a JSON escape sequence
            System.out.println(json1.toString());
            System.out.println(json2.toString());
        } catch (JSONException e) {
            e.printStackTrace();
        }
    }
}

I get:

{"a":"?"}
{"a":"?"}

Any ideas?


The important part of the problem is how you are handling the content of the HTTP response. That is, how are you creating the JSON object? By the time you get to the code in your original post, the content has already been corrupted.

The request results in UTF-8 encoded data. How are you parsing it into JSON objects? Is the correct encoding specified to the decoder? Or is your platform's default character encoding being used?
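As an illustration of what the wrong decoder does, here is a sketch that decodes the same UTF-8 bytes twice, once as UTF-8 and once as ISO-8859-1. The in-memory byte array stands in for the HTTP response body, and ISO-8859-1 is only an assumed example of a mismatched platform default.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {

    // Reads the whole stream into a String using the given charset.
    static String decode(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[1024];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "\u0645\u0637\u0628\u062e"; // four Arabic letters
        byte[] wire = original.getBytes(StandardCharsets.UTF_8); // stand-in for the response body

        String right = decode(new ByteArrayInputStream(wire), StandardCharsets.UTF_8);
        String wrong = decode(new ByteArrayInputStream(wire), StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(original)); // true
        System.out.println(wrong.length());         // 8: each byte became its own character
    }
}
```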


First try this:

str = j.getString("titleNoFormatting");
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("c:/test.txt"), "UTF-8"));
writer.write(str);
writer.close();

Then open the file in Notepad. If it looks fine, the problem lies in your logger or console, which is not configured to use UTF-8. Otherwise, the problem most likely lies in the JSON API you used, which is not configured to use UTF-8.
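Another check that does not depend on any console, file, or editor encoding is to print the code points of the string in plain ASCII. A minimal sketch, with dump being a hypothetical helper name of my own: if it shows U+003F where you expect Arabic letters, the string was already corrupted before it reached the logger.

```java
public class CodePointDump {

    // Renders each code point as U+XXXX, separated by spaces.
    // The output is pure ASCII, so it survives any logger or console encoding.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dump("\u0645\u0637")); // U+0645 U+0637
        System.out.println(dump("??"));           // U+003F U+003F
    }
}
```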

Edit: if the problem is actually in the JSON API and you don't know which one to choose, I'd recommend Gson. It makes converting a JSON string to an easy-to-use JavaBean really simple. Here's a basic example:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.List;

import com.google.gson.Gson;

public class Test {

    public static void main(String[] args) throws Exception {
        URL url = new URL("http://ajax.googleapis.com/ajax/services/search/web"
            + "?start=0&rsz=large&v=1.0&q=rz+img+news+recordid+border");
        BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
        GoogleResults results = new Gson().fromJson(reader, GoogleResults.class);

        // Show all results.
        System.out.println(results);

        // Show title of 1st result (is arabic).
        System.out.println(results.getResponseData().getResults().get(0).getTitle());
    }

}

class GoogleResults {

    ResponseData responseData;
    public ResponseData getResponseData() { return responseData; }
    public void setResponseData(ResponseData responseData) { this.responseData = responseData; }
    public String toString() { return "ResponseData[" + responseData + "]"; }

    static class ResponseData {
        List<Result> results;
        public List<Result> getResults() { return results; }
        public void setResults(List<Result> results) { this.results = results; }
        public String toString() { return "Results[" + results + "]"; }
    }

    static class Result {
        private String url;
        private String title;
        public String getUrl() { return url; }
        public String getTitle() { return title; }
        public void setUrl(String url) { this.url = url; }
        public void setTitle(String title) { this.title = title; }
        public String toString() { return "Result[url:" + url +",title:" + title + "]"; }
    }

}

It outputs the results nicely. Hope this helps.


There is a library (google-api-translate-java, going by the error message in the code) that preserves the encoding of the HTTP response (Czech text in my case) when reading a JSON message, like this:

private static String inputStreamToString(final InputStream inputStream) throws Exception {
 final StringBuilder outputBuilder = new StringBuilder();

 try {
  String string;
  if (inputStream != null) {
   BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream, "UTF-8"));
   while (null != (string = reader.readLine())) {
    outputBuilder.append(string).append('\n');
   }
  }
 } catch (Exception ex) {
  throw new Exception("[google-api-translate-java] Error reading translation stream.", ex);
 }

 return outputBuilder.toString();
}

The answer is tricky, and there are a few points to pay attention to, mainly the platform encoding.

As far as I know, the platform encoding affects printing to the console, creating files from an InputStream, and even communication between a DB client and server, even when both are set to use the UTF-8 charset. No matter whether I explicitly created a UTF-8 string or InputStreamReader, or configured the JDBC driver for UTF-8, I still had to set the $LANG variable to xx_XX.UTF-8 on Linux systems and add append=" vt.default_utf8=1" to the LILO boot loader (on systems that use it). This must be done at least on the systems that run the database and the Java applications working with UTF-8 encoded files.

Even with the JVM parameter -Dfile.encoding=UTF-8, I did not get properly encoded streams without the platform encoding set. If you are going to persist the strings to a database, the JDBC connector must also be set up properly: "jdbc:mysql://localhost/DBname?useUnicode=true&characterEncoding=UTF8". The database itself should be in this state:

mysql> SHOW VARIABLES LIKE 'character\_set\_%';
+--------------------------+--------+
| Variable_name            | Value  |
+--------------------------+--------+
| character_set_client     | utf8   |
| character_set_connection | utf8   |
| character_set_database   | utf8   |
| character_set_filesystem | binary |
| character_set_results    | utf8   |
| character_set_server     | utf8   |
| character_set_system     | utf8   |
+--------------------------+--------+


The Google API correctly sends UTF-8. I think the problem is that your default encoding is not capable of outputting Arabic. Check your file.encoding property, or get the encoding like this:

public static String getDefaultCharSet() throws IOException {
    OutputStreamWriter writer = new OutputStreamWriter(new ByteArrayOutputStream());
    return writer.getEncoding();
}

If the default encoding is ASCII or Latin-1, you will get "?"s. You need to change it to UTF-8.
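The effect is easy to reproduce without touching the real console. The sketch below prints the same string through two PrintStreams that write into a byte buffer, one encoding as US-ASCII and one as UTF-8. The class and method names are my own, and US-ASCII is only an assumed example of a limited default.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class PrintStreamEncodingDemo {

    // Prints text through a PrintStream that uses the named charset and
    // returns the bytes that would have reached the console or log file.
    static byte[] printWith(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(sink, true, charsetName);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String hebrew = "\u05dd";
        // US-ASCII cannot represent the character, so a single '?' byte (0x3F) is written.
        System.out.println(printWith(hebrew, "US-ASCII").length); // 1
        // UTF-8 writes the two-byte sequence 0xD7 0x9D.
        System.out.println(printWith(hebrew, "UTF-8").length);    // 2
    }
}
```

Wrapping System.out the same way, with new PrintStream(System.out, true, "UTF-8"), only helps if the terminal or log viewer also interprets the bytes as UTF-8.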
