In my application, I read tweets from twitter, but the tweets are not language restricted. So when I am trying to send response for a Chinese/Japanese tweet the content is not displayed correctly. I have currently set the
response.setContentType("text/html;charset=UTF-8");
before sending the resp开发者_StackOverflowonse.
How can we handle multiple languages?
i can see the message sent
{"lastPost":{"lastUpdate":"毋成金口","pubDate":"Fri Aug 12 00:39:09 UTC 2011","message_id":101814948329562112}
this is a json string and added to the response..
on my client i.e iphone the lastpost is "????"
Telling the browser that the page is UTF-8 is a good thing, but useless unless you make sure that you are actually writing only UTF-8 in the page.
To make sure this happens :
- Whenever you read, from twitter or whatever, always require UTF-8 data, make sure you are receiving UTF-8 bytes.
- When you create a string from raw bytes, Java by default uses the "platform default encoding" which could be anything. Bytes to string conversion happens when creating a new String from a byte array or when using a Reader. Both these methods allow you to explicitly define the encpding you expect the bytes to be. Once point 1 is checked and you are receiving UTF-8 byes, make sure everywhere in your application you are specifying to use UTF-8 when converting bytes to strings.
- when using a Writer, to convert strings to bytes sent for example to the browser (the servlet writer), the same rules apply : try to be explicit and always specify UTF-8
- If you store stuff in databases, then you have two encoding problems. The first one is which encpding you database is using when talking to your application (connection encoding), the second is which encoding the database is actually storing strings in (storage encoding). Usually, you can specify only the connection encoding from Java, while the storage encoding is specified in the database when it is created (search for "collation" if you are using mysql).
Detecting where a string that is supposed to be UTF-8 gets reencoded badly is a hard task. 99% of the times, it is being converted to ISO-latin or similar encoding somewhere, which causes special characters like à or ì appear as two chars of garbage. Often debugging is the only way to find out where this happens.
the problem was with the client encoding.. it was set to ISO-
精彩评论