I am facing a problem with character encoding.
For example, I have an XML message whose declared encoding is "UTF-8".
<message>
<product_name>apple</product_name>
<price>1.3</price>
<product_name>orange</product_name>
<price>1.2</price>
.......
</message>
Now, this message supports multiple languages:
- Traditional Chinese (big5),
- Simplified Chinese (gb),
- English (utf-8)
The encoding changes only in specific fields.
For example (Traditional Chinese): 蘋果 1.3 橙 1.2 .......
Only "蘋果" and "橙" are using big5;
"<product_name>" and "</product_name>" are still using utf-8.
<price>1.3</price> and <price>1.2</price> are using utf-8.
How do I know which words are using a different encoding?
It looks like whoever is providing the XML is providing incorrect XML. They should be using a consistent encoding.
http://sourceforge.net/projects/jchardet/files/ is a pretty good heuristic charset detector. It's a port of the one used in Firefox to detect the encoding of pages that are missing a charset in content-type or a BOM.
You could use that to try and figure out the encoding for substrings in a malformed XML file if you can't get the provider to fix their output.
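If an external detector is not an option, a rough fallback using only the JDK is to decode each field's raw bytes strictly as UTF-8 and, when that fails, decode them with the legacy charset expected for the message's language. This is a minimal sketch, not a real detector: the class and method names (`FieldDecoder`, `tryDecode`, `decodeField`) are made up for illustration, it assumes you can get at the field's bytes before anything has turned them into a `String`, and it assumes the JDK in use ships the "Big5" charset. Strict decoding alone cannot reliably tell Big5 from GB, so the caller has to supply the legacy charset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class FieldDecoder {

    // Decodes strictly: returns null instead of silently replacing bad bytes.
    static String tryDecode(byte[] raw, Charset cs) {
        try {
            return cs.newDecoder()
                     .onMalformedInput(CodingErrorAction.REPORT)
                     .onUnmappableCharacter(CodingErrorAction.REPORT)
                     .decode(ByteBuffer.wrap(raw))
                     .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    // Tries UTF-8 first, then the legacy charset the caller expects
    // (for example Big5 for Traditional Chinese messages).
    static String decodeField(byte[] raw, Charset legacy) {
        String s = tryDecode(raw, StandardCharsets.UTF_8);
        if (s == null) {
            s = tryDecode(raw, legacy);
        }
        // Last resort: lenient decode, bad bytes become U+FFFD.
        return s != null ? s : new String(raw, legacy);
    }
}
```

UTF-8 is tried first because its byte patterns are strict enough that Big5 or GB text rarely passes as valid UTF-8; the reverse check would be much weaker.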
You should use only one encoding in one XML file. Every Big5 character has a counterpart in UTF-8.
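To illustrate that point, here is a small sketch of converting Big5 bytes to UTF-8 bytes with the standard library. The class and method names (`Big5Transcoder`, `big5ToUtf8`) are hypothetical, and it assumes the input bytes really are Big5 and that the JDK provides the "Big5" charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class Big5Transcoder {

    // Decodes the bytes as Big5, then re-encodes the text as UTF-8.
    static byte[] big5ToUtf8(byte[] big5Bytes) {
        String text = new String(big5Bytes, Charset.forName("Big5"));
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```

A producer could apply a step like this to each Big5 field before writing the message, so that the whole file matches its declared UTF-8 encoding.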
I cannot get the provider to fix the output, so I have to handle it myself, and I cannot use external libraries in this project.
The only way I can solve it is like this,
String str = new String(big5String.getBytes("UTF-8"));
before displaying the message.
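For comparison, a different and commonly used repair (not the one above) applies when the Big5 bytes were already decoded into a `String` one byte per character, for example as ISO-8859-1. In that case the original bytes can be recovered and decoded again with the right charset. This is only a sketch under that assumption; the names `MojibakeRepair` and `repair` are hypothetical, and it will not help if the wrong decoding was lossy (for instance, a strict or replacing UTF-8 decode).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class MojibakeRepair {

    // Assumes 'garbled' was produced by decoding the real bytes as ISO-8859-1,
    // which maps every byte to exactly one char and is therefore reversible.
    static String repair(String garbled, Charset actual) {
        byte[] originalBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, actual);
    }
}
```

Passing the charset explicitly on both steps avoids depending on the platform default, which differs between machines.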