I have a bunch of Chinese characters in, say, a DB or an XML file. They are stored there using UTF-8 encoding.
Now I need to get this information into my Java code. I read the XML using a DOM parser and stored the Chinese characters in a String. This is later displayed in a JSP page and printed to the System.out console.
It is working fine, and I do not know why.
As per my understanding, Java should use the proper encoding (in this case UTF-8) to store the Chinese characters. But when I checked the default encoding used by the JVM, it is not UTF-8 or UTF-16. It is something like Cp1522 (I am not sure this is correct; I cannot recall the exact value, my apologies).
So it should not be able to print the values, right? Could you please help me understand why this is working?
The "default" you refer to is probably the "platform default", which is used when no other encoding information is available, and only when converting between bytes and characters as data moves into or out of the JVM. Once inside the JVM, all characters are represented in UTF-16. The encoding you mentioned is probably Cp1252. It is impossible to represent Chinese characters in that encoding, so that's not what's happening. You'd have to be more specific about what you're doing, but the XML parser you're using is probably detecting the correct encoding and thus not garbling the text.
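As a rough illustration of the point about Cp1252, the sketch below asks a charset encoder whether it can represent 你好 at all. The class and method names are made up for this example, and it assumes the JDK in use ships the windows-1252 charset (the canonical name for Cp1252). The characters are written as Unicode escapes so that the encoding of the source file itself does not interfere.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original question.
class CanEncodeDemo {
    // Returns true if every character of the text can be encoded in the given charset.
    static boolean canEncode(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // The two characters 你好, written as escapes (U+4F60, U+597D).
        String hello = "\u4f60\u597d";

        // The platform default only matters where bytes enter or leave the JVM.
        System.out.println("Platform default: " + Charset.defaultCharset());

        System.out.println(canEncode(hello, Charset.forName("windows-1252"))); // false
        System.out.println(canEncode(hello, StandardCharsets.UTF_8));          // true
    }
}
```

The String itself holds the characters either way; the charset only comes into play when the String is turned into bytes or built from bytes.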
Assuming everything is working, this is how it'd work:
Your XML parser decodes the XML and converts it to Java's internal representation (effectively UTF-16; a Java char is actually a UTF-16 code unit, not a "character").
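A minimal sketch of that decoding step, assuming the standard JAXP DOM parser and an XML document that declares its encoding as UTF-8. The element name and the helper are invented for the example; your actual document and parsing code will differ.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class, not part of the original question.
class DomDecodeDemo {
    // Parses raw XML bytes and returns the text of the root element.
    // The parser is handed bytes, so it has to work out the encoding itself
    // (from the XML declaration or a byte order mark), not from the platform default.
    static String rootText(byte[] xmlBytes) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // 你好 written as escapes so the source file encoding does not matter.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><greeting>\u4f60\u597d</greeting>";
        byte[] utf8Bytes = xml.getBytes(StandardCharsets.UTF_8);

        String text = rootText(utf8Bytes);
        System.out.println(text.length()); // 2
    }
}
```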
When you render a JSP, the page is encoded based on your servlet container configuration. The HTTP headers probably include the encoding being used, so your browser can decode it correctly.
Here's where it becomes unclear whether things really are working. What ends up in System.out depends on how you're writing to it. You say "printed", so I'm guessing you're using the print methods, which means the platform's default character encoding is being used. If this encoding really is CP-1252 (the only one I can think of that sounds like Cp1522) and the result looks "right", then something is actually wrong.
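To see what the print methods do with a given encoding, here is a small sketch that prints into an in-memory buffer instead of the console and inspects the bytes. The helper name is hypothetical, and windows-1252 is assumed to be available in your JDK.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical demo class, not part of the original question.
class PrintEncodingDemo {
    // Prints the text through a PrintStream that uses the named charset
    // and returns the bytes that were actually written.
    static byte[] printedBytes(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(sink, true, charsetName)) {
            out.print(text);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + charsetName, e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        String hello = "\u4f60\u597d"; // 你好 as escapes

        System.out.println(printedBytes(hello, "UTF-8").length);        // 6
        // Cp1252 has no mapping for these characters, so each becomes '?'.
        System.out.println(printedBytes(hello, "windows-1252").length); // 2

        // Wrapping System.out with an explicit charset avoids the platform default.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(hello);
    }
}
```

Whether the last line shows up correctly still depends on the console or terminal interpreting the bytes as UTF-8.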
CP-1252 is essentially Latin-1, which is sometimes abused into being treated as "bytes == chars". That would suggest that your multi-byte Chinese characters are actually being converted into multiple Java chars. This would only be correct behavior for characters outside the BMP (plane 0), and in that case each such character should become a surrogate pair.
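For reference, this is roughly what the difference between a BMP character and a non-BMP character looks like in a Java String. U+20000 is picked here only as an example of a character outside the BMP, and the class and helper names are invented for the illustration.

```java
// Hypothetical demo class, not part of the original question.
class SurrogateDemo {
    // Builds a String holding exactly one Unicode code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String bmp = fromCodePoint(0x4F60);            // 你, inside the BMP
        String supplementary = fromCodePoint(0x20000); // a CJK character outside the BMP

        System.out.println(bmp.length());           // 1 (one UTF-16 code unit)
        System.out.println(supplementary.length()); // 2 (a surrogate pair)
        System.out.println(supplementary.codePointCount(0, supplementary.length())); // 1
        System.out.println(Character.isSurrogatePair(
                supplementary.charAt(0), supplementary.charAt(1))); // true
    }
}
```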
To test what's going on, try putting the two characters 你好 into your XML and checking the length of the parsed String. The length should be 2 (those are both BMP characters). If the length is something bigger (probably 6), then you're decoding incorrectly and things only seem to work because you're re-encoding the same (wrong) way.
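The effect described above can be reproduced without any XML, as a sketch: take the UTF-8 bytes of 你好, decode them once correctly and once as windows-1252, and compare the lengths. The names below are hypothetical, and the round trip is only claimed for this particular byte sequence, since not every byte value is defined in windows-1252.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the original question.
class MisdecodeDemo {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Decodes the bytes with the given charset and reports the resulting String length.
    static int decodedLength(byte[] bytes, Charset charset) {
        return new String(bytes, charset).length();
    }

    // True if decoding and then re-encoding with the same charset gives back the original bytes.
    static boolean roundTrips(byte[] bytes, Charset charset) {
        return Arrays.equals(bytes, new String(bytes, charset).getBytes(charset));
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u4f60\u597d".getBytes(StandardCharsets.UTF_8); // 你好, 6 bytes

        System.out.println(decodedLength(utf8, StandardCharsets.UTF_8)); // 2 (correct)
        System.out.println(decodedLength(utf8, CP1252));                 // 6 (one char per byte)
        // The wrong decoding still reproduces the original bytes on output,
        // which is why the result can look right in a UTF-8 aware viewer.
        System.out.println(roundTrips(utf8, CP1252));                    // true
    }
}
```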
I would recommend setting your IDE's default workspace encoding to "UTF-8". Otherwise the IDE may change the encoding when you modify the XML files.
Anyway, you seem to be more interested in how DOMParser works. DOMParser decides its own encoding; it probably uses its own default. You can step into it with a debugger and see which encoding it is using.