I read that Java uses UTF-16 encoding internally, i.e. I understand that if I have something like String var = "जनमत"; then "जनमत" will be encoded in UTF-16 internally. So, if I dump this variable to some file such as below:
FileOutputStream fileOut = new FileOutputStream("output.xyz");
ObjectOutputStream out = new ObjectOutputStream(fileOut);
out.writeObject(var);
will the encoding of the string "जनमत" in the file "output.xyz" be in UTF-16? Also, later on if I want to read from the file "output.xyz" via ObjectInputStream, will I be able to get the UTF-16 representation of the variable?
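And here is a sketch of the read side I have in mind for later (the class name and file name are just placeholders):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.ObjectInputStream;

public class ReadItBack {
    public static void main(String[] args) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream("output.xyz"))) {
            // Read back whatever writeObject() put in the file.
            String var = (String) in.readObject();
            System.out.println(var); // जनमत
        }
    }
}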
Thanks.
So, If I dump this variable to some file... will the encoding of the string "जनमत" in the file "output.xyz" be in UTF-16?
The encoding of your string in the file will be in whatever format the ObjectOutputStream wants to put it in. You should treat it as a black box that can only be read by an ObjectInputStream. (Seriously - even though the format is IIRC well-documented, if you want to read it with some other tool, you should serialise the object yourself as XML or JSON or whatever.)
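For instance, if some other tool needs to read the text, writing it out yourself in an encoding you choose is trivial -- a minimal sketch (the class and file names are just illustrative):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class WriteTextYourself {
    public static void main(String[] args) throws IOException {
        String var = "जनमत";

        // Write the characters in an encoding you pick and control.
        Files.write(Paths.get("output.txt"), var.getBytes(StandardCharsets.UTF_8));

        // Anything that understands UTF-8 can now read the file back.
        String back = new String(Files.readAllBytes(Paths.get("output.txt")), StandardCharsets.UTF_8);
        System.out.println(back.equals(var)); // true
    }
}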
Later on if I want to read from the file "output.xyz" via ObjectInputStream, will I be able to get the UTF-16 representation of the variable?
If you read the file with an ObjectInputStream, you'll get a copy of the original object back. This will include a java.lang.String, which is just a stream of characters (not bytes) - from which you could get a UTF-16 representation if you wished, via getBytes(StandardCharsets.UTF_16) (though I suspect you don't actually need to).
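If you really do want to see the UTF-16 bytes, a small sketch of asking for them explicitly (purely illustrative):

import java.nio.charset.StandardCharsets;

public class Utf16Bytes {
    public static void main(String[] args) {
        String var = "जनमत"; // 4 chars, all in the Basic Multilingual Plane

        // Ask for UTF-16 explicitly; the no-argument getBytes() uses the platform default charset.
        byte[] utf16 = var.getBytes(StandardCharsets.UTF_16);
        System.out.println(utf16.length); // 10: a 2-byte byte order mark plus 2 bytes per char

        // Or pick the byte order yourself and skip the BOM:
        byte[] utf16be = var.getBytes(StandardCharsets.UTF_16BE);
        System.out.println(utf16be.length); // 8
    }
}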
In conclusion, don't worry too much about the internal details of serialization. If you need to know what's going on, create the file yourself; and if you're just curious, trust in the JVM to do the right thing.
Close: it is not exactly UTF-16 but something like UCS-2; either way it uses 2 bytes for most characters (and a sequence of 2 chars, i.e. 4 bytes, for some rarely used code points).
ObjectOutputStream uses something called modified UTF-8, which is like UTF-8 except that the zero character is written as a 2-byte sequence that is not legal UTF-8 (because of the uniqueness restrictions of the encoding), but which naturally decodes back to the value 0.
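One way to see this is to run a string containing U+0000 through writeUTF (DataOutputStream uses the same modified UTF-8) and dump the bytes -- a rough sketch:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            // writeUTF emits a 2-byte length followed by the modified UTF-8 bytes.
            out.writeUTF("A\u0000B");
        }
        for (byte b : buf.toByteArray()) {
            System.out.printf("%02X ", b);
        }
        // Prints: 00 04 41 C0 80 42 -- the zero character becomes C0 80,
        // an over-long sequence that strict UTF-8 would reject.
    }
}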
But what you are really asking is "does it work so that I write a String and read a String back" -- and the answer to that is yes. The JDK does the proper encoding when writing the bytes out, and the decoding when reading them back in.
For what it's worth, you are better off using the writeUTF() method for Strings, since I think the resulting output is a bit more compact; but writeObject() also works, it just needs a bit more metadata.
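If you want to check the difference for your own strings, something like this (measuring writeUTF on a plain DataOutputStream so the object-stream header doesn't muddy the comparison):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

public class CompareSizes {
    public static void main(String[] args) throws IOException {
        String var = "जनमत";

        // writeObject: 4-byte stream header, a type tag, a 2-byte length, then modified UTF-8 bytes.
        ByteArrayOutputStream objBuf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(objBuf)) {
            out.writeObject(var);
        }

        // writeUTF: just a 2-byte length followed by the modified UTF-8 bytes.
        ByteArrayOutputStream utfBuf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(utfBuf)) {
            out.writeUTF(var);
        }

        System.out.println("writeObject: " + objBuf.size() + " bytes");
        System.out.println("writeUTF:    " + utfBuf.size() + " bytes");
    }
}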
Just to add to this, ObjectOutputStream.writeString() will determine the UTF length of a given string and write it in either "standard" UTF or "long" UTF format, where "long", as stated in the javadoc:
"Long" UTF format is identical to standard UTF, except that it uses an 8 byte header (instead of the standard 2 bytes) to convey the UTF encoding length.
I got this from code...
private void writeString(String str, boolean unshared) throws IOException {
    handles.assign(unshared ? null : str);
    long utflen = bout.getUTFLength(str);
    if (utflen <= 0xFFFF) {
        bout.writeByte(TC_STRING);
        bout.writeUTF(str, utflen);
    } else {
        bout.writeByte(TC_LONGSTRING);
        bout.writeLongUTF(str, utflen);
    }
}
and in writeObject(Object obj) they do a check:
if (obj instanceof String) {
    writeString((String) obj, unshared);
}
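To see the TC_STRING case in an actual stream, a small sketch using the constants from java.io.ObjectStreamConstants:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamConstants;

public class InspectTag {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject("जनमत"); // UTF length is far below 0xFFFF, so the standard format is used
        }
        byte[] bytes = buf.toByteArray();

        // Bytes 0-3 are the stream magic and version; byte 4 is the type tag.
        System.out.println(bytes[4] == ObjectStreamConstants.TC_STRING);     // true
        System.out.println(bytes[4] == ObjectStreamConstants.TC_LONGSTRING); // false
    }
}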