I thought characters in Java are 16 bits, as the Java documentation suggests. Isn't that the case for strings? I have code that stores an object into a file:
public static void storeNormalObj(File outFile, Object obj) {
    FileOutputStream fos = null;
    ObjectOutputStream oos = null;
    try {
        fos = new FileOutputStream(outFile);
        oos = new ObjectOutputStream(fos);
        oos.writeObject(obj);
        oos.flush();
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        try {
            if (oos != null) {
                oos.close(); // closing oos also closes the underlying fos
            } else if (fos != null) {
                fos.close();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Basically, I tried to store the string "abcd" into the file "output". When I opened "output" with an editor and deleted the non-string part, what was left was just the string "abcd", which is 4 bytes in total. Does anyone know why? Does Java automatically save space by using ASCII instead of Unicode for strings that ASCII can represent? Thanks.
(I think by "none string part" you are referring to the bytes that ObjectOutputStream emits when you create it. It is possible you don't want to use ObjectOutputStream, but I don't know your requirements.)
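To make those extra bytes visible, here is a minimal sketch (the class name is just for illustration) that serializes "abcd" to a byte array and hex-dumps the result:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

public class SerializedDump {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(buf);
        oos.writeObject("abcd");
        oos.flush();
        for (byte b : buf.toByteArray()) {
            System.out.printf("%02x ", b);
        }
        // Prints: ac ed 00 05 74 00 04 61 62 63 64
        // ac ed 00 05 -- stream magic and version (the "non-string part")
        // 74          -- TC_STRING tag
        // 00 04       -- two-byte length
        // 61 62 63 64 -- "abcd", one byte per ASCII character
    }
}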
Just FYI, Unicode and UTF-8 are not the same thing. Unicode is a standard that specifies, amongst other things, what characters are available. UTF-8 is a character encoding that specifies how these characters shall be physically encoded in 1s and 0s. UTF-8 can use 1 byte for ASCII (<= 127) and up to 4 bytes to represent other Unicode characters.
UTF-8 is a strict superset of ASCII. So even if you specify a UTF-8 encoding for a file and you write "abcd" to it, it will contain just those four bytes: they have the same physical encoding in ASCII as they do in UTF-8.
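You can see this directly with a quick sketch (the class name is arbitrary):

import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        // "abcd" is pure ASCII, so its UTF-8 form is one byte per
        // character, while UTF-16 uses two bytes per character.
        System.out.println("abcd".getBytes(StandardCharsets.UTF_8).length);    // 4
        System.out.println("abcd".getBytes(StandardCharsets.UTF_16BE).length); // 8
    }
}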
Your method uses ObjectOutputStream, which actually has a significantly different encoding than either ASCII or UTF-8! If you read the Javadoc carefully, you'll see that if obj is a string that has already occurred in the stream, subsequent calls to writeObject cause a back-reference to the earlier string to be emitted, potentially resulting in many fewer bytes being written in the case of repeated strings.
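A small sketch of that behavior (buffer and class names are mine, not part of any required API): writing the same String instance twice adds only a few bytes the second time:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

public class BackRefDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(buf);

        oos.writeObject("abcd");
        oos.flush();
        int afterFirst = buf.size();

        oos.writeObject("abcd"); // already in the stream: a handle is written instead
        oos.flush();
        int afterSecond = buf.size();

        // The second write adds far fewer bytes than the first,
        // because only a back-reference (handle) is emitted.
        System.out.println(afterFirst);
        System.out.println(afterSecond - afterFirst);
    }
}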
If you're serious about understanding this, you really should spend a good amount of time reading about Unicode and character encoding systems. Wikipedia has an excellent article on Unicode as a start.
Yea, the char is UTF-16 only within the context of the Java runtime environment. If you wish to write characters out using a 16-bit encoding, wrap the stream in an OutputStreamWriter constructed with a UTF-16 charset; note that a plain FileWriter would use the platform default encoding instead (imports from java.io and java.nio.charset assumed):
Reader inputStream = null;
Writer outputStream = null;
try {
    inputStream = new FileReader("input.dat"); // some character source
    outputStream = new OutputStreamWriter(
            new FileOutputStream("myfilename.dat"), StandardCharsets.UTF_16BE);
    int c;
    while ((c = inputStream.read()) != -1) {
        outputStream.write(c); // each char is encoded as two bytes
    }
} finally {
    if (inputStream != null) {
        inputStream.close();
    }
    if (outputStream != null) {
        outputStream.close();
    }
}
If you look at how ObjectOutputStream serializes a String, you will find that it writes the characters the way DataOutput.writeUTF does, i.e. as "modified UTF-8". The details are lengthy, but as long as you stick to 7-bit ASCII, yes, each character will take one byte. If you want the gory details, look at the EXTREMELY long Javadoc for DataOutput.writeUTF().
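A quick sketch of what writeUTF produces (class name is arbitrary): a two-byte length prefix followed by the modified-UTF-8 bytes:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class WriteUtfDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeUTF("abcd");
        // 2-byte length prefix + 4 single-byte ASCII characters = 6 bytes
        System.out.println(buf.size()); // 6
    }
}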
You may be interested to know there is a -XX:+UseCompressedStrings option in the Java 6 Update 21 performance release and later. This allows String to use a byte[] instead of a char[] for strings which do not need the full 16 bits per character. Despite the Java HotSpot VM Options guide suggesting it may be on by default, this may only be true for performance releases; it only appears to work for me if I turn it on explicitly.
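For example, on a JVM that supports the flag (the jar name here is just a placeholder):

java -XX:+UseCompressedStrings -jar myapp.jar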
So were you expecting a 16 × 4 = 64-bit (8-byte) file? That is more than either UTF-8 or ASCII encoding needs. And once the data has been written out, how the file's space is managed is up to the operating system; your code has no control over it.