Java UTF-8 differences

Developer https://www.devze.com 2023-03-14 11:57 Source: the web

The JavaDoc says: "The null byte '\u0000' is encoded in 2-byte format rather than 1-byte, so that the encoded strings never have embedded nulls."

But what does this even mean? What's an embedded null in this context? I am trying to convert a UTF-8 string saved by Java to "real" UTF-8.

In C a string is terminated by the byte value 00.

Java strings can contain the character '\u0000', but to avoid confusion when such a string is passed to C code (the language native methods are typically written in), that character is encoded differently, namely as the two bytes

11000000 10000000

(according to the Javadoc), neither of which is 00.

This is a hack to work around something you cannot change easily.

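Also note that this is an overlong two-byte form. A lenient decoder will decode it to 00, but standard UTF-8 (RFC 3629) forbids overlong encodings, so a strict decoder rejects it.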
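Whether the two bytes C0 80 are accepted depends on how strict the decoder is. A minimal sketch for checking this with the JDK's own UTF-8 decoder, configured to report malformed input instead of replacing it; the class and method names are made up for illustration:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
class OverlongNulCheck {
    // Returns true if the bytes decode as UTF-8 without any malformed-input error.
    static boolean decodesStrictly(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] overlongNul = {(byte) 0xC0, (byte) 0x80};
        byte[] plainNul = {0x00};
        System.out.println("C0 80 accepted: " + decodesStrictly(overlongNul));
        System.out.println("00 accepted:    " + decodesStrictly(plainNul));
    }
}
```

On a current JDK the first check should come out false and the second true, which is why data written by writeUTF should not be fed directly to a standard UTF-8 decoder.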

No "embedded nulls" means that the encoded string data does not contain a single 0x00 (NUL) byte.

\u0000 gets encoded to (binary) 11000000 10000000, (hex) 0xC0 0x80.
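To see these bytes for yourself, something like the following should do. It is only a sketch: the class name and the hex helper are invented for the example, and it writes into an in-memory buffer rather than a file.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class.
class ModifiedUtf8Nul {
    // Returns the raw bytes that DataOutputStream.writeUTF produces for s.
    static byte[] writeUtf(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // The first two bytes are the length prefix, the rest is the encoded character.
        System.out.println(hex(writeUtf("\u0000"))); // prints: 00 02 C0 80
    }
}
```

Note that the two-byte length prefix written by writeUTF can itself contain a 00 byte, as it does here. The "no embedded nulls" guarantee covers the encoded characters that follow it.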

That's not a Java-wide difference; it only applies to DataInputStream/DataOutputStream. If the string data was written using DataOutputStream, then just read it back in using DataInputStream.

If you need to write the string data to, say, a file, don't use DataOutputStream, use a Writer, which is meant for character streams.

This is only for the method writeUTF of DataOutputStream, not for normal converted streams (OutputStreamWriter or such).
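For comparison, here is a rough sketch of the Writer route. It assumes an in-memory target and a made-up class name; for a real file you would wrap a FileOutputStream instead.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class.
class PlainUtf8Writer {
    // Encodes s as standard UTF-8 through an OutputStreamWriter.
    static byte[] writeWithWriter(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(buffer, StandardCharsets.UTF_8)) {
            writer.write(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = writeWithWriter("a\u0000b");
        // No length prefix, and the NUL character is a single 00 byte.
        System.out.println(bytes.length); // prints: 3
        System.out.println(bytes[1]);     // prints: 0
    }
}
```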

It means that if you have a string "\u0000", it will be encoded as 0xC0 0x80 instead of simply 0x00.

Conversely, the sequence 0xC0 0x80, which will never occur in normal UTF-8 strings, represents a NUL character.

Also, the documentation you linked seems to date from the time when Unicode was still a 16-bit character set. Nowadays it also allows characters above 0xFFFF, each of which is represented by two Java char values (in UTF-16 format, a surrogate pair) and needs 4 bytes in UTF-8, if I calculated right. I'm not sure about the implementation here, though: it looks like these are simply written in CESU-8 format (i.e. two 3-byte sequences, each corresponding to one UTF-16 surrogate, which together give one Unicode character). You will have to take care of this, too.
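The size difference for characters above 0xFFFF can be checked with a small experiment like this one. U+1F600 is just an arbitrary supplementary character picked for the example, and the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class.
class SupplementaryCharSizes {
    // Number of bytes writeUTF emits for s, not counting the 2-byte length prefix.
    static int writeUtfPayloadLength(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size() - 2;
    }

    // Number of bytes s occupies in standard UTF-8.
    static int standardUtf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // One code point above 0xFFFF, stored as a surrogate pair (two chars).
        String s = new String(Character.toChars(0x1F600));
        System.out.println(s.length());               // prints: 2
        System.out.println(writeUtfPayloadLength(s)); // prints: 6
        System.out.println(standardUtf8Length(s));    // prints: 4
    }
}
```

Six bytes from writeUTF against four in standard UTF-8 matches the "two 3-byte sequences" reading above.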

If you are using Java, the simplest thing would be to use DataInputStream to read this into a string, and then convert it (with getBytes("UTF-8") or an OutputStreamWriter) to real UTF-8 data.
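A minimal sketch of that conversion, assuming the input is exactly one string as written by writeUTF (length prefix included) and already sits in a byte array; the class and method names are invented here:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical converter, not a library class.
class ModifiedToStandardUtf8 {
    // Decodes one writeUTF-formatted string and re-encodes it as standard UTF-8.
    static byte[] toStandardUtf8(byte[] writeUtfData) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(writeUtfData))) {
            String decoded = in.readUTF();
            return decoded.getBytes(StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Length prefix 00 02, followed by the two-byte form of NUL.
        byte[] saved = {0x00, 0x02, (byte) 0xC0, (byte) 0x80};
        byte[] real = toStandardUtf8(saved);
        System.out.println(real.length); // prints: 1
        System.out.println(real[0]);     // prints: 0
    }
}
```

Going through a String also takes care of surrogate pairs, since getBytes emits a single 4-byte sequence for each pair.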

If you are having difficulty reading a "saved" Java string, you need to look at the specification for the methods that read/write in that format:

If the string was written using DataOutput.writeUTF(), the DataInput.readUTF() javadoc is the definitive spec. In addition to the non-standard handling of NUL, it specifies that the string starts with an unsigned 16-bit byte count.
If the string was written using ObjectOutputStream.writeObject(), then the serialization spec is definitive.
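For the writeUTF case, the length prefix can be inspected on its own, which is handy when checking whether a file really is in this format. The following is only a sketch with a made-up class name, and it does not cover the ObjectOutputStream case.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class.
class WriteUtfPrefix {
    // Reads the unsigned 16-bit byte count at the start of writeUTF output.
    static int declaredByteCount(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return in.readUnsignedShort();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF("h\u00E9llo"); // 4 ASCII chars plus one 2-byte char
        }
        byte[] data = buffer.toByteArray();
        System.out.println(declaredByteCount(data)); // prints: 6
        System.out.println(data.length - 2);         // prints: 6
    }
}
```

The prefix counts encoded bytes, not characters, so it matches the number of bytes that follow it.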