I'm afraid I have a question on a detail of a rather oversaturated topic. I searched around a lot, but couldn't find a clear answer to this specific, obvious and (imho) important problem:
When converting byte[] to String using UTF-8, each byte (8 bit) becomes an 8-bit character encoded in UTF-8, but each UTF-8 character is saved as a 16-bit character in Java. Is that correct? If yes, this means that each stupid Java character only uses the first 8 bits, and consumes double the memory? Is that correct too? I wonder how this wasteful behaviour is acceptable...
Isn't there some trick to have a pseudo String that is 8 bit? Would that actually result in less memory consumption? Or maybe, is there a way to store >two< 8-bit characters in one 16-bit Java char to avoid this memory waste?
thanks for any deconfusing answers...
EDIT: hi, thanks everybody for answering. I was aware of the variable-length property of UTF-8. However, since my source is byte, which is 8 bit, I understood (apparently wrongly) that it needs only 8-bit UTF-8 words. Is UTF-8 conversion actually producing the strange symbols that you see when you do "cat somebinary" on the CLI? I thought UTF-8 was just somehow used to map each of the possible 8-bit words of a byte to one particular 8-bit word of UTF-8. Wrong? I thought about using Base64, but it's bad because it only uses 7 bits.
Question reformulated: is there a smarter way to convert byte[] to some kind of String? My favorite was to just cast byte[] to char[], but then I still have 16-bit words.
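For illustration, the kind of conversion I mean (a sketch - Java has no direct byte[]-to-char[] cast, so it has to be an element-by-element copy):

// Each 8-bit byte still lands in a 16-bit char slot, so nothing is saved.
byte[] bytes = { 'f', 'o', 'o' };
char[] chars = new char[bytes.length];
for (int i = 0; i < bytes.length; i++)
    chars[i] = (char) (bytes[i] & 0xFF); // mask to avoid sign extension
String s = new String(chars);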
additional use case info:
I'm adapting Jedis (a Java client for the NoSQL store Redis) as the "primitive storage layer" for HyperGraphDB. So Jedis is a database for another "database". My problem is that I have to feed Jedis with byte[] data all the time, but internally, >Redis< (the actual server) is dealing only with "binary safe" Strings. Since Redis is written in C, a char is 8 bits long, AFAIK not ASCII which is 7 bits. In Jedis, however (Java world), every character is 16 bits long internally. I don't understand this code (yet), but I suppose Jedis then converts these 16-bit Java strings to Redis-conforming 8-bit strings ([here][3]). It says it extends FilterOutputStream. My hope is to bypass the byte[] <-> String conversion altogether and use that FilterOutputStream...?
Now I wonder: if I have to interconvert byte[] and String all the time, with data sizes ranging from very small to potentially very big, isn't there a huge waste of memory in having each 8-bit character passed around as 16 bits within Java?
Isn't there some trick to have a pseudo String that is 8 bit?
yes, make sure you have an up to date version of Java. ;)
http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html
-XX:+UseCompressedStrings Use a byte[] for Strings which can be represented as pure ASCII. (Introduced in Java 6 Update 21 Performance Release)
EDIT: This option doesn't work in Java 6 update 22 and is not on by default in Java 6 update 24. Note: it appears this option may slow performance by about 10%.
The following program
import java.util.ArrayList;
import java.util.List;

public class StringMemoryTest {
    public static void main(String... args) {
        // Build a large base string of digits.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++)
            sb.append(i);
        // Warm up twice, then print the measurements.
        for (int j = 0; j < 10; j++)
            test(sb, j >= 2);
    }

    private static void test(StringBuilder sb, boolean print) {
        List<String> strings = new ArrayList<String>();
        forceGC();
        long free = Runtime.getRuntime().freeMemory();
        long size = 0;
        for (int i = 0; i < 100; i++) {
            final String s = "" + sb + i;
            strings.add(s);
            size += s.length();
        }
        forceGC();
        long used = free - Runtime.getRuntime().freeMemory();
        if (print)
            System.out.println("Bytes per character is " + (double) used / size);
    }

    // Best-effort GC: not guaranteed, but good enough for a rough measurement.
    private static void forceGC() {
        try {
            System.gc();
            Thread.sleep(250);
            System.gc();
            Thread.sleep(250);
        } catch (InterruptedException e) {
            throw new AssertionError(e);
        }
    }
}
Prints this by default
Bytes per character is 2.0013668655941212
Bytes per character is 2.0013668655941212
Bytes per character is 2.0013606946433575
Bytes per character is 2.0013668655941212
with the option -XX:+UseCompressedStrings
Bytes per character is 1.0014671435440285
Bytes per character is 1.0014671435440285
Bytes per character is 1.0014609725932648
Bytes per character is 1.0014671435440285
Actually, you have the UTF-8 part wrong: UTF-8 is a variable-length multibyte encoding, so there are valid characters 1-4 bytes in length (in other words, some UTF-8 characters are 8-bit, some are 16-bit, some are 24-bit, and some are 32-bit). Although the 1-byte characters take up 8 bits, there are many more multibyte characters. If you only had 1-byte characters, it would only allow you to have 256 different characters in total (a.k.a. "Extended ASCII"); that may be sufficient for 90% of use in English (my naïve guesstimate), but would bite you in the ass as soon as you even think of anything beyond that subset (see the word naïve - English, yet can't be written just with ASCII).
So, although UTF-16 (which Java uses) looks wasteful, it's actually not. Anyway, unless you're on a very limited embedded system (in which case, what are you doing there with Java?), trying to trim down the strings is pointless micro-optimization.
For a slightly longer introduction to character encodings, see e.g. this: http://www.joelonsoftware.com/articles/Unicode.html
When converting byte[] to String using UTF-8, each byte (8 bit) becomes an 8-bit character encoded in UTF-8
No. When converting byte[] to String using UTF-8, each UTF-8 sequence of 1-4 bytes is converted into a UTF-16 sequence of 1-2 16-bit chars.
In almost all cases, worldwide, this UTF-16 sequence contains a single character.
In Western Europe and North America, for most text, only 8 bits of this 16-bit character are used. However, if you have a Euro sign, you'll need more than 8 bits.
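To make this concrete, a quick sketch (the Euro sign is U+20AC; works on Java 6+):

String euro = "\u20AC"; // the Euro sign
byte[] utf8 = euro.getBytes(java.nio.charset.Charset.forName("UTF-8"));
System.out.println(euro.length()); // 1 -- a single 16-bit char in UTF-16
System.out.println(utf8.length);   // 3 -- three bytes in UTF-8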
For more information, see Unicode. Or Joel Spolsky's article.
Java stores all its chars internally as two-byte representations of the value. However, they aren't stored the same as UTF-8. For example, the maximum value supported is '\uFFFF' (hex FFFF, decimal 65535), or 11111111 11111111 in binary (two bytes) - but this would be a 3-byte UTF-8 character on disk.
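A quick way to check this (sketch):

// '\uFFFF' fits in a single 16-bit char, but its UTF-8 encoding needs 3 bytes.
char max = '\uFFFF';
byte[] utf8 = String.valueOf(max).getBytes(java.nio.charset.Charset.forName("UTF-8"));
System.out.println(utf8.length); // 3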
The only possible wastage is for genuinely single-byte characters in memory (most ASCII "language" characters actually fit in 7 bits). When the characters are written to disk, they'll be in the specified encoding anyway (so UTF-8 single-byte characters will only occupy one byte).
The only place it makes a difference is in the JVM heap. However, you'd have to have thousands and thousands of 8-bit characters to notice any real difference in Java heap usage - which would be far outweighed by all the extra (hacky) processing you've done.
A million-odd 8-bit characters in RAM are only 'wasting' about 1 MiB anyway (a million chars at 2 bytes each is ~2 MiB, versus ~1 MiB in an 8-bit representation)...
Redis (the actual server) is dealing only with "binary safe" Strings.
I take this to mean that you can use arbitrary octet sequences for the keys/values. If you can use any C char sequence without thought to character encoding, then the equivalent in Java is the byte type.
Strings in Java are implicitly UTF-16. I mean, you could stick arbitrary numbers in there, but the intent of the class is to represent Unicode character data. Methods that do byte-to-char transformations perform transcoding operations from a known encoding to UTF-16.
If Jedis treats keys/values as UTF-8, then it will not support every value that Redis supports. Not every byte sequence is valid UTF-8, so the encoding can't be used for binary safe strings.
Whether UTF-8 or UTF-16 consumes more memory depends on the data - the euro symbol (€) for example consumes three bytes in UTF-8 and only two in UTF-16.
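A quick demonstration of why UTF-8 can't be binary safe (a sketch; the byte values are just an arbitrary example of invalid UTF-8):

// 0xFF can never appear in valid UTF-8, so decoding replaces it with
// U+FFFD and the original bytes are lost on the way back.
byte[] original = { (byte) 0xFF, (byte) 0xFE, 'a' };
String decoded = new String(original, java.nio.charset.Charset.forName("UTF-8"));
byte[] roundTripped = decoded.getBytes(java.nio.charset.Charset.forName("UTF-8"));
System.out.println(java.util.Arrays.equals(original, roundTripped)); // false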
Just for the record, I wrote my own little implementation of a byte[] <-> String interconverter that works by packing every 2 bytes into 1 char. It's roughly 30-40% faster and consumes (possibly less than) half the memory of the standard Java way: new String(somebytes) and someString.getBytes().
However, it is incompatible with existing string-encoded bytes or byte-encoded strings. Furthermore, it is not safe to call the method from different JVMs on shared data.
https://github.com/ib84/castriba
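The core idea is roughly this (a minimal sketch of the approach, not the actual castriba code; it ignores odd-length input and fixes big-endian byte order for brevity):

// Pack two bytes into each 16-bit char; unpack by splitting each char again.
static String bytesToString(byte[] b) { // assumes even length
    char[] c = new char[b.length / 2];
    for (int i = 0; i < c.length; i++)
        c[i] = (char) (((b[2 * i] & 0xFF) << 8) | (b[2 * i + 1] & 0xFF));
    return new String(c);
}

static byte[] stringToBytes(String s) {
    byte[] b = new byte[s.length() * 2];
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        b[2 * i] = (byte) (c >> 8);
        b[2 * i + 1] = (byte) (c & 0xFF);
    }
    return b;
}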
Maybe this is what you want:
// Store two 8-bit values in the 16-bit datatype.
char c1_8bit = 'a';
char c2_8bit = 'h';
char two_chars = (char) ((c1_8bit << 8) + c2_8bit); // cast needed: << promotes to int

// Extract them again (fresh names; Java won't allow redeclaring c1_8bit here).
char c1_again = (char) (two_chars >> 8);
char c2_again = (char) (two_chars & 0xFF);
Of course this trick only works with chars in the range [0-255] (that is the Latin-1 range; ASCII proper only covers [0-127]). Why?
Because you want to store your chars this way:
xxxx xxxx yyyy yyyy
with x being char 1 and y being char 2. So this means you have only 8 bits per char. And what is the biggest integer you can make with 8 bits? Answer: 255.
255 = 0000 0000 1111 1111 (8 bits). And when you use a char > 255, then you will have this:
256 = 0000 0001 0000 0000 (more than 8 bits), which doesn't fit in the 8 bits you provide for one char.
Plus: keep in mind Java is a language developed by clever people. They knew what they were doing. Trust the Java API.