Why is string.intern() so slow?_问答_开发者_运维开发者技术经验分享

Before anyone questions the fact of using string.intern() at all, let me say that I need it in my particular application for memory and performance reasons. ^[1]

So, until now I used String.intern() and assumed it was the most efficient way to do it. However, I noticed since ages it is a bottleneck in the software. ^[2]

Then, just recently, I tried to replace the String.intern() by a huge map where I put/get the strings in order to obtain each time a unique instance. I expected this would be slower... but it was exactly the opposite! It was tremendously faster! Replacing the intern() by pushing/polling a map (which achieves exactly the same) resulted in more than one order of magnitude faster.

The question is: why is intern() so slow?!? Why isn't it then simply backed up by a map (or actually, just a customized set) and would be tremendously faster? I'm puzzled.

[1]: For the unconvinced ones: It is in natural language processing and has to process gigabytes of text, therefore needs to avoid many instances of a same string to avoid blowing up the memory and referential string comparison to be fast enough.

[2]: without it (normal strings) it is impossible, with it, this particular step remains the most computation intensive one

EDIT:

Due to the surprising interest in this post, here is some code to test it out:

http://pastebin.com/4CD8ac69

And the results of interning a bit more than 1 million strings:

HashMap: 4 seconds
String.intern(): 54 seconds

Due to avoid some warm-up / OS IO caching an开发者_JAVA百科d stuff like this, the experiment was repeated by inverting the order of both benchmarks:

String.intern(): 69 seconds
HashMap: 3 seconds

As you see, the difference is very noticeable, more than tenfolds. (Using OpenJDK 1.6.0_22 64bits ...but using the sun one resulted in similar results I think)

This article discusses the implementation of String.intern(). In Java 6 and 7, the implementation used a fixed size (1009) hashtable so as the number entries grew, the performance became O(n). The fixed size can be changed using -XX:StringTableSize=N. Apparently, in Java8 the default size is larger but issue remains.

Most likely reason for the performance difference: String.intern() is a native method, and calling a native method incurs massive overhead.

So why is it a native method? Probably because it uses the constant pool, which is a low-level VM construct.

@Michael Borgwardt said this in a comment:

intern() is not synchronized, at least at the Java language level.

I think that you mean that the String.intern() method is not declared as synchronized in the sourcecode of the String class. And indeed, that is a true statement.

However:

Declaring intern() as synchronized would only lock the current String instance, because it is an instance method, not a static method. So they couldn't implement string pool synchronization that way.
If you step back and think about it, the string pool has to perform some kind of internal synchronization. If it didn't it would be unusable in a multi-threaded application, because there is simply no practical way for all code that uses the intern() method to do external synchronization.

So, the internal synchronization that the string pool performs could be a bottleneck in multi-threaded application that uses intern() heavily.

I can't speak from any great experience with it, but from the String docs:

"When the intern method is invoked, if the pool already contains a string equal to this String object as determined by the {@link #equals(Object)} method, then the string from the pool returned. Otherwise, this String object is added to the pool and a reference to this String object is returned."

When dealing with large numbers of objects, any solution involving hashing will outperform one that doesn't. I think you're just seeing the result of misusing a Java language feature. Interning isn't there to act as a Map of strings for your use. You should use a Map for that (or Set, as appropriate). The String table is for optimization at the language level, not the app level.