The more I use a Java HashMap, the more the performance drops - even with stable size

I want to scan through a huge corpus of text and count word frequencies (n-gram frequencies actually for those who are familiar with NLP/IR). I use a Java HashMap for this. So what happens is I process the text line by line. For each line, I extract the words, and for each word, I update the corresponding frequency in the hashmap.
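To make the setup concrete, here is a stripped-down sketch of what my counting loop does (the class and method names are simplified; my real code is in the pastebin link below):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class WordCounter {
    private final Map<String, Integer> freqMap = new HashMap<String, Integer>();

    // Process one line: split on whitespace and bump each word's count.
    void countLine(String line) {
        for (String word : line.split("\\s+")) {
            if (word.isEmpty()) continue; // split can yield a leading empty token
            Integer count = freqMap.get(word);
            freqMap.put(word, count == null ? 1 : count + 1);
        }
    }

    // Read the corpus line by line and count every line.
    void countFile(String path) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(path));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                countLine(line);
            }
        } finally {
            reader.close();
        }
    }
}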

The problem is that this process gets slower and slower. For example, it starts by processing around 100k lines/second, and the performance starts falling right away. After about 28 million lines, it has fallen to 16k lines/second, and of course it keeps falling.

The first thing that came to mind was that it was caused by too many entries in the hashmap, which made every put and every get slower over time. So what I tried was to keep only the most frequent entries (say 100k) in the hashmap at any time. This was done by using a second map that mapped frequencies to words (as in here: Automatically sorted by values map in Java).
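To illustrate, here is a sketch of that capped-map idea (again, the names are mine and simplified from my actual code):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Frequency map capped at `capacity` entries: a second TreeMap keyed by
// frequency keeps the least frequent words cheap to find and evict.
public class CappedFreqMap {
    private final int capacity;
    private final Map<String, Integer> wordToFreq = new HashMap<String, Integer>();
    private final TreeMap<Integer, Set<String>> freqToWords = new TreeMap<Integer, Set<String>>();

    public CappedFreqMap(int capacity) {
        this.capacity = capacity;
    }

    public void increment(String word) {
        Integer oldFreq = wordToFreq.get(word);
        int newFreq = (oldFreq == null) ? 1 : oldFreq + 1;
        wordToFreq.put(word, newFreq);

        // Move the word from its old frequency bucket to the new one.
        if (oldFreq != null) {
            removeFromBucket(oldFreq, word);
        }
        Set<String> bucket = freqToWords.get(newFreq);
        if (bucket == null) {
            bucket = new HashSet<String>();
            freqToWords.put(newFreq, bucket);
        }
        bucket.add(word);

        // Over capacity: evict one of the least frequent words.
        if (wordToFreq.size() > capacity) {
            int lowestFreq = freqToWords.firstKey();
            String victim = freqToWords.get(lowestFreq).iterator().next();
            removeFromBucket(lowestFreq, victim);
            wordToFreq.remove(victim);
        }
    }

    private void removeFromBucket(int freq, String word) {
        Set<String> bucket = freqToWords.get(freq);
        bucket.remove(word);
        if (bucket.isEmpty()) {
            freqToWords.remove(freq);
        }
    }
}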

This performed a lot faster in general (it started at 56k lines/sec, and by the time it reached 28 million lines the performance had only dropped to 36.5k lines/sec). However, this also kept falling, at a much slower rate - but the fact remains that it kept falling.

Have you got any possible explanation of why this happens when the hashmap's size remains the same? Do you think this has anything to do with the garbage collector? Meaning, does the fact that I keep putting and removing objects to/from hashmaps fragment the memory or something? Or could it be a hashing function problem? Since I'm using strings, the hashing function is Java's default hashing function for strings.

Here is the part of my code that performs the aforementioned task:

http://pastebin.com/P8S6Sj86

NOTE: I am a Java newbie, so any elaboration in your answers is more than welcome.


I recommend using Java VisualVM to do some profiling. This comes with Java - go to the command line and type jvisualvm to run it. This makes it easy to see if memory churn is your problem, or if particular types of objects are being created hundreds of thousands of times.

If you break up your code into several methods, you'll also be able to tell which methods take too long to run.

I did notice that you are unnecessarily creating lots of objects in inner loops. This certainly won't help performance, although it may not be the main culprit.

For example:

float avg = new Float(sumItems) / new Float (freqMap.size());

should just be

float avg = (float)sumItems / freqMap.size();

Another piece of your code that could be troublesome is:

System.out.println(numItems + " items counted");

Depending on your operating system or IDE, writing hundreds of thousands of lines to the console requires significant time. Instead, write a progress update only every 1,000 items.
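For example, a guarded print, assuming numItems is your running item counter:

if (numItems % 1000 == 0) {
    System.out.println(numItems + " items counted");
}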


Suggestion:

Try implementing a custom hashCode method for the object you're storing in your hashmap. Here are some links, with a sketch of the idea after them:

Java HashMap performance optimization / alternative

http://www.ibm.com/developerworks/java/library/j-jtp05273/index.html

http://www.javamex.com/tutorials/collections/hash_function_guidelines.shtml

Bad idea to use String key in HashMap?
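To show the pattern those links describe: note that java.lang.String already caches its hash code, so a custom hashCode mainly pays off if you wrap the n-gram in your own key class (with a precomputed hash) rather than concatenating the words into one String per lookup. A possible sketch - the NGramKey name is mine, not from your code:

import java.util.Arrays;

// Immutable n-gram key that computes its hash once, in the constructor,
// instead of re-deriving it on every map operation.
public final class NGramKey {
    private final String[] tokens;
    private final int hash;

    public NGramKey(String[] tokens) {
        this.tokens = tokens.clone();
        this.hash = Arrays.hashCode(this.tokens);
    }

    @Override
    public int hashCode() {
        return hash;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof NGramKey)) return false;
        return Arrays.equals(tokens, ((NGramKey) other).tokens);
    }
}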
