开发者

Java : linear algorithm but non-linear performance drop, where does it come from?

开发者 https://www.devze.com 2022-12-20 03:40 出处:网络
I am currently having heavy performance issues with an application I\'m developping in natural language processing. Basically, given texts, it gathers various data and does a bit of number crunching.

I am currently having heavy performance issues with an application I'm developping in natural language processing. Basically, given texts, it gathers various data and does a bit of number crunching.

And for every sentence, it does EXACTLY the same. The algorithms applied to gather the statistics do not evolve with previously read data and therefore stay the same.

The issue is that the processing time does not evolve linearly at all: 1 min for 10k sentences, 1 hour for 100k and days for 1M...

I tried everything I could, from re-implementing basic data structures to object pooling to recycles instances. The behavior doesn't change. I get non-linear increase in time that seem impossible to justify by a little more hashmap collisions, nor by IO waiting, nor by anything! Java starts to be sluggish when data increases and I feel totally helpless.

If you want an example, just try the following: count the number of occurences of each word in a big file. Some code is shown below. By doing this, it takes me 3 seconds over 100k sentences and 326 seconds over 1.6M ...so a multiplicator of 110 times instead of 16 times. As data grows more, it just get worse...

Here is a code sample: Note that I compare strings by reference (for efficiency reasons), this can be done thanks to the 'String.intern()' method which returns a unique reference per string. And the map is never re-hashed during the whole process for the numbers given above.

public class DataGathering
{
 SimpleRefCounter<String> counts = new SimpleRefCounter<String>(1000000);

 private void makeCounts(String path) throws IOException
 {

  BufferedReader file_src = new BufferedReader(new FileReader(path));

  String line_src;

  int n = 0;
  while (file_src.ready())
  {
   n++;

   if (n % 10000 == 0)
    System.out.print(".");

   if (n % 100000 == 0)
    System.out.println("");

   line_src = file_src.readLine();

   String[] src_tokens = line_src.split("[ ,.;:?!'\"]");

   for (int i = 0; i < src_tokens.length; i++)
   {
    String src = src_tokens[i].intern();
    counts.bump(src);
   }

  }
  file_src.close();
 }

 public static void main(String[] args) throws IOException
 {
  String path = "some_big_file.txt";

  long timestamp = System.currentTimeMillis();

  DataGathering dg = new DataGathering();
  dg.makeCounts(path);


  long time = (System.currentTimeMillis() - timestamp) / 1000;
  System.out.println("\nElapsed time: " + time + "s.");
 }
}

public class SimpleRefCounter<K>
{
 static final double GROW_FACTOR = 2;
 static final double LOAD_FACTOR = 0.5;

 private int capacity;

 private Object[] keys;
 private int[] counts;


 public SimpleRefCounter()
 {
  this(1000);
 }

 public SimpleRefCounter(int capacity)
 { 
  this.capacity = capacity;
  keys = new Object[capacity];
  counts = new int[capacity];
 }



 public synchronized int increase(K key, int n)
 {
  int id = System.identityHashCode(key) % capacity;

  while (keys[id] != null && keys[id] != key) // if it's occupied, let's move to the next one!
   id = (id + 1) % capacity;


  if (keys[id] == null)
  {
   key_count++;
   keys[id] = key;

   if (key_count > LOAD_FACTOR * capacity)
   {
    resize((int) (GROW_FACTOR * capacity));
   }
  }


  counts[id] += n;

  total += n;

  return counts[id];
 }



 public synchronized void resize(int capacity)
 {
  System.out.println("Resizing counters: " + this);

  this.capacity = capacity;

  Object[] new_keys = new Object[capacity];
  int[] new_counts = new int[capacity];

  for (int i = 0; i < keys.length; i++)
  {
   Object key = keys[i];
   int count = counts[i];

   int id = System.identityHashCode(key) % capacity;

   while (new_keys[id] != null && new_keys[id] != key) // if it's occupied, let's move to the next one!
    id = (id + 1) % capacity;

   new_keys[id] = key;
   new_counts[id] = count;
  }

  this.keys = new_keys;
  this.counts = new_counts;
 }


 public int bump(K key)
 {
  return increase(key, 1);
 }

 public int get(K key)
 {
  int id = System.identityH开发者_如何学运维ashCode(key) % capacity;

  while (keys[id] != null && keys[id] != key) // if it's occupied, let's move to the next one!
   id = (id + 1) % capacity;


  if (keys[id] == null)
   return 0;
  else
   return counts[id];
 }
    }

Any explanations? Ideas? Suggestions?

...and, as said in the beginning, it is not for this toy example in particular but for the more general case. This same exploding behavior occurs for no reason in the more complex and larger program.


Rather than feeling helpless use a profiler! That would tell you where exactly in your code all this time is spent.


Bursting the processor cache and thrashing the Translation Lookaside Buffer (TLB) may be the problem.

For String.intern you might want to do your own single-threaded implementation.

However, I'm placing my bets on the relatively bad hash values from System.identityHashCode. It clearly isn't using the top bit, as you don't appear to get ArrayIndexOutOfBoundsExceptions. I suggest replacing that with String.hashCode.


String[] src_tokens = line_src.split("[ ,.;:?!'\"]");

Just an idea -- you are creating a new Pattern object for every line here (look at the String.split() implementation). I wonder if this is also contributing to a ton of objects that need to be garbage collected?

I would create the Pattern once, probably as a static field:

final private static Pattern TOKEN_PATTERN = Pattern.compile("[ ,.;:?!'\"]");

And then change the split line do this:

String[] src_tokens = TOKEN_PATTERN.split(line_src);

Or if you don't want to create it as a static field, as least only create it once as a local variable at the beginning of the method, before the while.


In get, when you search for a nonexistent key, search time is proportional to the size of the set of keys.

My advice: if you want a HashMap, just use a HashMap. They got it right for you.


You are filling up the Perm Gen with the string intern. Have you tried viewing the -Xloggc output?


I would guess it's just memory filling up, growing outside the processor cache, memory fragmentation and the garbage collection pauses kicking in. Have you checked memory use at all? Tried to change the heap size the JVM uses?


  1. Try to do it in python, and run the python module from Java.
  2. Enter all the keys in the database, and then execute the following query:

    select key, count(*) from keys group by key

  3. Have you tried to only iterate through the keys without doing any calculations? is it faster? if yes then go with option (2).


Can't you do this? You can get your answer in no time.


It's me, the original poster, something went wrong during registration, so I post separately. I'll try the various suggestions given. PS for Tom Hawtin: thanks for the hints, perhaps the 'String.intern()' takes more and more time as vocabulary grows, i'll check that tomorrow, as everything else.

0

精彩评论

暂无评论...
验证码 换一张
取 消