I'm creating a hash table in Perl, of an unknown size.
The hash table maps a string to a reference to an array.
The main loop of my application adds 5-10 elements to the hash table in each iteration. As the hash table fills up, things start to slow down drastically. From observation, when there are ~50k keys in the hash table, adding keys slows by a magnitude of 20x.
I postulate that the hash table has become full, and collisions are occurring. I would like to 'reserve' the size of the hash table, but I'm unsure how.
The hash in question is hNgramsToWord.
For each word, the 1-len-grams of that word are added as keys, with a reference to an array of words which contain that ngram.
For example:
AddToNgramHash("Hello");
[h, e, l, l, o, he, el, ll, lo, hel, llo, hell, ello, hello ] are all added as keys, mapping to "hello"
sub AddToNgramHash($) {
my $word = shift;
my @aNgrams = MakeNgrams($word);
foreach my $ngram (@aNgrams) {
my @aWords;
if(defined($hNgramsToWord{$ngram})) {
@aWords = @{$hNgramsToWord{$ngram}};
}
push (@aWords, $word);
$hNgramsToWord{$ngram} = \@aWords;
}
return scalar keys %hNgramsToWord;
}
sub MakeNgrams($) {
my $word = shift;
my $len = length($word);
my @aNgrams;
for(1..$len) {
m开发者_Go百科y $ngs = $_;
for(0..$len-$ngs) {
my $ngram = substr($word, $_, $ngs);
push (@aNgrams, $ngram);
}
}
return @aNgrams;
}
You can set the number of buckets for a hash like so:
keys(%hash) = 128;
The number will be rounded up to a power of two.
That said, it is very unlikely that the slowdown you see is due to excess hash collisions, since Perl will dynamically expand the number of buckets as needed. And since 5.8.2, it will even detect pathological data that results in a given bucket being overused and reconfigure the hashing function for that hash.
Show your code, and we will likely be able to help find the real problem.
A demonstration of a large number of hash keys (don't let it continue till you are out of memory...):
use strict;
use warnings;
my $start = time();
my %hash;
$SIG{ALRM} = sub {
alarm 1;
printf(
"%.0f keys/s; %d keys, %s buckets used\n",
keys(%hash) / (time() - $start),
scalar(keys(%hash)),
scalar(%hash)
);
};
alarm 1;
$hash{rand()}++ while 1;
Once there are a LOT of keys, you will notice a perceptible slowdown when it needs to expand the number of buckets, but it still maintains a pretty even pace.
Looking at your code, the more words are loaded, the more work it has to do for each word.
You can fix it by changing this:
my @aWords;
if(defined($hNgramsToWord{$ngram})) {
@aWords = @{$hNgramsToWord{$ngram}};
}
push (@aWords, $word);
$hNgramsToWord{$ngram} = \@aWords;
to this:
push @{ $hNgramsToWord{$ngram} }, $word;
No need to copy the array twice. No need to check if the ngram already has an entry - it will autovivify an array reference for you.
精彩评论