I have a big XML file and parsing it consumes a lot of memory.
Since I believe most of that memory goes to the many user names in the file, I shortened each user name from ~28 bytes to 10 bytes and ran it again, but it still uses almost the same amount of memory. The XML file is parsed with SAX, and during handling the result is stored in a hash structure like this: $this->{'date'}->{'school 1'}->{$class}->{$student}...
Why is memory usage still so high after I reduced the length of the student names? Is it possible that when the data is stored in a hash there is a lot of overhead, no matter how long the string is?
Perl hashes use a technique known as bucket-chaining. All keys that have the same hash (see the macro PERL_HASH_INTERNAL in hv.h) go in the same "bucket," a linear list.
According to the perldata documentation:
If you evaluate a hash in scalar context, it returns false if the hash is empty. If there are any key/value pairs, it returns true; more precisely, the value returned is a string consisting of the number of used buckets and the number of allocated buckets, separated by a slash. This is pretty much useful only to find out whether Perl's internal hashing algorithm is performing poorly on your data set. For example, you stick 10,000 things in a hash, but evaluating %HASH in scalar context reveals "1/16", which means only one out of sixteen buckets has been touched, and presumably contains all 10,000 of your items. This isn't supposed to happen. If a tied hash is evaluated in scalar context, a fatal error will result, since this bucket usage information is currently not available for tied hashes.
To see whether your dataset has a pathological distribution, you could inspect the various levels in scalar context, e.g.,
print scalar(%$this), "\n",
      scalar(%{ $this->{date} }), "\n",
      scalar(%{ $this->{date}{"school 1"} }), "\n",
      ...
For a somewhat dated overview, see How Hashes Really Work at perl.com.
The modest reduction in the lengths of students' names, keys that are four levels down, won't make a significant difference. In general, the perl implementation has a strong bias toward throwing memory at problems. It ain't your father's FORTRAN.
Yes, there is a LOT of overhead. If possible, don't store the data as a full tree, especially since you're using a SAX parser, which, unlike a DOM parser, doesn't force you to.
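For example, a minimal sketch assuming an XML::SAX handler and a made-up layout where <class name="..."> elements contain <student name="..."> elements (adapt it to whatever totals you actually need); it aggregates during parsing and never keeps the per-student data at all:

# Count students per class on the fly instead of building the whole tree.
package CountingHandler;
use strict;
use warnings;
use base 'XML::SAX::Base';

sub start_element {
    my ($self, $el) = @_;
    if ($el->{Name} eq 'class') {
        # remember which class we are currently inside
        $self->{current_class} = $el->{Attributes}{'{}name'}{Value};
    }
    elsif ($el->{Name} eq 'student') {
        # aggregate immediately; the student name itself is never stored
        $self->{counts}{ $self->{current_class} }++;
    }
}

package main;
use XML::SAX::ParserFactory;

my $handler = CountingHandler->new;
my $parser  = XML::SAX::ParserFactory->parser(Handler => $handler);
$parser->parse_uri('students.xml');    # file name is illustrative

printf "%s: %d students\n", $_, $handler->{counts}{$_}
    for sort keys %{ $handler->{counts} };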
If you MUST store the entire tree, one possible workaround is storing arrays of arrays - e.g. you store all student names in an array (with, say, "mary123456" being stored in $students[11]), and then store the hash value that would have been ...->{"mary123456"} as ->[11] instead.
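A rough sketch of that idea (the intern_student helper, the %student_index side lookup, and the sample data are illustrative, not from your code); the side hash is only needed while building and holds each distinct name once:

use strict;
use warnings;

my (@students, %student_index);    # name list and name-to-index lookup

sub intern_student {
    my ($name) = @_;
    # assign the next free slot the first time a name is seen
    $student_index{$name} //= do { push @students, $name; $#students };
    return $student_index{$name};
}

# Hypothetical values standing in for what your SAX handler sees:
my ($this, $date, $school, $class) = ({}, '2010-09-01', 'school 1', 'class A');
my ($student, $record) = ('mary123456', { grade => 'A' });

# Instead of $this->{$date}{$school}{$class}{$student} = $record:
$this->{$date}{$school}{$class}[ intern_student($student) ] = $record;

# The name can still be recovered from the index:
my $idx  = $student_index{$student};
my $name = $students[$idx];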
It WILL increase processing time due to the extra layers of indirection, but the overall runtime might actually decrease thanks to lower memory usage and thus less swapping/thrashing.
Another option is using hashes tied to files, though that would be REALLY slow due to the disk I/O bottleneck, of course.
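If you go that route, a hedged sketch using DB_File (one of several modules that can tie a hash to an on-disk database; the file name is illustrative) might look like this:

use strict;
use warnings;
use DB_File;
use Fcntl qw(O_CREAT O_RDWR);

# Tie %on_disk to a Berkeley DB file so entries live on disk, not in RAM.
my %on_disk;
tie %on_disk, 'DB_File', 'students.db', O_CREAT | O_RDWR, 0666, $DB_HASH
    or die "Cannot tie students.db: $!";

$on_disk{'mary123456'} = 'some serialized record';
untie %on_disk;

Note that a DB_File-tied hash stores flat strings, so a nested structure like yours would need to be serialized first (e.g. with a layer such as MLDBM).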
It may be useful to use the Devel::Size module, which can report back how big various data structures are:
use Devel::Size qw(total_size);
print "Total Size is: ".total_size($hashref)."\n";