I'm currently running a Perl program where I have to take a 1-million-line text file, break it down into chunks (anywhere between 50 and 50,000 lines per chunk), and run some calculations on each chunk. Right now, I load all of the data into array1, then use array2 to pull out just the chunk of data I need. I do what I need to do on array2, then go back and grab the next set.
Example data:
A, blah1, blah2
A, blah6, blah7
A, blah4, blah5
B, blah2, blah2
So I would grab the first three (the A rows) into array2, sort them, then move on to the next set. My program works pretty well and efficiently to begin with, but it experiences a severe slowdown later on.
50k lines take 50 seconds, 100k take 184 seconds, 150k take 360 seconds, 200k take 581 seconds, and it only gets worse (roughly quadratically) as the program continues (4500 seconds by line 500k).
No, I cannot use a database for this project. Any suggestions?
my @Rows1 = <FILE>;     # slurp the whole file into memory
my $temp  = @Rows1;     # number of lines

for (my $k = 0; $k < $temp; $k++)
{
    my @temp2array = ();
    my $temp2count = 0;
    my $thisrow    = $Rows1[$k];
    my @thisarray  = split(',', $thisrow);
    my $currcode   = $thisarray[0];   # key in the first column
    my $flag123    = 0;

    $temp2array[$temp2count] = $thisrow;
    $temp2count++;

    # pull in every following row that has the same key
    while ($flag123 == 0 && $k + 1 < $temp)
    {
        my $nextrow   = $Rows1[$k + 1];
        my @nextarray = split(',', $nextrow);
        if ($currcode eq $nextarray[0])
        {
            $temp2array[$temp2count] = $nextrow;
            $k++;
            $temp2count++;
        }
        else
        {
            $flag123 = 1;
        }
    }

    # ... sort and process @temp2array here ...
}
I have edited my code to more closely resemble the answer below, and I get these times:
50k = 42, 100k = 133, 150k = 280, 200k = 467, 250k = 699, 300k = 978, 350k = 1313
It's not exactly staying linear, and by this trend, the program will still take 14,000+ seconds. I'll investigate the other parts of the code.
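To narrow down where the time is going, I'm wrapping the per-chunk work with Time::HiRes timers (a rough sketch; do_chunk() here just stands in for my real calculation code):

use strict;
use warnings;
use Time::HiRes qw(time);

my $chunks_done = 0;
my $chunk_time  = 0;

sub do_chunk {
    my @rows = @_;
    # stand-in for the real per-chunk calculations
}

sub timed_chunk {
    my $start = time();
    do_chunk(@_);
    $chunk_time += time() - $start;
    $chunks_done++;
    # if this running average keeps climbing, the slowdown is inside the
    # per-chunk code, not in how the file is being read or split up
    printf STDERR "chunks: %6d  avg: %.5f s\n", $chunks_done, $chunk_time / $chunks_done
        if $chunks_done % 500 == 0;
}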
Loading an entire large file into memory will slow you down as your OS will need to start swapping pages of virtual memory. In such cases, it is best to deal with only the section of the file that you need.
In your case, you seem to be processing lines that have the same value in the first field together, so you could do something like:
my @lines = ();
my $current_key = '';

while (<FILE>) {
    my ($key) = split /,/;    # get first column
    if ($key ne $current_key) {
        # new key: process all the lines from the previous key
        if (@lines > 0) {
            process(@lines);
        }
        @lines = ();
        $current_key = $key;
    }
    push @lines, $_;
}

# don't forget the lines from the last key
if (@lines > 0) {
    process(@lines);
}
This way, you are only storing in memory enough lines to make up one group.
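If it helps, here is a minimal sketch of what process() could look like, assuming the per-group work is just sorting the rows on their second column (that column choice is a guess on my part):

# Rough sketch only: sort one group of comma-separated lines by their
# second field, then hand the sorted rows to whatever calculations follow.
sub process {
    my @group = @_;
    my @sorted = sort {
        (split /,/, $a)[1] cmp (split /,/, $b)[1]    # compare second column
    } @group;
    # ... run the per-group calculations on @sorted here ...
}

For very large groups you would want to split each line only once (for example with a Schwartzian transform) instead of re-splitting inside the sort comparison.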
(I am assuming that the input data is sorted or organized by the key. If that's not the case, you could make multiple passes through the file: a first pass to see what keys you will need to process, and subsequent passes to collect the lines associated with each key.)
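A rough sketch of that multi-pass idea is below ($filename, process(), and the column layout are placeholders). It trades repeated I/O for low memory use, so it only makes sense when the number of distinct keys is small:

use strict;
use warnings;

my $filename = 'data.txt';    # placeholder for the real input file

# Pass 1: record each distinct key (first column) in the order seen.
my (%seen, @keys);
open my $fh, '<', $filename or die "Cannot open $filename: $!";
while (<$fh>) {
    my ($key) = split /,/;
    push @keys, $key unless $seen{$key}++;
}
close $fh;

# One extra pass per key: collect only that key's lines, then process them.
for my $key (@keys) {
    open my $pass, '<', $filename or die "Cannot open $filename: $!";
    my @lines;
    while (<$pass>) {
        push @lines, $_ if (split /,/)[0] eq $key;
    }
    close $pass;
    process(@lines);
}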
Does just running the code you show exhibit the slowdown? If not, the problem is in the code that actually processes each @temp2array chunk; perhaps some variable(s) still have data left over from previous chunks.
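For instance (a made-up illustration, not your code), an accumulator that is declared outside the chunk loop and never cleared makes each chunk more expensive than the last, which produces exactly this kind of worse-than-linear curve:

use strict;
use warnings;

# Illustration only: a variable that is never reset between chunks.
my @running;    # declared once, outside the chunk loop

for my $chunk_number (1 .. 1000) {
    my @chunk = (1 .. 100);        # stand-in for one chunk of rows
    push @running, @chunk;         # keeps growing across chunks
    my @sorted = sort { $a <=> $b } @running;   # BUG: re-sorts everything
                                                # processed so far, so total
                                                # run time grows much faster
                                                # than linearly
}
# Fix: use a fresh array per chunk (my @chunk_data inside the loop),
# or reset the accumulator with @running = () at the top of each iteration.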