开发者

How can I improve Perl's performance with massive data?

开发者 https://www.devze.com 2022-12-15 13:45 出处:网络
I\'m searching for improving my algorithm for an allocate massive data riddle. If anyone can think of some good idea that makes it run faster, I\'ll appreciate it.

I'm searching for improving my algorithm for an allocate massive data riddle. If anyone can think of some good idea that makes it run faster, I'll appreciate it.

I have 8 files contains 2 parameters:

1) start point

2) end point

These type of data repeat itself with different numbers along the file. means I could have a file with 200 start points + 200 end points, or a file with 50000 start points + 50000 end points.

OK, these data eventuality reflect the X axis data points. each point start/end will be sorted into X_sort_array.

The problem starts with the Y axis data points.

In order to create the Y axis data points, I made up with some algorithm that run all over the X_sort_array points and check for each point if it is between all of start-end points individuality . if it is add one for Y_array[x_point] and so on...

I'll paste the massive part of my code here, if there is a syntax/way to perform it more efficiency, I will glad to know.

when sort_all_legth bigger then 50000 - the slow motion began:

    my $sort_all_length = @{$values{"$tasks_name[$i]\_sort\_allvals"}};
    my $core_count = 0;
    my $m=0;

    for my $k(0 .. $sort_all_length-1){



    if (($values{"$tasks_name[$i]\_sort\_allvals"}->[$k] eq $values{"$tasks_name[$i]\_sort\_allvals"}->[$k-1])
& ($k != 0) ) {
}

    else {
    my $pre_point_flag;
    # CHEACK PREVIUES POINT:
    for my $inst_num(0 .. $sort_all_length/2-1){
     $pre_point_flag=1;
     if ( ($values{"TID$i\_$tasks_name[$i]\_SHL"}->[$inst_num] <=    $values{"$tasks_name[$i]\_sort\_allvals"}->[$k]-1) 
   & ($values{"$tasks_name[$i]\_sort\_allvals"}->[$k]-1 <= $values{"TID$i\_$tasks_name[$i]\_EHL"}->[$inst_num]) )  {
    $pre_point_flag=0;
    last;
 }
}
if ($pre_point_flag eq 1){
 push(@{$values{"$tasks_name[$i]\_Y"}},0);
 push(@{$values{"$tasks_name[$i]\_X"}},$values{"$tasks_name[$i]\_sort\_allvals"}->[$k]-1);
}

# CHEACK DEFINE POINT:
for my $inst_num(0 .. $sort_all_length/2-1){
 if ( ($values{"TID$i\_$tasks_name[$i]\_SHL"}->[$inst_num] <= $values{"$tasks_name[$i]\_sort\_allvals"}->[$k]) 
   & ($values{"$tasks_name[$i]\_sort\_allvals"}->[$k] <= $values{"TID$i\_$tasks_name[$i]\_EHL"}->[$inst_num]) )  {
    $core_count++;
 }

}
push(@{$values{"$tasks_name[$i]\_Y"}},$core_count);
push(@{$values{"$tasks_name[$i]\_X"}},$values{"$tasks_name[$i]\_sort\_allvals"}->[$k]);

# CHEACK LATER POINT:
for my $inst_num(0 .. $sort_all_length/2-1){
 $pre_point_flag=1;
 if ( ($values{"TID$i\_$tasks_name[$i]\_SHL"}->[$inst_num] <= $values{"$tasks_name[$i]\_sort\_allvals"}->[$k]+1) 
   & ($values{"$tasks_name[$i]\_sort\_allvals"}->[$k]+1 <= $values{"TID$i\_$tasks_name[$i]\_EHL"}->[$inst_num]) )  {
    $pre_point_flag=0;
    last;
 }
}
if ($pre_point_flag eq 1){
 push(@{$values{"$tasks_name[$i]\_Y"}},0);
 push(@{$values{"$tasks_name[$i]\_X"}},$values{"$tasks_name[$i]\_sort\_allvals"}->[$k]+1);
}

if ($y_max_val < $core_count){
 $y_max_val = $core_count;
}

$core_count = 0;
$m++;
}



}

small data example that has good performance :

(bad performance comes up with I.P size of 50000 !)

database: task: byte_position

database: binary file size: 3216

database: numbers of start/end values = 804

database: counting active cores in sorted I.P array (size: 1608)...

database: I.P array value:

375127997,375135023,375177121,375177978,375225484,375226331, 375273745,375274563,375320063,375325255,375372479,375377085, 375422115,375422896,375473198,375474058,375517412,375518169, 375561967,375562760,375606301,375607092...

TEST example input/output:

inputs are start and end points on X axis:

16,255

16,255

16,255

100,355

200,455

database: task: TEST

database: binary file size: 20

database: numbers of start/end values = 5

database: sorted I.P array (X axis):

16,16,16,100,200,255,255,255,355,455

each of these points called interesting point (on X axis).

in order to calculate Y values, I draw a vertical line with each I.P.

then I count how many input lines (start/end point) pass though the first vertical line.

the result is the first Y point for the first X I.P

database: counting active cores in sorted I.P array (size: 10)...

output vector:

database: sorted cores array (Y axis):

0,3,4,5,5,2,1,0

In my algorithm I added another point before and after each I.P that start/end a blo开发者_开发百科ck.

that's way you can see zero at the beginning/end !

3 - means there are 3 slicing lines at X point 16

4 - means there are 4 slicing lines at X point 100

5 - (1)means there are 5 slicing lines at X point 200

5 - (2)means there are 5 slicing lines at X point 255

2 - means there are 2 slicing lines at X point 355

1 - means there are 1 slicing lines at X point 455

Hope these extended example makes you more understand the algorithm. :)

I will rebuild the code for readability.

Thanks, Yodar.


A very simple way to improve your performance there entails simply learning better coding practices.

One tenet you are ignoring is this: Do not repeat yourself.

You repeat certain bits of code in there a lot, forcing Perl to evaluate the same expression again and again. An example for this is this string:

"$tasks_name[$i]\_sort\_allvals"

It's used about 10 times in there. That means every single time you use it perl refers to the array, in case it changed, and puts together that string. It may not seem much, but eventually it adds up.

Another example is this:

$values{"$tasks_name[$i]\_sort\_allvals"}->[$k]

It's used 10 times as well and while $k actually changes each loop, for each run through the loop the value of that whole expression is the same. It would be faster to store it in a single scalar at the beginning of the loop and then use that scalar throughout the rest of it, as you avoid forcing perl to resolve a reference 10 times per loop.


I'm having a little trouble following it, but it sounds like you're searching a (conveniently presorted) array of datafile-X-points from the start for my $var (0..whatever) { ... last if $done; } repeatedly for each axis-X-point.

Total Shlemiel the Painter algorithm. This is probably the source of the performance problem. You should try avoiding parts of the search you don't need to do again.


I tried to understand your program but my brain choked a bit. Can you provides us an example input, some explanation of what you want to do, and good example output? From your example I can't make head or tails.

You say this is some valid input:

database: task: TEST

database: binary file size: 20

database: numbers of start/end values = 5

database: sorted I.P array (X axis):

16,16,16,100,200,255,255,255,355,455

database: counting active cores in sorted I.P array (size: 10)...

and this is valid output:

database: sorted cores array (Y axis):

0,3,4,5,5,2,1,0

but I don't understand it. Care to explain what you want to achieve?


You don't need intervals for this task at all. You can just sort starts and ends of intervals and then merge this sorted arrays and count number of opened intervals. It is definitely O(NlogN) algorithm and process 50k intervals in fraction of second.

#!/usr/bin/env perl

use strict;
use warnings;

my (@starts, @ends);

while(<>) {
    chomp;
    my($start, $end) = split',';
    push @starts, $start;
    push @ends, $end;
}

@starts = sort { $a <=> $b } @starts;
@ends = sort { $a <=> $b } @ends;

my $y = 0;
my ($x, @y, @x);
while (@starts) {
    my $x = $starts[0] <= $ends[0] ? $starts[0] : $ends[0];
    while ($starts[0] == $x) {
        $y++;
        shift @starts;
        last unless @starts;
    }
    push @x, $x;
    push @y, $y;
    while ($ends[0] == $x) {
        $y--;
        shift @ends;
        last unless @ends;
    }
}

while (@ends) {
    my $x = $ends[0];
    push @x, $x;
    push @y, $y;
    while ($ends[0] == $x) {
        $y--;
        shift @ends;
        last unless @ends;
    }
}

die "Unbalanced!" if $y;

$" = ',';
print "0,@y,0\n";


Based on what you say here:

In order to create the Y axis data points, I made up with some algorithm that run all over the X_sort_array points and check for each point if it is between all of start-end points individuality . if it is add one for Y_array[x_point] and so on...

and assuming this is your problem, then a modified binary search algorithm will work in O(log n), or approximately 6 steps for n = 1_000_000:

sub binary_range_search {
    my ( $range, $ranges ) = @_;
    my ( $low, $high ) = ( 0, @{$ranges} - 1 );
    while ( $low <= $high ) {

        my $try = int( ( $low + $high ) / 2 );

        $low  = $try + 1, next if $ranges->[$try][1] < $range->[0];
        $high = $try - 1, next if $ranges->[$try][0] > $range->[1];

        return $ranges->[$try];
    }
    return;
}

In this version, you're looking to find whether the range @$range overlaps any of the ranges in @$ranges. You can modify it to look for overlaps of a single point over all ranges. Is that what you're looking for?


May be the better way is to load your data to BD , create indexes on their fields and run SQL like SELECT start, stop FROM X WHERE Y between start, stop;

BD has a good algorithms of binary search. Probably better then yours.

Another approach is convert your data from string to binary and may be use C static array to manipulate it. like

struct point {
   int start, int stop      
}  

point array[8000];
read array form file;
for (int i=0; i<8000; i++)
{
}

You will wonder how speedy it is

0

精彩评论

暂无评论...
验证码 换一张
取 消