I am now facing a problem of summing up a numeric column with a sliding window defined by values in another column.
(1) My data are tab-delimited, with two numeric columns:
1000 12
2000 10
3000 9
5000 3
9000 5
10000 90
30000 20
31000 32
39000 33
40000 28
(2) I want to sum up 2nd column with a window defined by 1st column, window size is (1st column + 3000). This means I need to add the 3rd column (3rd column = sum(all values of the 2nd columns in rows of 1st column to 1st cloumn+3000)).
It looks like this:
1000 12 12+10+9
2000 10 10+9+3
3000 9 9
5000 3 3
9000 5 5+90
10000 90 90
30000 20 20+32
31000 32 32
39000 33 33
40000 28
(3) I am new for programming. I tried awk, but I am failed.
I do not know how to control the window for the 1st column
awk '(i = 1; i <= NR; i++) { sum[i]+=$2 } END {print $1,$2,sum}' myd开发者_Python百科ata
Could anyone give me any advice/directions on the problem? Thanks in advance.
Best,
I am not very good with awk, but here is something I hacked together in perl, which should also run if you are on a unix system. Assuming you save it as a file called window.pl:
#!/usr/bin/perl -w
use strict;
# Usage: window.pl < [filepath or text stream]
# Example: window.pl < window.txt
my $window = 3000;
my @lines = <STDIN>;
my $i = 0;
my $last_line = $#lines;
# Start reading each line
while ($i<= $last_line)
{
my $current_line = $lines[$i];
my ($col1, $col2) = ( $current_line =~ /(\d+)\s+(\d+)/ );
my $ubound = $col1 + $window;
my @sums = $col2;
my $lookahead = $i + 1;
# Start looking at subsequent lines within the window
while ($lookahead <= $last_line)
{
my $next_line = $lines[$lookahead];
my ($c1, $c2) = ( $next_line =~ /(\d+)\s+(\d+)/ );
if ($c1 <= $ubound)
{
push @sums, $c2;
++$lookahead;
}
else
{
last;
}
}
my $output;
if ( $#sums > 0 )
{
my $sum = join "+", @sums;
$output = "$col1 $sum\n";
}
else
{
$output = "$col1 $col2\n";
}
print $output;
++$i;
}
Output:
1000 12+10+9
2000 10+9+3
3000 9+3
5000 3
9000 5+90
10000 90
30000 20+32
31000 32
39000 33+28
40000 28
This only works if the input file is small enough to read into memory, but maybe this will help you anyways.
Good luck!
Here is a Perl solution:
use warnings;
use strict;
my (%data, @ids);
while (<DATA>) { # read in the data
/^(\d+)\s+(\d+)$/ or die "bad input: $_";
push @ids, $1;
$data{$1} = [$2]
}
for (0 .. $#ids) { # slide window over data
my ($i, $id) = ($_ + 1, $ids[$_]);
push @{$data{$id}}, $data{ $ids[$i++] }[0]
while $i < @ids and $ids[$i] <= $id + 3000;
}
$" = '+'; #"
print "$_: @{$data{$_}}\n" for @ids;
__DATA__
1000 12
2000 10
3000 9
5000 3
9000 5
10000 90
30000 20
31000 32
39000 33
40000 28
Which prints:
1000: 12+10+9 2000: 10+9+3 3000: 9+3 5000: 3 9000: 5+90 10000: 90 30000: 20+32 31000: 32 39000: 33+28 40000: 28
This is not something that really any of the languages are really adept at, and in fact what you're asking is a reasonably challenging programming task, especially for a novice.
Nonetheless, here's an awk script for you:
BEGIN {
window = 3000;
}
function push(line, sum, n) {
n = length(lines);
lines[n] = line;
sums[n] = sum;
}
function pop( n, i) {
n = length(lines);
if (n > 1) {
for(i = 0; i < n - 1; i++) {
lines[i] = lines[i + 1];
sums[i] = sums[i + 1];
}
}
if (n > 0) {
delete lines[n - 1];
delete sums[n - 1];
}
}
{
cur_line = $1;
value = $2;
n = length(lines);
pops = 0;
for (i = 0; i < n; i++) {
if (lines[i] + window < cur_line) {
print "Sum for " lines[i] " = " sums[i];
pops++;
}
}
for (i = 0; i < pops; i++) {
pop();
}
push(cur_line, 0);
n = length(lines);
for (i = 0; i < n; i++) {
sums[i] = sums[i] + value;
}
}
END {
n = length(lines);
for (i = 0; i < n; i++) {
if (lines[i] < cur_line + window) {
print "Sum for " lines[i] " = " sums[i];
}
}
}
And here's a run on your sample data:
Sum for 1000 = 31
Sum for 2000 = 22
Sum for 3000 = 12
Sum for 5000 = 3
Sum for 9000 = 95
Sum for 10000 = 90
Sum for 30000 = 52
Sum for 31000 = 32
Sum for 39000 = 61
Sum for 40000 = 28
Here's a slightly more compact version of a solution:
#!/usr/bin/perl
use strict;
use warnings;
use constant WIN_SIZE => 3000;
my @pending;
while (<>) {
my ($pos, $val) = split;
# Store line info, sum, and when to stop summing
push @pending, { pos => $pos,
val => $val,
limit => $pos + WIN_SIZE,
sum => 0 };
show($_) for grep { $_->{limit} < $pos } @pending; # Show items beyond window
@pending = grep { $_->{limit} >= $pos } @pending; # Keep items still in window
$_->{sum} += $val for @pending; # And continue their sums
}
# and don't forget those items left within the window when the data ran out
show($_) for @pending;
sub show {
my $pending = shift;
print join("\t", $pending->{pos}, $pending->{val}, $pending->{sum}), "\n";
}
Just drop it in a script and give you datafile on the same line, e.g.:
$ perl script.pl mydata
1000 12 31
2000 10 22
3000 9 12
5000 3 3
9000 5 95
10000 90 90
30000 20 52
31000 32 32
39000 33 61
40000 28 28
精彩评论