How do I merge specific columns from files in array or hash of multiple file handles, one line at a time?

Developers https://www.devze.com 2023-02-15 19:44 Source: the web

I'll start by describing the files I am working with:

./groupA
    ./groupA/fileA.txt
    ./groupA/fileB.txt
    ./groupA/fileC.txt
    ./groupA/fileD.txt

./groupB
    ./groupB/fileA.txt
    ./groupB/fileB.txt
    ./groupB/fileC.txt

etc.

Here is what I would like to do:

  1. I have a hash or array of file handles for each groupI, pointing to very large tab-delimited text files fileJ, each several hundred MB in size.

  2. I would like to loop through the file handles, reading in one tab-delimited line at a time. I cannot read all the files' lines into memory.

  3. Once I finish looping through the file handles, I then would like to split each line, grab a specific column of data from each split-array (fifth field, for example), and merge the data into a line of output.

  4. Repeat step 2 — grabbing one line from each file handle — until EOF.

I will then end up with groupA/mergedOutput.mtx, groupB/mergedOutput.mtx, etc.

The problem is that I don't know how to do steps 2 and 3 correctly.

Here is the code I have so far:

#!/usr/bin/perl

use strict;
use warnings;
use File::Glob qw(glob);
use Data::Dumper;

my @groups = qw(groupA groupB groupC);
my ($mergedOutputFn, %handles);

foreach my $group (@groups) {
    $mergedOutputFn = "$group/mergedOutput.mtx";

    # Step 1:
    # Make hash table of file handles

    foreach my $inputFn (glob("$group/*.txt")) {
        open my $handle, '<', $inputFn or die "could not open $inputFn: $!\n";
        $handles{$inputFn} = $handle;
    }

    # Steps 2 and 3:
    # Grab a line from each file handle
    # Repeat until EOF

    while(1) {
        my @mergedOutputLineElements = ();
        foreach (sort keys %handles) {
            my $handle = $handles{$_};
            my $line = <$handle>;
            chomp($line);
            my @lineElements = split("\t", $line);
            push (@mergedOutputLineElements, $lineElements[4]);
            last if (! defined $line); # jump out of while loop
        }
        print Dumper join("\t", @mergedOutputLineElements);
    }

    # Step 4:
    # Close handles

    foreach (sort keys %handles) {
        close $handles{$_};
    } 
}

One issue seems to be that the following code doesn't work:

foreach (sort keys %handles) {
    my $handle = $handles{$_};
    my $line = <$handle>;
    ...
}

If I try to print out the value of $line, then I get a GLOB value:

print Dumper $line;
...
GLOB(0x1d769f80)

How am I mishandling $line, or is there an easier way to do this within Perl?

Thanks for your advice.

EDIT

Here is the fixed code:

#!/usr/bin/perl

use strict;
use warnings;
use File::Glob qw(glob);

my @groups = qw(groupA groupB groupC);

foreach my $group (@groups) {
    my $mergedOutputFn = "$group/mergedOutput.mtx";
    open my $merge, '>', $mergedOutputFn
        or die "could not open handle to $mergedOutputFn: $!\n";

    # Step 1:
    # Make hash table of file handles (a fresh hash for each group,
    # so handles from the previous group are not read again)

    my %handles;
    foreach my $inputFn (glob("$group/*.txt")) {
        open my $handle, '<', $inputFn or die "could not open $inputFn: $!\n";
        $handles{$inputFn} = $handle;
    }

    # Steps 2 and 3:
    # Grab a line from each file handle
    # Repeat until EOF

    LINE: while(1) {
        my @mergedOutputLineElements = ();
        foreach (sort keys %handles) {
            my $handle = $handles{$_};
            my $line = readline $handle;
            last LINE if (! defined $line); # jump out of while loop
            chomp($line);
            my @lineElements = split("\t", $line);
            push (@mergedOutputLineElements, $lineElements[4]);
        }
        # One merged row per input line; terminate it with a newline
        print {$merge} join("\t", @mergedOutputLineElements), "\n";
    }

    # Step 4:
    # Close handles

    foreach (sort keys %handles) {
        close $handles{$_};
    }

    close $merge;
}
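
For anyone who wants to reuse the loop, here is a rough sketch of the same lockstep read packaged as a subroutine. The name merge_column and its arguments are hypothetical, not part of the script above; it assumes Perl 5.10 or later (for the // operator), takes a 0-based column index, and, like the loop above, stops as soon as any input file reaches EOF.

```perl
use strict;
use warnings;

# merge_column is a hypothetical helper, not part of the original script.
# It reads one line at a time from each input file and writes the chosen
# column (0-based index) from every file as one tab-separated output line.
sub merge_column {
    my ($paths, $out, $col) = @_;

    # Open every input once; only one line per file is held in memory.
    my @handles = map {
        open my $fh, '<', $_ or die "could not open $_: $!\n";
        $fh;
    } sort @$paths;

    LINE: while (1) {
        my @fields;
        for my $fh (@handles) {
            my $line = readline $fh;
            last LINE unless defined $line;    # stop at the first EOF
            chomp $line;
            # Limit -1 keeps trailing empty fields; a missing column becomes ''
            push @fields, (split /\t/, $line, -1)[$col] // '';
        }
        print {$out} join("\t", @fields), "\n";
    }

    close $_ for @handles;
    return;
}
```

It could be called as merge_column([glob("groupA/*.txt")], $merge, 4), where $merge is an output handle opened for writing. Taking an array of paths instead of a hash keeps the column order explicit (sorted file names).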

Thanks for the tips!


You can read from filehandles like this:

foreach (sort keys %handles) {
    my $line = readline $handles{$_};
    ...
}
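
As for where a GLOB(0x...) value can come from: as far as I know, <...> only means readline when it wraps a bare filehandle or a simple scalar variable. Anything more complex, such as a hash element, is parsed as glob(). The small sketch below assumes the handle was read straight from the hash element, as in <$handles{$_}> (the listing in the question copies it into $handle first, so this is a guess at the code that was actually run). The in-memory file and the 'demo' key are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical in-memory "file", so the sketch needs no files on disk.
my $data = "a\tb\tc\td\te\n";
open my $fh, '<', \$data or die "could not open in-memory file: $!";
my %handles = (demo => $fh);

# <...> around a hash element is parsed as glob(), not readline().
# The handle is stringified and returned as a pattern, e.g. GLOB(0x...).
my ($globbed) = <$handles{demo}>;

# readline() accepts any expression that evaluates to a filehandle.
my $line = readline $handles{demo};
chomp $line;

# Fifth tab-delimited field (index 4)
my $fifth = (split /\t/, $line)[4];
```

Copying the handle into a plain scalar first and using <$handle> should behave the same as readline here.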