Say I have a file in which each line holds a timestamp followed by a name:
10:00:00 Bob
11:00:00 Tom
11:00:20 Fred
11:00:40 George
12:00:00 Bill
I want to read this file, group the names that occur in each hour on a single line, and then write the revised lines to a file. For example:
10:00:00 Bob
11:00:00 Tom, Fred, George
12:00:00 Bill
Given that, per comments on the original question, all entries for the same hour are contiguous and the file is too large to fit into memory, I would dispense with the hash entirely. If the raw file is too big to fit in memory, then a hash containing all of its data will likely also be too large. (Yes, it compresses the data a bit, but the hash itself adds substantial overhead.)
My solution, then:
#!/usr/bin/env perl
use strict;
use warnings;

my $current_hour = -1;
my @names;

while (my $line = <DATA>) {
    my ($hour, $name) = $line =~ /(\d{2}):\d{2}:\d{2} (.*)/;
    next unless defined $hour;    # skip lines that do not match

    # A new hour starts: flush the names collected for the previous one.
    if ($hour != $current_hour) {
        print_hour($current_hour, @names);
        @names = ();
        $current_hour = $hour;
    }
    push @names, $name;
}
print_hour($current_hour, @names);    # flush the final hour
exit;

sub print_hour {
    my ($hour, @names) = @_;
    return unless @names;
    print $hour, ':00:00 ', (join ', ', @names), "\n";
}
__DATA__
10:00:00 Bob
11:00:00 Tom
11:00:20 Fred
11:00:40 George
12:00:00 Bill
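The script above reads from DATA and prints to STDOUT, whereas the question asks about reading one file and writing another. Here is a minimal sketch of the same streaming approach using filehandles. The sub name group_contiguous_hours and the in-memory sample are assumptions made for illustration, and it still relies on all entries for an hour being contiguous.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Stream "HH:MM:SS name" lines from $in and write one line per hour to $out.
# Assumes all entries for the same hour are contiguous in the input.
# (group_contiguous_hours is a hypothetical name, not from the original answer.)
sub group_contiguous_hours {
    my ($in, $out) = @_;
    my ($current_hour, @names);

    # Write the names collected so far (if any) as a single line.
    my $flush = sub {
        return unless @names;
        print {$out} sprintf('%02d:00:00 ', $current_hour), join(', ', @names), "\n";
        @names = ();
    };

    while (my $line = <$in>) {
        # Skip lines that do not look like "HH:MM:SS name".
        my ($hour, $name) = $line =~ /^(\d{1,2}):\d{2}:\d{2}\s+(.+?)\s*$/
            or next;
        $flush->() if defined $current_hour && $hour != $current_hour;
        $current_hour = $hour;
        push @names, $name;
    }
    $flush->();    # flush the final hour
    return;
}

# Demo with an in-memory input handle; output goes to STDOUT.
my $sample = "10:00:00 Bob\n11:00:00 Tom\n11:00:20 Fred\n11:00:40 George\n12:00:00 Bill\n";
open my $in, '<', \$sample or die "Cannot open in-memory handle: $!";
group_contiguous_hours($in, \*STDOUT);
close $in;
```

For real files, open the handles on paths instead, for example open my $in, '<', 'input.txt' and open my $out, '>', 'output.txt' (both file names are placeholders). Only one hour's names are held in memory at a time.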
In grouped_by_hour below, for each line from the filehandle, if it has a timestamp and a name, we push that name onto an array associated with the timestamp's hour, using sprintf to normalize the hour in case one timestamp is 03:04:05 and another is 3:9:18.
sub grouped_by_hour {
    my($fh) = @_;
    local $_;
    my %hour_names;
    while (<$fh>) {
        push @{ $hour_names{sprintf "%02d", $1} } => $2
            if /^(\d+):\d+:\d+\s+(.+?)\s*$/;
    }
    wantarray ? %hour_names : \%hour_names;
}
The normalized hours also allow us to sort with the default comparison. The code below places the input in the special DATA filehandle by having it after the __DATA__ token, but in real code, you might call grouped_by_hour $fh.
my %hour_names = grouped_by_hour \*DATA;
foreach my $hour (sort keys %hour_names) {
    print "$hour:00:00 ", join(", " => @{ $hour_names{$hour} }), "\n";
}
__DATA__
10:00:00 Bob
11:00:00 Tom
11:00:20 Fred
11:00:40 George
12:00:00 Bill
Output:
10:00:00 Bob
11:00:00 Tom, Fred, George
12:00:00 Bill
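To illustrate the two details mentioned above, the sprintf normalization and the wantarray return, here is a small sketch that feeds grouped_by_hour a mix of padded and unpadded timestamps and calls it in scalar context. The sample names Ann and Zed and the in-memory filehandle are assumptions made for the demo.

```perl
use strict;
use warnings;

sub grouped_by_hour {
    my ($fh) = @_;
    local $_;
    my %hour_names;
    while (<$fh>) {
        # Zero-pad the hour so "3" and "03" share one key.
        push @{ $hour_names{ sprintf "%02d", $1 } }, $2
            if /^(\d+):\d+:\d+\s+(.+?)\s*$/;
    }
    return wantarray ? %hour_names : \%hour_names;
}

# Hypothetical sample data with inconsistent zero-padding.
my $sample = "03:04:05 Ann\n3:9:18 Zed\n10:00:00 Bob\n";
open my $fh, '<', \$sample or die "Cannot open in-memory handle: $!";

# Scalar context: wantarray is false, so we get a hash reference back.
my $by_hour = grouped_by_hour($fh);
for my $hour (sort keys %$by_hour) {
    print "$hour:00:00 ", join(', ', @{ $by_hour->{$hour} }), "\n";
}
```

Ann and Zed end up under the single key 03, and the zero-padded keys sort correctly as plain strings.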
Read the file line by line in a block like this:
while (<>) {
    # ... do something with the line in $_
    # specifically, collect the hour and name
    # ignoring malformed lines
    if (/(\d\d):\d\d:\d\d\s+(\w+)/) {
        my $hour = $1;
        my $name = $2;
    }
}
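and build a hash keyed by the hour by inserting the following in the inner if block (the defined check avoids a leading ", " and an uninitialized-value warning on the first name of each hour):
$people{$hour} = defined $people{$hour} ? "$people{$hour}, $name" : $name;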
Finally, outside the loop, print the hash, sorting the keys so that the hours come out in order (each returns them in no particular order):
for my $time (sort keys %people) {
    print $time . ":00:00 " . $people{$time} . "\n";
}
(This is untested, but this is the basic approach I would take.)
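The fragments above can be assembled into one runnable script. This is a sketch under some assumptions: the sub name collect_people and the in-memory sample are made up for the demo, the first name for each hour is stored without a leading separator, and the hours are printed in sorted order.

```perl
use strict;
use warnings;

# Build a hash of hour => "Name1, Name2, ..." from a filehandle.
# (collect_people is a hypothetical name used only for this sketch.)
sub collect_people {
    my ($fh) = @_;
    my %people;
    while (<$fh>) {
        # Collect the hour and name, ignoring malformed lines.
        if (/(\d\d):\d\d:\d\d\s+(\w+)/) {
            my ($hour, $name) = ($1, $2);
            $people{$hour} = defined $people{$hour}
                ? "$people{$hour}, $name"
                : $name;
        }
    }
    return %people;
}

my $sample = "10:00:00 Bob\n11:00:00 Tom\n11:00:20 Fred\n11:00:40 George\n12:00:00 Bill\n";
open my $fh, '<', \$sample or die "Cannot open in-memory handle: $!";
my %people = collect_people($fh);

for my $time (sort keys %people) {
    print "$time:00:00 $people{$time}\n";
}
```

Like the other hash-based answers, this keeps every name in memory, so it suits inputs that fit comfortably in RAM.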
Here's a full solution:
my @readings = (
    "10:00:00 Bob",
    "11:00:00 Tom",
    "11:00:20 Fred",
    "11:00:40 George",
    "12:00:00 Bill",
);
my %hours;
for my $line (@readings) {
    # Skip lines that do not match, so stale $1/$2 are never reused.
    next unless $line =~ /^(\d{2}).*?([a-zA-Z]+)/;
    push(@{$hours{$1}}, $2);
}
for my $hour (sort keys %hours) {
    print "$hour:00:00 ";
    print join ", ", @{$hours{$hour}};
    print "\n";
}
This results in:
10:00:00 Bob
11:00:00 Tom, Fred, George
12:00:00 Bill