I have a text file that I extracted from a PDF file. It's arranged in a tabular format; this is part of it:
DATE SESS PROF1 PROF2 COURSE SEC GRADE COUNT
2007/09 1 RODRIGUEZ TANIA DACSB 06500 001 A 3
2007/09 1 RODRIGUEZ TANIA DACSB 06500 001 A- 2
2007/09 1 RODRIGUEZ TANIA DACSB 06500 001 B 4
2007/09 1 RODRIGUEZ TANIA DACSB 06500 001 B+ 2
2007/09 1 RODRIGUEZ TANIA DACSB 06500 001 B- 1
2007/09 1 RODRIGUEZ TANIA DACSB 06500 001 WU 1
2007/09 1 NOOB ADRIENNE JOSH ROGER DBIOM 10000 125 C+ 1
2007/09 1 NOOB ADRIENNE JOSH ROGER DBIOM 10000 125 C+ 1
2007/09 1 FUENTES TANIA DACSB 06500 002 A 3
2007/09 1 FUENTES TANIA DACSB 06500 002 A- 8
2007/09 1 FUENTES ALEXA DACSB 06500 002 B 5
2007/09 1 FUENTES ALEXA DACSB 06500 002 B+ 3
2007/09 1 FUENTES ALEXA DACSB 06500 002 B- 1
2007/09 1 FUENTES ALEXA DACSB 06500 002 C 1
2007/09 1 FUENTES ALEXA DACSB 06500 002 C+ 1
2007/09 1 LIGGINS FREDER DACSB 06500 003 A 1
The first line holds the column names, and the remaining lines are the data.
There are 8 columns that I want to extract. At first it seemed very easy: just call split(/\s+/, ...)
on each line I read. But then I noticed that some lines contain additional spaces, for example:
2007/09 1 NOOB ADRIENNE JOSH ROGER DBIOM 10000 125 C+ 1
Sometimes the data for a certain column is optional, as you can see.
The problem is complex, but it's not unsolvable. It seems to me that course will always contain a space between the alpha code and the numeric code and that the prof names will also always contain a space. But then you're pretty much screwed if somebody has a two-part last name like "VAN DYKE".
A regex would describe this record:
my $record_exp
    = qr{ ^ \s*
          (\d{4}/\d{2})          # yyyy/mm date
          \s+
          (\d+)                  # one or more digits
          \s+
          (\S+ \s \S+)           # non-space cluster, single space, non-space cluster
          \s+
          # same as the last, possibly not there; the separating spaces are
          # included in the optional group, because we have to make sure the
          # match will start right at the next rule.
          (?:(\S+ \s \S+)\s+)?
          # a cluster of alpha, single space, cluster of digits
          (\p{Alpha}+ \s \d+)
          \s+                    # one or more spaces
          (\S+)                  # one or more non-space characters
          \s+                    # ditto..
          (\S+)
          \s+
          (\S+)
        }x;
Which makes the loop a lot easier:
while ( <$input> ) {
my @fields = m{$record_exp};
# ... list of semantic actions here...
}
But you could also store it into structures, knowing that the only variable part of the data is the profs:
use strict;
use warnings;
my @records;
<$input>; # bleed the first line (the column headers)
while ( <$input> ) {
    my @fields = split; # split on white-space
    my $record = { date => shift @fields };
    $record->{session} = shift @fields;
    $record->{profs} = [ join( ' ', splice( @fields, 0, 2 )) ];
    # anything beyond the last five tokens is another two-token prof name
    while ( @fields > 5 ) {
        push @{ $record->{profs} }, join( ' ', splice( @fields, 0, 2 ));
    }
    # splice in scalar context returns only the last element removed, so join the pair
    $record->{course} = join( ' ', splice( @fields, 0, 2 ));
    @$record{ qw<sec grade count> } = @fields;
    push @records, $record;
}
I believe it's ambiguous:
If PROF1 can contain spaces, how do you know where it ends and where PROF2 begins? What if PROF2 also contains a space? Or three spaces?
You probably can't even tell yourself, and if you can, it's because you can tell the difference between a first name and a surname.
If you're on Linux/Unix, try running pdftotext on the PDF (the -layout option keeps the original column alignment); it might give you better results.
Looks to me like the first four and last five whitespace-separated tokens are always present, and the 5th and 6th tokens (prof2) are optional.
So split the line as you were attempting, pull the first four and last five elements off the resulting array, and whatever remains is your prof2 column.
If, however, either the prof1 or the prof2 entry can be missing, you're stuck: your file format is ambiguous.
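The approach above could be sketched roughly as follows. The helper name split_record and the hash keys are made up for illustration, and the code assumes that PROF1 and COURSE are always exactly two tokens each:

```perl
use strict;
use warnings;

# Hypothetical helper: split one data line into its eight columns,
# assuming PROF1 and COURSE are always exactly two tokens each.
sub split_record {
    my ($line) = @_;
    my @tokens = split ' ', $line;
    return if @tokens < 9;    # too short to hold the mandatory columns

    my ( $date, $sess, @prof1 ) = splice @tokens, 0, 4;    # first four tokens
    my ( $sec, $grade, $count ) = splice @tokens, -3;      # last three tokens
    my @course = splice @tokens, -2;                       # the two tokens before those

    return {
        date   => $date,
        sess   => $sess,
        prof1  => "@prof1",
        prof2  => ( @tokens ? "@tokens" : undef ),    # whatever is left over
        course => "@course",
        sec    => $sec,
        grade  => $grade,
        count  => $count,
    };
}

my $rec = split_record('2007/09 1 NOOB ADRIENNE JOSH ROGER DBIOM 10000 125 C+ 1');
print "$rec->{prof1} | $rec->{prof2} | $rec->{course}\n";
```

Because everything left in the middle is treated as prof2, a three-part name in either prof column would still be misassigned; the sketch does nothing to resolve that ambiguity.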
There is nothing that says you must use only a single regex. You can go prune off bits of your line in chunks if that makes it easier to handle the weird parts.
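As a rough sketch of pruning the line in chunks: strip the fixed-shape fields off the front and the back with separate substitutions, then treat what is left as professor names. The helper name prune_record and the hash keys are hypothetical, and the last step assumes every name is a pair of tokens:

```perl
use strict;
use warnings;

# Hypothetical helper: peel fixed-shape fields off both ends of the line,
# leaving only the professor names in the middle.
sub prune_record {
    my ($line) = @_;
    my %rec;

    # Front: yyyy/mm date and session number.
    $line =~ s{^\s*(\d{4}/\d{2})\s+(\d+)\s+}{} or return;
    @rec{qw(date sess)} = ( $1, $2 );

    # Back: course code and number, section, grade, count.
    $line =~ s{(?:^|\s+)(\p{Alpha}+\s+\d+)\s+(\S+)\s+(\S+)\s+(\d+)\s*$}{} or return;
    @rec{qw(course sec grade count)} = ( $1, $2, $3, $4 );

    # Middle: whatever remains, assumed to be two-token names.
    $rec{profs} = [ $line =~ /(\S+\s+\S+)/g ];
    return \%rec;
}

my $rec = prune_record("2007/09 1 NOOB ADRIENNE JOSH ROGER DBIOM 10000 125 C+ 1\n");
print join( ' / ', @{ $rec->{profs} } ), "\n";
```

Each substitution only has to describe one easy part of the record, which keeps the individual patterns short and lets a malformed line fail at an identifiable step.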
I would probably still use split(), but then access the data thusly:
my @values = split ' ', $string;           # ' ' also skips leading whitespace
my $date = $values[0];
my $sess = $values[1];
my $count = $values[-1];
my $grade = $values[-2];
my $sec = $values[-3];
my $course = join ' ', @values[-5, -4];    # alpha code plus numeric code
my @profs = @values[2..($#values-5)];
With this construct you don't have to worry about how many profs you have. Even if you have none, the other values will all work fine (and you'll get an empty array for your profs).