I have the following input to a Perl script, and I want to get the first occurrence of a NAME="..." string in each <table>...</table> structure.
The entire file is read into a single string and the regex acts on that input.
However, the regex always returns the last occurrence of the NAME="..." strings. Can anyone explain what is going on and how this can be fixed?
Input file:
ADSDF
<TABLE>
NAME="ORDERSAA"
line1
line2
NAME="ORDERSA"
line3
NAME="ORDERSAB"
</TABLE>
<TABLE>
line1
line2
NAME="ORDERSB"
line3
</TABLE>
<TABLE>
line1
line2
NAME="ORDERSC"
line3
</TABLE>
<TABLE>
line1
line2
NAME="ORDERSD"
line3
line3
line3
</TABLE>
<TABLE>
line1
line2
NAME="QUOTES2"
line3
NAME="QUOTES3"
NAME="QUOTES4"
line3
NAME="QUOTES5"
line3
</TABLE>
<TABLE>
line1
line2
NAME="QUOTES6"
NAME="QUOTES7"
NAME="QUOTES8"
NAME="QUOTES9"
line3
line3
</TABLE>
<TABLE>
NAME="MyName IsKhan"
</TABLE>
Perl Code starts here:
use warnings;
use strict;
my $nameRegExp = '(<table>((NAME="(.+)")|(.*|\n))*</table>)';
sub extractNames($$){
my ($ifh, $ofh) = @_;
my $fullFile;
read ($ifh, $fullFile, 1024);#Hardcoded to read just 1024 bytes.
while( $fullFile =~ m#$nameRegExp#gi){
print "found: ".$4."\n";
}
}
sub main(){
if( ($#ARGV + 1 )!= 1){
die("Usage: extractNames infile\n");
}
my $infileName = $ARGV[0];
my $outfileName = $ARGV[1];
open my $inFile, "<$infileName" or die("Could not open log file $infileName");
my $outFile;
#open my $outFile, ">$outfileName" or die("Could not open log file $outfileName");
extractNames( $inFile, $outFile );
close( $inFile );
#close( $outFile );
}
#call
main();
Try this:
'(?><TABLE>\n+(?:(?!</TABLE>|NAME=).*\n+)*)NAME="([^"]+)"'
The (?:.*\n+)* consumes any unwanted lines, while the embedded lookahead, (?!</TABLE>|NAME=), keeps it from overrunning the first NAME field or the end of the TABLE record. Just in case there's a record with no NAME field, I wrapped most of the expression in an atomic group, (?>...), to prevent pointless backtracking.
Notice that there's only one capturing group now. It's good practice to use them only when you really need to capture something; otherwise, use the non-capturing variety: (?:...).
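Here is a minimal sketch of how that pattern might be used in a //g loop, assuming the whole input is already in one string. The helper name first_names and the shortened sample data are made up for illustration; the second record deliberately has no NAME line.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first NAME value of each <TABLE> record.
sub first_names {
    my ($text) = @_;
    my $re = qr{(?><TABLE>\n+(?:(?!</TABLE>|NAME=).*\n+)*)NAME="([^"]+)"};
    my @names;
    # In scalar context, //g resumes after the previous match on each pass.
    while ($text =~ /$re/g) {
        push @names, $1;
    }
    return @names;
}

# Shortened, made-up sample in the same shape as the input file.
my $sample = <<'END';
<TABLE>
NAME="ORDERSAA"
line1
NAME="ORDERSA"
</TABLE>
<TABLE>
line1
</TABLE>
<TABLE>
line1
NAME="ORDERSB"
line3
</TABLE>
END

print "found: $_\n" for first_names($sample);
```

On this sample the loop should report ORDERSAA and ORDERSB, and the record without a NAME line is skipped.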
EDIT: As to why your regex didn't work, the short answer is greediness. After matching the opening tag, this part takes over:
((NAME="(.+)")|(.*|\n))*
The part in the outermost parens can match anything: tags, NAME= lines, linefeeds, even empty lines. Wrap that in a group controlled by a greedy *, and now it matches everything. There's nothing in there to make it stop matching at the first NAME field, or even at the end of a record.
So it's actually "finding" every occurrence of NAME="..." strings, but it's doing it in a single match attempt that consumes the entire input at once. With each iteration of the enclosing *, the capture groups are overwritten; when it's done, the final NAME value, MyName IsKhan, is what happens to be left in group 4.
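The overwriting behaviour can be seen in isolation with a much smaller pattern. This is only an illustrative sketch with made-up data, not the regex from the question: a capture group inside a repeated group keeps just the value from its last iteration.

```perl
use strict;
use warnings;

# Made-up input: three NAME lines in a row.
my $text = qq{NAME="A"\nNAME="B"\nNAME="C"\n};

# A single match attempt; the outer group repeats three times,
# and group 1 is overwritten on every iteration.
my $last;
if ($text =~ /^(?:NAME="([^"]+)"\n)*/) {
    $last = $1;
    print "group 1 after one match: $last\n";
}
```

Only C is left in $1 here, for the same reason that only the final NAME value is left in group 4 of the original regex.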
I used a negative lookahead to keep the greediness in check, but you can also do that more directly, by using a non-greedy quantifier. Here's how my regex would look with a reluctant * in place of the negative lookahead:
'<TABLE>\n+(?:.*\n+)*?NAME="([^"]+)"'
Simply switching to a non-greedy quantifier wouldn't help with your regex though; you'd have to make some structural changes as well.
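As a rough comparison, the sketch below runs the lookahead version and the reluctant version over the same made-up sample. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $lookahead = qr{(?><TABLE>\n+(?:(?!</TABLE>|NAME=).*\n+)*)NAME="([^"]+)"};
my $reluctant = qr{<TABLE>\n+(?:.*\n+)*?NAME="([^"]+)"};

# Made-up sample with several NAME lines per record.
my $sample = qq{<TABLE>\nline1\nNAME="QUOTES2"\nline3\nNAME="QUOTES3"\n</TABLE>\n}
           . qq{<TABLE>\nNAME="QUOTES6"\nNAME="QUOTES7"\n</TABLE>\n};

# In list context with /g, a pattern with one capture group returns every $1.
my @with_lookahead = $sample =~ /$lookahead/g;
my @with_reluctant = $sample =~ /$reluctant/g;

print "lookahead: @with_lookahead\n";
print "reluctant: @with_reluctant\n";
```

Both lists should contain QUOTES2 and QUOTES6 for this input.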
First of all, it's a bad idea to parse XML with regular expressions. Second, you need to change your regex to the following:
my $nameRegExp = '(<table>((NAME="(.+)?")|(.*?|\n))*?</table>)';
This way the regex becomes non-greedy and should return the first occurrence.
Try making your regex non-greedy:
my $nameRegExp = '(<table>((NAME="(.+?)")|(.*?|\n))*</table>)';
Even the above regex will not list all the NAME lines in the file. It will list just one NAME line (the last one) from each <TABLE>...</TABLE> block.
To list all the NAME lines you can do:
my $nameRegExp = 'NAME="(.+?)"';
and print $1.
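A small sketch of that loop, assuming the text is already in a string. The sample data is made up; the m#...#gi form mirrors the loop in the question.

```perl
use strict;
use warnings;

# Made-up sample with two NAME lines in one record.
my $text = qq{<TABLE>\nNAME="QUOTES6"\nNAME="QUOTES7"\nline3\n</TABLE>\n};
my $nameRegExp = 'NAME="(.+?)"';

my @all;
# Each pass of the //g loop finds the next NAME="..." anywhere in the text.
while ($text =~ m#$nameRegExp#gi) {
    push @all, $1;
    print "found: $1\n";
}
```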
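# Treat each block ending in </TABLE> as one record.
$/ = '</TABLE>';
while (<>) {
    chomp;                            # strip the trailing </TABLE>
    my @F = split "\n";               # one field per line
    my ($g, $v) = (0);                # $g: found flag, $v: value found
    for my $o (0 .. $#F) {
        if ($F[$o] =~ /^NAME=/) {
            $F[$o] =~ s/^NAME=//;     # keep what follows the = sign
            $v = $F[$o];
            $g = 1;
            last;                     # stop at the first NAME line
        }
    }
    if ($g) { print $v."\n"; }
}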
Output:
$ perl myscript.pl file
"ORDERSAA"
"ORDERSB"
"ORDERSC"
"ORDERSD"
"QUOTES2"
"QUOTES6"
"MyName IsKhan"
The whole gist of it: use </TABLE> as the record separator and newline as the field separator. Go through each field and look for NAME=. If found, substitute and get the string after the = sign.
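The same record-separator idea can probably be written more compactly under use strict. This is only a sketch: it swaps the field loop for a single /m match per record, and it reads made-up data from an in-memory filehandle so that it runs on its own.

```perl
use strict;
use warnings;

# Made-up data; ORDERSX is a hypothetical second NAME line.
my $data = qq{<TABLE>\nline1\nNAME="ORDERSB"\nNAME="ORDERSX"\n</TABLE>\n}
         . qq{<TABLE>\nNAME="MyName IsKhan"\n</TABLE>\n};
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";

my @first;
{
    local $/ = '</TABLE>';    # one <TABLE> block per read
    while (my $record = <$fh>) {
        # /m lets ^ and $ match at line boundaries inside the record;
        # without /g, only the first NAME line is taken.
        if ($record =~ /^NAME=(.*)$/m) {
            push @first, $1;
            print "$1\n";
        }
    }
}
close $fh;
```

As in the loop version, the printed values keep their surrounding double quotes.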