I have a text file filled with sentences with unique pattern. The unique pattern is:
NAME [ e_NAME ]
simple rule: the "NAME" must follow after "e_" if the "e_" appearers inside the brackets!
The problem comes out when the string is complicated. I'll show the end poi开发者_JAVA百科nt situations that may be hard to analyse:
Lines that won't match the rule:
(1) NAME1[blabla+NAME2[blabla+e_BAD2]+e_NAME1] (2) NAME1[blabla] + NAME2[e_BAD2] (3) NAME1[NAME2[blabla+e_BAD2]] + NAME3[e_BAD3] (4) NAME1[e_NAME1BAD1] -> means it has to be only NAME1
Lines that match the rule:
(1) FOO1[blabla + 1] (2) [blalbla] + bla (3) bla + blabla (4) FOO1[ccc + ddd + FOO2[e_FOO2]] = 123 (5) FOO1[cc + FOO2[ dd ] ] + FOO3[e_FOO3]
I already asked this question but I couldn't catch this end points...
Edited after requirements were clarified
Either Text::Balanced or Regexp::Common might be useful. I initially posted an answer using the former but didn't like it very much. The following example uses Regexp::Common
and seems fairly straightforward.
use strict;
use warnings;
use Regexp::Common;
my $PRE = '[^[]*?';
my $VAR = '\w+';
my $BRACK = $RE{balanced}{-parens=>'[]'};
my $POST = '.*';
while (<DATA>){
my ($bad, $full);
# Brackets, if any, must balance
$bad = 1 unless s/\[/[/g == s/\]/]/g;
$full = $_;
until ($bad){
# Find some bracketed text and store all components.
my ($pre, $var, $brack, $post) =
$full =~ /^($PRE)($VAR)($BRACK)($POST)$/;
last unless defined $brack;
# Create a copy of the bracketed text, removing both the outer
# brackets and all instances of inner-bracketed text.
chop (my $clean = substr $brack, 1);
$clean =~ s/$BRACK/ /g;
# If e_FOO exists, FOO must equal $var.
$bad = 1 if $clean =~ /e_(\w+)/ and $1 ne $var;
# Remove the part of $full we've already checked.
substr($full, 0, length($pre) + length($var) + 1, '');
}
print if $bad;
}
# Your test data, with some trailing comments.
__DATA__
NAME1[blabla+NAME2[blabla+e_BAD2]+e_NAME1] NOT OK 1
NAME1[blabla] + NAME2[e_BAD2] NOT OK 2
NAME1[NAME2[blabla+e_BAD2]] + NAME3[e_BAD3] NOT OK 3
NAME1[e_NAME1BAD1] NOT OK 4
FOO1[blabla + 1] OK 1
[blalbla] + bla OK 2
bla + blabla OK 3
FOO1[ccc + ddd + FOO2[e_FOO2]] = 123 OK 4
FOO1[cc + FOO2[ dd ] ] + FOO3[e_FOO3] OK 5
Maybe you are looking for something like:
if ($string =~ /(\w+)\[e\\_(\w+)/ && $1 eq $2) {
print "Pattern '$1' contained in string '$string'\n";
}
Based on the accepted answer to your first question, I came up with this:
use strict;
use warnings;
while (<DATA>) {
my $l = $_;
while (s/(\w+)\[([^\[\]]*)\]//) {
my ($n, $chk) = ($1, $2);
unless ($chk =~ /\be_$n\b/) {
warn "Bad line: $l";
last;
}
}
}
The \b
checks for a word boundary. This version still doesn't check for unbalanced brackets, but it does seem to catch all the examples you gave, and will also complain when the e_NAME1
is inside another nested block, like so:
NAME1[stuff + NAME2[e_NAME1 + e_NAME2] + morestuff]
use Text::Balanced;
CPAN is wonderful.
精彩评论