开发者

How can I use Perl to validate this data containing balanced text?

开发者 https://www.devze.com 2022-12-12 00:51 出处:网络
I have a text file filled with sentences with unique pattern.The unique pattern is: NAME [ e_NAME ] simple rule: the \"NAME\" must follow after \"e_\" if the \"e_\" appearers inside the brackets!

I have a text file filled with sentences with unique pattern. The unique pattern is:

NAME [ e_NAME ]

simple rule: the "NAME" must follow after "e_" if the "e_" appearers inside the brackets!

The problem comes out when the string is complicated. I'll show the end poi开发者_JAVA百科nt situations that may be hard to analyse:

Lines that won't match the rule:

(1) NAME1[blabla+NAME2[blabla+e_BAD2]+e_NAME1]
(2) NAME1[blabla] + NAME2[e_BAD2]
(3) NAME1[NAME2[blabla+e_BAD2]] + NAME3[e_BAD3]
(4) NAME1[e_NAME1BAD1] -> means it has to be only NAME1

Lines that match the rule:

(1) FOO1[blabla + 1]
(2) [blalbla] + bla
(3) bla + blabla
(4) FOO1[ccc + ddd + FOO2[e_FOO2]] = 123
(5) FOO1[cc + FOO2[ dd ] ] + FOO3[e_FOO3]

I already asked this question but I couldn't catch this end points...


Edited after requirements were clarified

Either Text::Balanced or Regexp::Common might be useful. I initially posted an answer using the former but didn't like it very much. The following example uses Regexp::Common and seems fairly straightforward.

use strict;
use warnings;
use Regexp::Common;

my $PRE   = '[^[]*?';
my $VAR   = '\w+';
my $BRACK = $RE{balanced}{-parens=>'[]'};
my $POST  = '.*';

while (<DATA>){
    my ($bad, $full);

    # Brackets, if any, must balance
    $bad = 1 unless s/\[/[/g == s/\]/]/g;

    $full = $_;
    until ($bad){
        # Find some bracketed text and store all components.
        my ($pre, $var, $brack, $post) =
            $full =~ /^($PRE)($VAR)($BRACK)($POST)$/;
        last unless defined $brack;

        # Create a copy of the bracketed text, removing both the outer
        # brackets and all instances of inner-bracketed text.
        chop (my $clean = substr $brack, 1);
        $clean =~ s/$BRACK/ /g;

        # If e_FOO exists, FOO must equal $var.
        $bad = 1 if $clean =~ /e_(\w+)/ and $1 ne $var;

        # Remove the part of $full we've already checked.
        substr($full, 0, length($pre) + length($var) + 1, '');
    }

    print if $bad;
}

# Your test data, with some trailing comments.    
__DATA__
NAME1[blabla+NAME2[blabla+e_BAD2]+e_NAME1]               NOT OK 1
NAME1[blabla] + NAME2[e_BAD2]                            NOT OK 2
NAME1[NAME2[blabla+e_BAD2]] + NAME3[e_BAD3]              NOT OK 3
NAME1[e_NAME1BAD1]                                       NOT OK 4
FOO1[blabla + 1]                                         OK 1
[blalbla] + bla                                          OK 2
bla + blabla                                             OK 3
FOO1[ccc + ddd + FOO2[e_FOO2]] = 123                     OK 4
FOO1[cc + FOO2[ dd ] ] + FOO3[e_FOO3]                    OK 5


Maybe you are looking for something like:

 if ($string =~ /(\w+)\[e\\_(\w+)/ &&  $1 eq $2) {
     print "Pattern '$1' contained in string '$string'\n";
 }


Based on the accepted answer to your first question, I came up with this:

use strict;
use warnings;

while (<DATA>) {
   my $l = $_;
   while (s/(\w+)\[([^\[\]]*)\]//) {
      my ($n, $chk) = ($1, $2);
      unless ($chk =~ /\be_$n\b/) {
         warn "Bad line: $l";
         last;
      }
   }
}

The \b checks for a word boundary. This version still doesn't check for unbalanced brackets, but it does seem to catch all the examples you gave, and will also complain when the e_NAME1 is inside another nested block, like so:

NAME1[stuff + NAME2[e_NAME1 + e_NAME2] + morestuff]


use Text::Balanced;

CPAN is wonderful.

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号