Help with Perl regex rules

I need some help with a regex issue in Perl. I need to match each single letter together with the non-letter characters "nucleated" around it.

That is to say... I have a string like

CDF((E)TR)FT

and I want to match ALL the following:

C, D, F((, ((E), )T, R), )F, T.

I was trying with something like

/([^A-Za-z]*[A-Za-z]{1}[^A-Za-z]*)/

but I'm obtaining:

C, D, F((, E), T, R), F, T.

It is as if once a non-letter character has been matched, it can NOT be matched again in another match.

How can I do this?
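For anyone who wants to reproduce the behaviour, here is a minimal sketch of the attempt above run globally. The helper name consumed_chunks is made up for illustration; it records each match together with pos(), the offset where the next search resumes, which shows that the trailing non-letters are already behind the search position when the next match starts.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the question): returns [chunk, offset] pairs,
# where offset is the position at which the next /g search will resume.
sub consumed_chunks {
    my ($s) = @_;
    my @rows;
    while ($s =~ /([^A-Za-z]*[A-Za-z][^A-Za-z]*)/g) {
        push @rows, [ $1, pos($s) ];
    }
    return @rows;
}

# "F((" ends at offset 5, so the next match can only start at "E".
printf "%-4s resumes at %d\n", @$_ for consumed_chunks('CDF((E)TR)FT');
```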

A little late on this. Somebody has probably proposed this already.

I would capture the non-letters on the left inside a lookahead and then consume them via a backreference, but leave the non-letters captured by the lookahead on the right unconsumed. All the captures can be seen, but because the last one is not consumed, the next pass continues right after the letter that was just matched.

Character class is simplified for clarity:
/(?=([^A-Z]*))(\1[A-Z])(?=([^A-Z]*))/

(?=([^A-Z]*)) # ahead is optional non A-Z characters, captured in grp 1
(\1[A-Z]) # capture grp 2, consume capture group 1, plus atomic letter
(?=([^A-Z]*)) # ahead is optional non A-Z characters, captured in grp 3

Run it globally in a while loop; the concatenation $2$3 (in that order) is the answer.

Test:

$samp = 'CDF((E)TR)FT';

while ( $samp =~ /(?=([^A-Z]*))(\1[A-Z])(?=([^A-Z]*))/g )
{
   print "$2$3, ";
}

output:

C, D, F((, ((E), )T, R), )F, T,
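If you want the pieces in a list rather than printed, something like the following should work. This is only a sketch: the sub name nucleated is an assumption, and the pattern is the one above with [A-Za-z] put back in place of the simplified [A-Z].

```perl
use strict;
use warnings;

# Hypothetical wrapper around the pattern above, using the full letter class.
sub nucleated {
    my ($s) = @_;
    my @parts;
    while ($s =~ /(?=([^A-Za-z]*))(\1[A-Za-z])(?=([^A-Za-z]*))/g) {
        # $2 = left non-letters plus the letter (consumed),
        # $3 = right non-letters (only looked at, not consumed)
        push @parts, "$2$3";
    }
    return @parts;
}

print join(', ', nucleated('CDF((E)TR)FT')), "\n";
print join(', ', nucleated('a-b')), "\n";
```

The second call shows the same "-" being reported on both sides, once after the a and once before the b.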

The problem is that you consume the letters and the non-letter characters the first time you encounter them, so they are no longer available to later matches. One solution is to use different regexes for the different patterns and combine the results at the end:

This matches one or more non-letters followed by a single letter that is NOT followed by a non-letter:

[^A-Z]+[A-Z](?![^A-Z])

This matches a letter enclosed by non-letters on both sides. It is wrapped in a lookahead (the text ends up in group 1), so overlapping results are found:

(?=([^A-Z]+[A-Z][^A-Z]+))

This matches a letter followed by one or more non-letters, but only if the letter is not preceded by a non-letter:

(?<![^A-Z])[A-Z][^A-Z]+

And this matches single letters that have no non-letter on either side:

(?<![^A-Z])[A-Z](?![^A-Z])

By combining the results you will have the correct desired result:

C,D,T, )T, )F, ((E), F((, R)

Once you understand the small parts, you can also join them into one regex:

#!/usr/local/bin/perl

use strict;
use warnings;

my $subject = "0C0CC(R)CC(L)C0";

# Every alternative is a lookahead, so nothing is consumed and matches may overlap.
while ($subject =~ m/(?=([^A-Z]+[A-Z][^A-Z]+))|(?=((?<![^A-Z])[A-Z][^A-Z]+))|(?=((?<![^A-Z])[A-Z](?![^A-Z])))|(?=([^A-Z]+[A-Z](?![^A-Z])))/g) {
    # matched text = $1, $2, $3 or $4; only the alternative that matched is defined
    print $1, " " if defined $1;
    print $2, " " if defined $2;
    print $3, " " if defined $3;
    print $4, " " if defined $4;
}

Output :

0C0 0C C( (R) )C C( (L) )C0
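A caveat worth checking on your own data: the sample string above never has two non-letters in a row. Tracing the combined regex by hand on the question's string CDF((E)TR)FT, the alternatives that start with [^A-Z]+ can also fire one position into a run of non-letters, so (E) gets reported in addition to ((E). One possible way around that, offered as a sketch rather than as part of the answer above, is to require that every alternative starts where no non-letter precedes it. The sub name nucleated_guarded is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical variant of the combined regex: the same four alternatives,
# with one shared guard placed in front of all of them.
sub nucleated_guarded {
    my ($subject) = @_;
    my @out;
    while ($subject =~ m{
        (?<![^A-Z])                        # guard: start of string or a letter behind us
        (?:
            (?=([^A-Z]+[A-Z][^A-Z]+))      # non-letters, letter, non-letters
          | (?=([A-Z][^A-Z]+))             # letter, non-letters
          | (?=([A-Z](?![^A-Z])))          # bare letter
          | (?=([^A-Z]+[A-Z](?![^A-Z])))   # non-letters, letter
        )
    }gx) {
        # Only the alternative that matched has a defined group.
        push @out, grep { defined } $1, $2, $3, $4;
    }
    return @out;
}

print join(' ', nucleated_guarded('0C0CC(R)CC(L)C0')), "\n";
print join(', ', nucleated_guarded('CDF((E)TR)FT')), "\n";
```

On the answer's own sample this should print the same line as above; the guard only changes the result when non-letters come in runs.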

You're right, once a character has been consumed in a regex match, it can't be matched again. In regex flavors that fully support lookaround assertions, you could do it with the regex

(?<=(\P{L}*))\p{L}(?=(\P{L}*))

where the match result would be the letter, and $1 and $2 would contain the non-letters around it. Since they are only matched in the context of lookaround assertions, they are not consumed in the match and can therefore be matched multiple times. You then need to construct the match result as $1 + $& + $2. This approach would work in .NET, for example.

In most other flavors (including Perl) that have limited support for lookaround, you can take a mixed approach, which is necessary because lookbehind expressions don't allow for indefinite repetition:

\P{L}*\p{L}(?=(\P{L}*))

Now $& will contain the non-letter characters before the letter and the letter itself, and $1 contains any non-letter characters that follow the letter.

my $subject = 'CDF((E)TR)FT';
while ($subject =~ m/\P{L}*\p{L}(?=(\P{L}*))/g) {
    # matched text = $& . $1
    print $& . $1, "\n";
}
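The loop above relies on $&. On older Perls (before 5.20, as far as I know) merely mentioning $& in a program could slow down other regex matches, so you may prefer an explicit capture group instead. A minimal sketch, with the sub name nucleated_unicode chosen only for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: same mixed approach, but with an explicit group
# in place of $&.
sub nucleated_unicode {
    my ($subject) = @_;
    my @parts;
    # Group 1: leading non-letters plus the letter (consumed).
    # Group 2: trailing non-letters, seen only through the lookahead.
    while ($subject =~ /(\P{L}*\p{L})(?=(\P{L}*))/g) {
        push @parts, $1 . $2;
    }
    return @parts;
}

print join(', ', nucleated_unicode('CDF((E)TR)FT')), "\n";
```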

Or, you could do it the hard way and tokenize first, then process the tokens:

#!/usr/bin/perl
use warnings;
use strict;

my $str = 'CDF((E)TR)FT';
my @nucleated = nucleat($str);
print "$_\n" for @nucleated;

sub nucleat {
    my($s) = @_;
    my @parts;   # return list stored here

    my @tokens = grep length, split /([a-z])/i, $s;

    # bracket the tokens with empty strings to avoid warnings
    unshift @tokens, '';
    push @tokens, '';

    foreach my $i (0..$#tokens) {
        next unless $tokens[$i] =~ /^[a-z]$/i; # one element per letter token
        my $str = '';

        if ($tokens[$i-1] !~ /^[a-z]$/i) { # punc before letter
            $str .= $tokens[$i-1];
        }

        $str .= $tokens[$i];               # the letter

        if ($tokens[$i+1] !~ /^[a-z]$/i) { # punc after letter
            $str .= $tokens[$i+1];
        }

        push @parts, $str;
    }

    return @parts;
}
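To see what the tokenizing step hands to the loop, here is a small sketch of just the split call. The sub name tokens_of is an assumption for illustration, not part of the code above.

```perl
use strict;
use warnings;

# Hypothetical helper that isolates the tokenizing step.
sub tokens_of {
    my ($s) = @_;
    # The capturing parentheses make split return the separators (the letters)
    # too; grep drops the empty fields that appear between adjacent letters.
    return grep { length } split /([a-z])/i, $s;
}

print join('|', tokens_of('CDF((E)TR)FT')), "\n";
```

Each letter becomes its own token and each run of non-letters stays in one piece, which is why the loop only has to look one token to the left and one to the right.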