
In Perl, how to find a substring that doesn't match a pattern

开发者 https://www.devze.com 2023-01-07 10:16 Source: Internet

I need to find the complement of this:

$_ = 'aaaaabaaabaaabacaaaa';

while( /([a][a][a][a])/gc){
    next if pos()%4 != 0;
    my $b_pos = (pos()/4)-1;
    print " aaaa at :$b_pos\n";
}

That is, a sequence of 4 characters that is not 'aaaa'.

The following doesn't work

$_ = 'aaaaabaaabaaabacaaaa';

while( /([^a][^a][^a][^a])/gc){
    my $b_pos = (pos()/4)-1;
    print "not aaaa at :$b_pos\n";
}

Of course I can do this

$_ = 'aaaaabaaabaaabacaaaa';

while( /(....)/gc){
    next if $1 eq 'aaaa';
    my $b_pos = (pos()/4)-1;
    print "$1 at :$b_pos\n";
}

Isn't there a more direct way?

To clarify the expected result: I need to find all 4-letter sequences that are not 'aaaa', as well as their positions.

1st code outputs

 aaaa at :0
 aaaa at :4

2nd code should output

not aaaa at :1
not aaaa at :2
not aaaa at :3

The 3rd code's output is what I'm looking for

abaa at :1
abaa at :2
abac at :3

I understand I haven't been clear enough; please accept my apologies.

What I'm trying to achieve is to divide a string into groups of 4 letters and get the value and position of the groups that don't match the pattern.

My third code gives me the expected result. It reads the string 4 letters at a time and processes the groups that aren't 'aaaa'.

I also found out, thanks to all of your suggestions, that my first code didn't work as expected: it has to skip a match when pos()%4 != 0, because that means the match spans two groups of 4. I corrected the code.

Against all expectations, mine and others', the following doesn't output anything at all:

/[^a]{4}/

I should probably stick with my 3rd code.


/(?!aaaa)/

This is a zero-width negative lookahead: it matches at the first position that is not immediately followed by aaaa.

Alternatively,

/[^a]{4}/

will match four consecutive characters, none of which is a.
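
As a quick check of both patterns against the sample string from the question, here is a minimal sketch. The variable names $s, $first and $run are only illustrative. It suggests why the second pattern prints nothing for this input: the sample never has two non-a characters in a row.

```perl
use strict;
use warnings;

my $s = 'aaaaabaaabaaabacaaaa';

# Zero-width negative lookahead: succeeds at the first offset where
# the next four characters are not 'aaaa'. $-[0] is the match offset.
my $first = $s =~ /(?!aaaa)/ ? $-[0] : undef;
print "lookahead first succeeds at offset $first\n";   # offset 2

# Four consecutive non-'a' characters: the sample has at most one
# non-'a' in a row, so this pattern cannot match.
my $run = $s =~ /[^a]{4}/ ? 1 : 0;
print "[^a]{4} matched: $run\n";                       # 0
```

Note that the lookahead succeeds at offset 2, which is not a multiple of 4, so on its own it says nothing about aligned groups.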


EDIT: After some more fiddling and thought I found the proper solution, I'll leave the previous answer for reference...

It seems /aaaa(?!aaaa)....|(?!aaaa)..../gc is the complement of /aaaa/ for your purposes:

$_ = 'aaaaabaaabaaabacaaaa';
while( /aaaa(?!aaaa)....|(?!aaaa)..../gc ){
    my $b_pos = (pos()/4)-1;
    print substr($_,$b_pos*4,4)." at :$b_pos\n";
}

Gives as result:

abaa at :1
abaa at :2
abac at :3
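
Another way to keep the match aligned on groups of 4 is the \G anchor, which pins each attempt to the position where the previous match ended. The following is a sketch under that assumption, not something proposed in the thread; the helper name non_aaaa_blocks is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns [block, index]
# pairs for every aligned 4-character block that is not 'aaaa'.
sub non_aaaa_blocks {
    my ($s) = @_;
    my @found;
    # \G starts each attempt where the previous match ended, and every
    # piece of the pattern consumes a multiple of 4 characters, so the
    # captured block always begins on a block boundary.
    while ($s =~ /\G(?:aaaa)*((?!aaaa).{4})/g) {
        # $-[1] is the start offset of capture group 1.
        push @found, [ $1, $-[1] / 4 ];
    }
    return @found;
}

print "$_->[0] at :$_->[1]\n" for non_aaaa_blocks('aaaaabaaabaaabacaaaa');
# abaa at :1
# abaa at :2
# abac at :3
```

Dividing $-[1] by 4 is exact here because each match can only start at offset 0 or at the end of an earlier match.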

Previous answer

The negative lookahead does not respect "block" iteration, even on your small sample input:

use POSIX qw(floor);
$_ = 'aaaaabaaabaaabacaaaa';
while( /(?!aaaa)..../gc ){
    my $b_pos = floor(pos()/4);
    print " !aaaa at :$b_pos str:".substr($_,$b_pos*4,4);
    print " c_pos:".(pos()-4)." str:".substr($_,(pos()-4),4)."\n";
}

With output:

 !aaaa at :1 str:abaa c_pos:2 str:aaab
 !aaaa at :2 str:abaa c_pos:6 str:aaab
 !aaaa at :3 str:abac c_pos:10 str:aaab
 !aaaa at :4 str:aaaa c_pos:14 str:acaa

This is because the lookahead is evaluated at every character offset, not in blocks of 4. In the case of aaaabaaa, the regex tries aaaa at offset 0 (rejected), then aaab at offset 1, which the lookahead accepts, so aaab is consumed rather than the aligned block baaa that one would want.

However, judicious use of map, grep and split solves the problem:

my $c = 0;
print "!aaaa at positions: ", 
      join ",", map { $$_[1] } 
                    grep { $$_[0] !~ /aaaa/ } 
                         map { [$_, $c++ ] } 
                             grep /./, split /(.{4})/, $_;
print "\n";

results in:

!aaaa at positions: 1,2,3

Explanation:

  1. split /(.{4})/, $_ splits the input into a list of blocks of 4 characters
  2. Because the pattern is a capture group, split returns the captured blocks interleaved with the empty fields between them, so we eliminate the empty fields using grep /./
  3. Now we create tuples of the block plus the block number (thus we need a $c initialized to 0)
  4. Now we keep only the tuples whose block does not match 'aaaa'
  5. Now we map to retrieve just the block number
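
The steps above can also be written with one named intermediate per stage, which may make the pipeline easier to inspect. This is only a sketch; the names $s, @blocks, @pairs, @kept and @positions are illustrative and not part of the original answer.

```perl
use strict;
use warnings;

my $s = 'aaaaabaaabaaabacaaaa';

# Steps 1-2: split with a capturing pattern, then drop the empty fields.
my @blocks = grep { /./ } split /(.{4})/, $s;

# Step 3: pair each block with its block number.
my $c = 0;
my @pairs = map { [ $_, $c++ ] } @blocks;

# Step 4: keep the pairs whose block does not match 'aaaa'.
my @kept = grep { $_->[0] !~ /aaaa/ } @pairs;

# Step 5: extract just the block numbers.
my @positions = map { $_->[1] } @kept;

print "!aaaa at positions: ", join(",", @positions), "\n";
# !aaaa at positions: 1,2,3
```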

To match your exact output:

my $c = 0;
print "",
  join "\n",
       map { $$_[0]." at :".$$_[1] }
           grep { $$_[0] !~ /aaaa/ }
                map { [$_, $c++ ] }
                    grep /./, split /(.{4})/, $_;
print "\n";


Use the negated binding operator:

$string !~ /pattern/;


How about this:

/[^a]{4}/


Try this:

/(?:(?!aaaa)[a-z]){4}/g

Before each character is matched, the lookahead ensures that the four characters starting there aren't aaaa.
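
To see what this pattern picks up on the sample string from the question, here is a small sketch that records each match with its start offset. A capture group is added only to read the matched text, and the names $s and @hits are illustrative.

```perl
use strict;
use warnings;

my $s = 'aaaaabaaabaaabacaaaa';
my @hits;

# The outer parentheses only capture the matched text;
# $-[1] is the offset where that capture starts.
while ($s =~ /((?:(?!aaaa)[a-z]){4})/g) {
    push @hits, [ $1, $-[1] ];
}

print "$_->[0] at offset $_->[1]\n" for @hits;
# aaab at offset 2
# aaab at offset 6
# aaab at offset 10
```

The offsets are not multiples of 4, so the matches do not line up with the groups of 4 described in the question.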
