How can I count overlapping substrings in Perl?_问答_开发者

i need to implement a program to count the occurrence of a substring in a string in perl. i have implemented it as follows

sub countnmstr
{
  $count =0;
  $count++ while $_[0] =~ /$_[1]/g;
  return $count;
}

$count = countnmstr("aaa","aa");

print "$count\n";

now this is what i would normally do. however, in the implementation above i want to count occur开发者_JAVA技巧rence of 'aa' in 'aaa'. here i get answer as 1 which seems reasonable but i need to consider the overlapping cases as well. hence the above case should give an answer as 2 since there are two 'aa's if we consider overlap.

can anyone suggest how to implement such a function??

Everyone is getting pretty complicated in their answers (d'oh! daotoad should have made his comment an answer!), perhaps because they are afraid of the goatse operator. I didn't name it, that's just what people call it. It uses the trick that the result of a list assignment is the number of elements in the righthand list.

The Perl idiom for counting matches is then:

 my $count = () = $_[0] =~ /($pattern)/g;

The goatse part is the = () =, which is an empty list in the middle of two assignments. The lefthand part of the goatse gets the count from the righthand side of the goatse. Note the you need a capture in the pattern because that's the list the match operator will return in list context.

Now, the next trick in your case is that you really want a positive lookbehind (or lookahead maybe). The lookarounds don't consume characters, so you don't need to keep track of the position:

 my $count = () = 'aaa' =~ /((?<=a)a)/g;

Your aaa is just an example. If you have a variable-width pattern, you have to use a lookahead. Lookbehinds in Perl have to be fixed width.

See ysth's answer ... I failed to realize that the pattern could consist solely of a zero width assertion and still work for this purpose.

You can use positive lookahead as suggested by others, and write the function as:

sub countnmstr {
    my ($haystack, $needle) = @_;
    my ($first, $rest) = $needle =~ /^(.)(.*)$/;
    return scalar (() = $haystack =~ /(\Q$first\E(?=\Q$rest\E))/g);
}

You can also use pos to adjust where the next search picks up from:

#!/usr/bin/perl

use strict; use warnings;

sub countnmstr {
    my ($haystack, $needle) = @_;
    my $adj = length($needle) - 1;
    die "Search string cannot be empty!" if $adj < 0;

    my $count = 0;
    while ( $haystack =~ /\Q$needle/g ) {
        pos $haystack -= $adj;
        $count += 1;
    }
    return $count;
}

print countnmstr("aaa","aa"), "\n";

Output:

C:\Temp> t
2

sub countnmstr
{
    my ($string, $substr) = @_;
    return scalar( () = $string =~ /(?=\Q$substr\E)/g );
}

$count = countnmstr("aaa","aa");

print "$count\n";

A few points:

//g in list context matches as many times as possible.

\Q...\E is used to auto-escape any meta characters, so that you are doing a substring count, not a subpattern count.

Using a lookahead (?= ... ) causes each match to not "consume" any of the string, allowing the following match to be attempted at the very next character.

This uses the same feature where a list assignment (in this case, to an empty list) in scalar context returns the count of elements on the right of the list assignment as the goatse/flying-lentil/spread-eagle/whatever operator, but uses scalar() instead of a scalar assignment to provide the scalar context.

$_[0] is not used directly, but instead copied to a lexical; a naive use of $_[0] in place of $string would cause the //g to start partway through the string instead of at the beginning if the passed string had a stored pos().

Update: s///g is faster, though not as fast as using index:

sub countnmstr
{
    my ($string, $substr) = @_;
    return scalar( $string =~ s/(?=\Q$substr\E)//g );
}

You could use a lookahead assertion in the regular expression:

sub countnmstr {
    my @matches = $_[0] =~ /(?=($_[1]))/g;

    return scalar @matches;
}

I suspect Sinan's suggestion will be quicker though.

you can try this, no more regex than needed.

$haystack="aaaaabbbcc";
$needle = "aa";
while ( 1 ){
    $ind = index($haystack,$needle);
    if ( $ind == -1 ) {last};
    $haystack = substr($haystack,$ind+1);
    $count++;
}
print "Total count: $count\n";

output

$ ./perl.pl
Total count: 4

If speed is an issue, the index approach suggested by ghostdog74 (with cjm's improvement) is likely to be considerably faster than the regex solutions.

use strict;
use warnings;

sub countnmstr_regex {
    my ($haystack, $needle) = @_;
    return scalar( () = $haystack =~ /(?=\Q$needle\E)/g );
}

sub countnmstr_index {
    my ($haystack, $needle) = @_;
    my $i = 0;
    my $tally = 0;
    while (1){
        $i = index($haystack, $needle, $i);
        last if $i == -1;
        $tally ++;
        $i ++;
    }
    return $tally;
}

use Benchmark qw(cmpthese);

my $size = 1;
my $h = 'aaa aaaaaa' x $size;
my $n = 'aa';

cmpthese( -2, {
    countnmstr_regex => sub { countnmstr_regex($h, $n) },
    countnmstr_index => sub { countnmstr_index($h, $n) },
} );

__END__

# Benchmarks run on Windows.
# Result using a small haystack ($size = 1).
                     Rate countnmstr_regex countnmstr_index
countnmstr_regex  93701/s               --             -66%
countnmstr_index 271893/s             190%               --

# Result using a large haystack ($size = 100).
                   Rate countnmstr_regex countnmstr_index
countnmstr_regex  929/s               --             -81%
countnmstr_index 4960/s             434%               --