Is there a compelling reason to use quantifiers in Perl regular expressions instead of just repeating the character?

I was performing a code review for a colleague and he had a regular expression that looked like this:

if ($value =~ /^\d\d\d\d$/) {
    #do stuff
}

I told him he should change it to:

if ($value =~ /^\d{4}$/) {
    #do stuff
}

To which he replied that he preferred the first for readability (I find the second more readable, but that's a religious debate I'll save for another day).

My question: is there an actual benefit to one over the other?

There's no such thing as absolute readability. There's what people can individually recognize, which is why people often understand their code while nobody else can. If he never uses quantifiers, he's always going to think quantifiers are hard to read because he never learns to grok them.

I most often find that people say "more readable" when they really mean "that's what I know already" or "that's what I wrote the first time". That's not necessarily the case here, though.

An absolute quantifier like {4} is just easier to specify and communicate to other programmers. Who wants to count the number of \ds by hand? You write code for other people to read, so don't make their life harder.

However, you might have missed the bug in that code because you were focused on the quantifier issue. The $ anchor allows a newline at the end of the string, and if a Perl Best Practices zealot comes along and blindly adds /xsm to all regexes (a painful experience I've seen more than a few times), the /m makes $ match before any embedded newline as well, so even more invalid input gets through. You probably want the \z absolute end-of-string anchor instead.
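A minimal sketch of the anchor difference, using made-up input strings (the variable names are only for illustration):

```perl
use strict;
use warnings;

# Hypothetical inputs: a clean value, one with a trailing newline,
# and a multi-line string that merely contains a four-digit line.
my $clean     = "1234";
my $newline   = "1234\n";
my $multiline = "abc\n1234\nxyz";

# $ matches at the end of the string or just before a final newline.
my $dollar_clean   = $clean   =~ /^\d{4}$/ ? 1 : 0;
my $dollar_newline = $newline =~ /^\d{4}$/ ? 1 : 0;

# \z matches only at the very end of the string.
my $z_newline = $newline =~ /^\d{4}\z/ ? 1 : 0;

# With /m, ^ and $ also match at every embedded line boundary,
# while \A and \z stay pinned to the start and end of the string.
my $m_multiline  = $multiline =~ /^\d{4}$/m   ? 1 : 0;
my $az_multiline = $multiline =~ /\A\d{4}\z/m ? 1 : 0;

print "\$ on clean input:          $dollar_clean\n";
print "\$ on trailing newline:     $dollar_newline\n";
print "\\z on trailing newline:    $z_newline\n";
print "^...\$ with /m, multi-line: $m_multiline\n";
print "\\A...\\z with /m:           $az_multiline\n";
```

If the surrounding code relies on a trailing newline being tolerated, \Z (capital) keeps that one behaviour without the /m surprise.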

Not that it happened in your case, but code reviews tend to turn into style or syntax reviews (because those are easier to notice) and actually miss the point of checking for proper and intended behavior and correct design. Often the style problems aren't worth worrying about considering all of the other ways you could spend time to improve code. :)

They do the exact same thing, so as far as practicality goes it's a matter of preference. Is there a tiny performance difference one way or the other? Who knows, but it's surely insignificant.

The quantifiers are more useful (and required) when the pattern length isn't fixed, for example \d{12,16}, \d{2,}, etc.

I prefer \d{4} which is easier for my brain to parse than \d\d\d\d

Also what if you're matching a character class rather than a simple digit? [aeiouy0-9]{4} or [aeiouy0-9][aeiouy0-9][aeiouy0-9][aeiouy0-9] ?
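A small sketch of that point, assuming some made-up test strings: the class can be written once in a qr// object and then quantified, and the same syntax covers a range.

```perl
use strict;
use warnings;

# The character class from the example above, written once.
my $class = qr/[aeiouy0-9]/;

# Exactly four characters from the class.
my $exactly_four = qr/^(?:$class){4}\z/;

# Two or three characters from the class.
my $two_or_three = qr/^(?:$class){2,3}\z/;

# Hypothetical inputs, chosen only for illustration.
for my $input (qw(a1e2 abcd y0 aeiou)) {
    printf "%-5s four=%s range=%s\n", $input,
        ($input =~ $exactly_four ? 'yes' : 'no'),
        ($input =~ $two_or_three ? 'yes' : 'no');
}
```

The (?:...) group around $class is there so the quantifier applies to the whole interpolated pattern and so Perl does not read $class{4} as a hash lookup.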

I'm just going to sidestep the issue of readability for now.

First let's look at what each version compiles down to.

perl -Mre=debug -e'/^\d{4}$/'

Compiling REx "^\d{4}$"
synthetic stclass "ANYOF[0-9][{unicode_all}]".
Final program:
   1: BOL (2)
   2: CURLY {4,4} (5)
   4:   DIGIT (0)
   5: EOL (6)
   6: END (0)
anchored ""$ at 4 stclass ANYOF[0-9][{unicode_all}] anchored(BOL) minlen 4 
Freeing REx: "^\d{4}$"

perl -Mre=debug -e'/^\d\d\d\d$/'

Compiling REx "^\d\d\d\d$"
Final program:
   1: BOL (2)
   2: DIGIT (3)
   3: DIGIT (4)
   4: DIGIT (5)
   5: DIGIT (6)
   6: EOL (7)
   7: END (0)
anchored ""$ at 4 stclass DIGIT anchored(BOL) minlen 4 
Freeing REx: "^\d\d\d\d$"

Now I'm going to see how well each version performs.

#! /usr/bin/env perl
use Benchmark qw':all';

cmpthese( -10, {
  'loop' => sub{ 1234 =~ /^\d{4}$/ },
  'repeat' => sub{ 1234 =~ /^\d\d\d\d$/ }
});

           Rate   loop repeat
loop   890004/s     --   -10%
repeat 983825/s    11%     --

While the /^\d\d\d\d$/ does consistently run faster, it isn't significantly faster. Which really just leaves it down to readability.

Let's take this example to the extreme:

/^\d{32}$/;
/^\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d\d$/;

I don't think there are many people who would argue that the second example is easier to read.

If we take it to the other extreme, the first style seems downright redundant.

/^\d{1}$/;
/^\d$/;

So what it really comes down to is how many repetitions of \d it takes before your preference switches from just repeating the \d to using a quantifier.

Any repetition of more than 3 or 4 will be hard to count at a glance. I consider this a compelling reason. On top of that, using the quantifier is a "denser" way to express the repeated information. To me, it's like the difference between copy-and-paste code "reuse" versus writing truly reusable code.

Sooner or later he will want to match a run of ten or more characters, and at that point he will have to use a quantifier rather than repetition, so it's better to get used to the right way now. Besides, if he insists on using repetition for longer runs, someone will have trouble counting them, which a quantifier would have made unnecessary.

\d{4} is easier to maintain than \d\d\d\d because it scales better. For example, if you later need to change it to match 11 digits, you could simply change the 4 to an 11, instead of having to add 14 characters to your regex.
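Taking that one step further, the count can live in a variable, which plain repetition cannot do. This is only a sketch: digits_re is a hypothetical helper name, and anchoring with \z is my choice here.

```perl
use strict;
use warnings;

# Hypothetical helper: build an anchored pattern for exactly $n digits.
sub digits_re {
    my ($n) = @_;
    return qr/^\d{$n}\z/;   # $n is interpolated before the regex is compiled
}

my $four   = digits_re(4);
my $eleven = digits_re(11);

print "1234 matches four\n"          if '1234'        =~ $four;
print "12345678901 matches eleven\n" if '12345678901' =~ $eleven;
print "1234 does not match eleven\n" if '1234'        !~ $eleven;
```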

Like many things, it is a matter of how far you want to take it.

A real example.

Compare:

my @lines = $header =~ m/([^\n\r]{13}|[^\n\r]+)/g; #split header into groups of up to 13 characters

my @lines = $header =~ m/([^\n\r][^\n\r][^\n\r][^\n\r][^\n\r][^\n\r][^\n\r][^\n\r][^\n\r][^\n\r][^\n\r][^\n\r][^\n\r]|[^\n\r]+)/g; #split into groups of up to 13 characters

Can you still find the pipe '|'?

I would be likely to use either form, depending on the circumstances.

Let's ignore the strawman complexity of custom character-classes repeated 96 times all on one line, and instead focus on nicely written code.

Consider:

$foo =~ m{
        (\d\d\d\d)
    [ ] (\d\d\d?)
    [ ] (\w\w)
}x;

I've used code like this to parse data from weather sensors. I use this format because it closely matches the manufacturer's documentation. This works pretty well for "fixed width" data formats that don't quite live up to the promise of fixed width fields (this is distressingly common in practice).

You can argue that I should put the spaces on separate lines or on the same line as the preceding field, rather than on the same line as the subsequent field. But that is just formatting, and is truly a problem for perltidy.

In other cases, I have used code like this:

$foo =~ m{ 
        ( \d{4}   )
    [ ] ( \d{2,3} )
    [ ] ( \w{2}   )
}x;

To keep the above readable, you've got to add more whitespace, and play with formatting a bit more.

The second style scales with complexity better -- adding custom character classes and wide fields does not break readability.
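For anyone who wants to try the quantified version, here is a runnable sketch. The record in $foo is invented for illustration; the real sensor format isn't shown here.

```perl
use strict;
use warnings;

# Hypothetical record: a 4-digit field, a 2-3 digit field, a 2-character field.
my $foo = '1013 270 NW';

# In list context the match returns the three captured fields.
# Under /x, the [ ] class is how a literal space is matched.
my @fields = $foo =~ m{
        ( \d{4}   )
    [ ] ( \d{2,3} )
    [ ] ( \w{2}   )
}x;

print join('|', @fields), "\n";
```

The pattern is unanchored, as in the examples above, so it will also find the fields inside a longer line.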

The most important thing is to be consistent within a given regex. IOW, never do this:

$foo =~ m{ 
        ( \d\d\d\d )
    [ ] ( \d{2,3}  )
    [ ] ( \w\w     )
}x;

Ultimately, code performs two functions. The most well known function is that it tells the computer what to do. But the most important, yet largely overlooked function of code is to tell the maintenance programmer what the computer is doing.

About readability: some Perl programmers use very rare features and expect them to be readable, but reading them requires understanding that rare feature.

There are many regexp newbies who do not understand what {4} is.

About benefits, the second one may be better because it takes fewer array elements in the regexp engine. Unless you are a Real Programmer, you won't be optimizing performance to nanoseconds.