POSIX Regular Expressions Limit Repetitions

Developer https://www.devze.com 2023-02-25 06:06 Source: Internet

I am trying to grep for a maximum number of repetitions allowed on my input string and can't seem to get it working.

The input file has three lines with 1, 5 and 7 repetitions of "pq" respectively. The >=3 and >=5 expressions work fine, but the "between 3 and 5" expression {3,5} also shows the line with seven repetitions.

DEV /> cat input.txt
pq -- One occurance of pq
pqpqpqpqpq -five occurances of pq
pqpqpqpqpqpqpq -- seven occurances of pq


DEV /> grep "\(pq\)\{3,\}" input.txt
pqpqpqpqpq -five occurances of pq
pqpqpqpqpqpqpq -- seven occurances of pq


DEV /> grep "\(pq\)\{5\}" input.txt
pqpqpqpqpq -five occurances of pq
pqpqpqpqpqpqpq -- seven occurances of pq

DEV /> grep "\(pq\)\{3,5\}" input.txt
pqpqpqpqpq -five occurances of pq
pqpqpqpqpqpqpq -- seven occurances of pq

Am I doing something wrong or is this the expected behavior?

If this is the expected behavior (as the string with 7 PQs contains between 3 and 5 PQs),

1) in what cases is the maximum number of repetitions applicable? What would be the difference between {3,5} and {3,} (3 or more)?

2) I can anchor my regular expressions with "^", but what if my string does not end with "pq" and has more text?


If a line has seven repetitions of anything, it therefore also contains between 3 and 5 repetitions of that thing, and at several starting points, no less.

Use match anchors if you expect matches to be anchored. Otherwise, of course, they are not.
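
As a minimal sketch of that point, the commands below compare an unanchored and a fully anchored pattern. The `runs` helper and its sample data (bare runs of 1, 3, 5 and 7 "pq" with no other text) are made up for illustration and are not the file from the question.

```shell
# Hypothetical sample data: bare runs of 1, 3, 5 and 7 "pq", one per line.
runs() { printf '%s\n' pq pqpqpq pqpqpqpqpq pqpqpqpqpqpqpq; }

# Unanchored: a line matches if a run of 3 to 5 "pq" occurs anywhere in it,
# so the 7-repetition line is printed as well.
runs | grep '\(pq\)\{3,5\}'

echo '---'

# Anchored at both ends: the entire line must consist of 3 to 5 "pq",
# so only the 3- and 5-repetition lines are printed.
runs | grep '^\(pq\)\{3,5\}$'
```

The double anchor only helps when the line contains nothing but the repeated unit; lines with trailing text need a different way to mark where the run must stop.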

The practical difference between /X{3,}/ and /X{3,5}/ is how long a string each one matches: the extent (or span) of the match. If all you are looking for is a boolean yes/no response and there is nothing further in your pattern, it does not make much of a difference; in fact, a moderately clever regex engine will return early if it knows it is safe to do so.
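
A small sketch of the yes/no case, assuming a grep that supports `-E` and `-q`. The `has_match` helper is a hypothetical name introduced here, not part of grep.

```shell
# Print "yes" or "no" depending on whether the line ($2) contains a match
# for the extended regular expression ($1). Hypothetical helper.
has_match() {
  if printf '%s\n' "$2" | grep -Eq "$1"; then echo yes; else echo no; fi
}

seven=pqpqpqpqpqpqpq
has_match '(pq){3,}'  "$seven"   # open-ended upper bound
has_match '(pq){3,5}' "$seven"   # bounded, but still found inside the longer run
has_match '(pq){8,}'  "$seven"   # only the minimum can rule this line out
```

The first two calls give the same answer for the seven-repetition line, which is why the upper bound seems to have no effect when grep is used purely as a filter.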

One way to see the difference is with GNU grep’s -o or --only-matching option. Watch:

$ echo 123456789 | egrep -o '[0-9]{3}'
123
456
789
$ echo 123456789 | egrep -o '[0-9]{3,}'
123456789
$ echo 123456789 | egrep -o '[0-9]{3,5}'
12345
6789
$ echo 123456789 | egrep -o '[0-9]{3,5}[2468]'
123456
$ echo 1234567890 | egrep -o '[0-9]{3,5}[13579]'
12345
6789

To understand how those last two work, it is useful to get a trace of the regex engine’s attempts, including backtracking steps. You can do this using Perl in this way:

$ perl -Mre=debug -le 'print $& while 1234567890 =~ /\d{3,5}[13579]/g'
Compiling REx "\d{3,5}[13579]"
Final program:
   1: CURLY {3,5} (4)
   3:   DIGIT (0)
   4: ANYOF[13579][] (15)
  15: END (0)
stclass DIGIT minlen 4 
Matching REx "\d{3,5}[13579]" against "1234567890"
Matching stclass DIGIT against "1234567" (7 chars)
   0 <> <1234567890>         |  1:CURLY {3,5}(4)
                                  DIGIT can match 5 times out of 5...
   5 <12345> <67890>         |  4:  ANYOF[13579][](15)
                                    failed...
   4 <1234> <567890>         |  4:  ANYOF[13579][](15)
   5 <12345> <67890>         | 15:  END(0)
Match successful!
12345
Matching REx "\d{3,5}[13579]" against "67890"
Matching stclass DIGIT against "67" (2 chars)
   5 <12345> <67890>         |  1:CURLY {3,5}(4)
                                  DIGIT can match 5 times out of 5...
  10 <1234567890> <>         |  4:  ANYOF[13579][](15)
                                    failed...
   9 <123456789> <0>         |  4:  ANYOF[13579][](15)
                                    failed...
   8 <12345678> <90>         |  4:  ANYOF[13579][](15)
   9 <123456789> <0>         | 15:  END(0)
Match successful!
6789
Freeing REx: "\d{3,5}[13579]"

When you have additional constraints about what comes after the match, then which type of repetition you choose can make a big difference. Here I’ll impose a constraint on where each match is allowed to finish, by saying it needs to end before an odd digit:

$ perl -le 'print $& while 1234567890 =~ /\d{3}(?=[13579])/g'
234
678
$ perl -le 'print $& while 1234567890 =~ /\d{3,5}(?=[13579])/g'
1234
5678
$ perl -le 'print $& while 1234567890 =~ /\d{3,}(?=[13579])/g'
12345678

So when you have things that have to come afterwards, it can make a great deal of difference. When you are just deciding whether the entire line matches something, it may not be as important.


This is expected behavior. The string "pqpqpqpqpqpqpq" does in fact have between three and five repetitions of "pq", and then a few more for good measure. You may want to try anchoring your regular expression, something like ^\(pq\)\{3,5\}$.


Edit to match edited question:

  1. The maximum is applicable in all situations. What is happening is that grep is matching 5 of the 7 repetitions of "pq" (most likely the first five), and since it found a match it prints out the line.
  2. You'll have to figure out a way to change your regex to match what you want and not match what you don't. For example, to match a line starting with 3–5 repetitions of "pq", you might do something like this with an extended regular expression (grep -E): ^(pq){3,5}($|[^p]|p$|p[^q]). That matches 3–5 "pq"s followed immediately by end-of-line or any-character-other-than-"p" or "p"-followed-by-end-of-line or "p"-followed-by-any-character-other-than-"q".
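
A rough way to check a pattern of that shape, written here as an extended regular expression for `grep -E`. The first three sample lines are the ones from the question; the last two lines and the helper names `sample_lines`, `start_only` and `bounded` are made up for illustration.

```shell
# Lines from the question, plus two made-up edge cases (hypothetical data).
sample_lines() {
  printf '%s\n' \
    'pq -- One occurance of pq' \
    'pqpqpqpqpq -five occurances of pq' \
    'pqpqpqpqpqpqpq -- seven occurances of pq' \
    'pqpqpq' \
    'pqpqpqpx trailing p but no q'
}

# Start anchor only: the 7-repetition line still matches, because its
# first three to five repetitions satisfy the pattern.
start_only() { grep -E '^(pq){3,5}'; }

# Start anchor plus an explicit "not followed by another pq" tail.
bounded() { grep -E '^(pq){3,5}($|[^p]|p$|p[^q])'; }

sample_lines | start_only
echo '---'
sample_lines | bounded
```

With the tail in place, the seven-repetition line drops out while the line with trailing text after five repetitions is kept, which a plain `$` anchor would not allow.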