Why is the rightmost character captured in backreference when using a character class with quantifiers?

Developer https://www.devze.com 2023-04-06 20:24 Source: the web
If I have pattern ([a-z]){2,4} and string "ab", what would I expect to see in backreference \1 ?

I'm getting "b", but why "b" rather than "a"?

I'm sure there is a valid explanation, but reading around various sites explaining regexes, I haven't found one. Anybody?


I'm not sure why nobody put this as an answer, but just for anyone hitting this page with a similar question, the answer is essentially that this regex:

([a-z]){2,4}

will match a single character between a and z at least 2 and at most 4 times. Each repetition matches one character separately and overwrites whatever the previous repetition stored in the backreference (that is, whatever the part of the expression between the () characters matched).

A similar expression (suggested in the comments on the question):

([a-z]{2,4})

moves the capture so that it surrounds the entire repetition (2 to 4 characters a-z) instead of a single character.

The parentheses represent a capture into a back-reference. When the repetition is inside the capture (the second example), it will capture all characters that make up that repetition. When the repetition is outside the capture (the first example), it will capture one letter, then repeat the process, capturing the next letter into the same back-reference, thus overwriting it. In this case, it will then repeat that process up to 2 more times, overwriting it each time.
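To see the difference between the two placements side by side, here is a minimal sketch. It assumes Python's re module, which the question does not name; most backtracking engines behave the same way for \1, but that is worth checking in your own language. The variable names are only illustrative.

```python
import re

# Quantifier OUTSIDE the group: each repetition re-captures into group 1.
outer = re.compile(r"([a-z]){2,4}")
# Quantifier INSIDE the group: the group spans the whole repetition.
inner = re.compile(r"([a-z]{2,4})")

m_outer = outer.match("ab")
m_inner = inner.match("ab")

outer_capture = m_outer.group(1)  # only the last letter that was captured
inner_capture = m_inner.group(1)  # every letter the repetition consumed

print(outer_capture)  # b
print(inner_capture)  # ab
```

Both patterns match the same text overall; only the content of the capture differs.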

So, matching against the target abc will result in \1 equaling c, and matching against the target abcd will result in \1 equaling d. With more letters, the outcome depends on the function (and language) used to run the regular expression: if the whole target must match, abcde fails to match; if a match on part of the target is enough, the back-reference \1 equals d (because the e is not part of the match).
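The targets above can be tried out with a short sketch, again assuming Python's re module. The helper last_capture is hypothetical and exists only for this illustration; re.match anchors at the start of the target only, while re.fullmatch requires the whole target to match.

```python
import re

pattern = re.compile(r"([a-z]){2,4}")

def last_capture(target, whole_string=False):
    """Return group 1 after matching, or None if there is no match.

    Hypothetical helper: whole_string=True uses fullmatch (the entire
    target must match), otherwise match (a prefix is enough).
    """
    matcher = pattern.fullmatch if whole_string else pattern.match
    m = matcher(target)
    return m.group(1) if m else None

print(last_capture("abc"))                       # c
print(last_capture("abcd"))                      # d
print(last_capture("abcde"))                     # d  (the e is left unmatched)
print(last_capture("abcde", whole_string=True))  # None (five letters exceed {2,4})
```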

The first example expression can still be used to get abc or abcd if you use the whole-match back-reference (often $& or $0, but also \& or \0, and in Tcl just an & character). This returns the entire string matched by the entire regular expression.
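As an illustration of the whole-match reference, here is a sketch in Python (an assumption, since the answer does not name a language). In Python's re module the whole match is group 0 on the match object, and \g<0> inside a replacement string.

```python
import re

pattern = r"([a-z]){2,4}"

m = re.search(pattern, "abcd")
whole = m.group(0)  # the entire text matched by the expression
last = m.group(1)   # only the last letter captured by the group

print(whole)  # abcd
print(last)   # d

# In a replacement string, \g<0> stands for the whole match.
wrapped = re.sub(pattern, r"[\g<0>]", "abc 123")
print(wrapped)  # [abc] 123
```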
