Regex: Matching against groups in different order without repeating the group

https://www.devze.com 2022-12-26 16:10 Source: web
Let's say I have two strings like this:

XABY
XBAY

A simple regex that matches both would go like this:

X(AB|BA)Y

However, I have a case where A and B are complicated strings, and I'm looking for a way to avoid having to specify each of them twice (on each side of the |). Is there a way to do this (that presumably is simpler than having to specify them twice)?

Thanks


X(?:A()|B()){2}\1\2Y

Basically, you use an empty capturing group to check off each item when it's matched, then the back-references ensure that everything's been checked off.

Be aware that this relies on undocumented regex behavior, so there's no guarantee that it will work in your regex flavor--and if it does, there's no guarantee that it will continue to work as that flavor evolves. But as far as I know, it works in every flavor that supports back-references. (EDIT: It does not work in JavaScript.)
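
To make the check-off idea concrete, here is a minimal sketch using Python's re module. The sub-patterns A and B are made-up stand-ins for the "complicated strings" in the question, and the helper name matches is purely illustrative. As with the regex above, it assumes the engine keeps an empty group's capture from an earlier pass through the {2} loop, so treat it as an experiment rather than a guarantee.

```python
import re

# Hypothetical stand-ins for the "complicated" sub-patterns A and B.
A = r"\d+"
B = r"[a-z]+"

# Each alternative ends in an empty capturing group that acts as a check box.
# The back-references \1 and \2 only succeed if both empty groups took part
# in the match, i.e. A and B were each matched once.
pattern = re.compile(rf"X(?:{A}()|{B}()){{2}}\1\2Y")


def matches(text):
    """Return True if the whole string matches the check-off pattern."""
    return pattern.fullmatch(text) is not None


for sample in ["X12abY", "Xab12Y", "X1234Y", "XababY"]:
    print(sample, matches(sample))
```

Each sub-pattern is written once, and the two samples that repeat the same kind of piece (digits twice, letters twice) are rejected because one of the check boxes is never set.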

EDIT: You say you're using named groups to capture parts of the match, which adds a lot of visual clutter to the regex, if not real complexity. Well, if you happen to be using .NET regexes, you can still use simple numbered groups for the "check boxes". Here's a simplistic example that finds and picks apart a bunch of month-day strings without knowing their internal order:

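  // Requires: using System; using System.Text.RegularExpressions;
  // The two empty groups () are groups 1 and 2; \1\2 checks that both were hit.
  Regex r = new Regex(
    @"(?:
        (?<MONTH>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)()
        |
        (?<DAY>\d+)()
      ){2}
      \1\2",
    RegexOptions.IgnorePatternWhitespace);

  string input = @"30Jan Feb12 Mar23 4Apr May09 11Jun";
  foreach (Match m in r.Matches(input))
  {
    // Print month first, then day, whatever order they appeared in.
    Console.WriteLine("{0} {1}", m.Groups["MONTH"], m.Groups["DAY"]);
  }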

This works because in .NET, the presence of named groups has no effect on the ordering of the non-named groups. Named groups have numbers assigned to them, but those numbers start after the last of the non-named groups. (I know that seems gratuitously complicated, but there are good reasons for doing it that way.)

Normally you want to avoid using named and non-named capturing groups together, especially if you're using back-references, but I think this case could be a legitimate exception.


You can store regex pieces in variables, and do:

A=/* relevant regex pattern */
B=/* other regex pattern */
regex = X($A$B|$B$A)Y

This way you only have to specify each regex once, on its own line, which should make it easier to maintain.
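
As a rough illustration of the variable approach in a concrete language, the following Python sketch builds the alternation with an f-string. The two sub-patterns are invented for the example; substitute your own.

```python
import re

# Hypothetical sub-patterns; each one is written exactly once.
A = r"\d{4}-\d{2}"
B = r"[A-Z]{3}"

# Both orderings are assembled from the same two variables.
regex = re.compile(rf"X(?:{A}{B}|{B}{A})Y")

print(bool(regex.fullmatch("X2022-12ABCY")))
print(bool(regex.fullmatch("XABC2022-12Y")))
print(bool(regex.fullmatch("XABCABCY")))
```

The compiled pattern still contains each piece twice, but the source you maintain does not.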

Sidenote: You're trying to match permutations, which is OK since you only have 2 subregexes. But the number of orderings grows factorially: a third subregex already needs 6 alternatives (abc|acb|bac|bca|cab|cba), and a fourth needs 24. If you need to go down the road of permutations, there's some good discussion of that on Stack Overflow. It's for letter permutations, and the solutions use awk/bash/perl, but that at least gives you a starting point.
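
If you do go down that road, the alternation can be generated instead of typed out. This is only a sketch: any_order is a hypothetical helper name, and it wraps every piece in a non-capturing group so that pieces containing their own | stay intact.

```python
import re
from itertools import permutations


def any_order(parts):
    """Build one alternation that covers every ordering of the sub-patterns."""
    wrapped = [f"(?:{p})" for p in parts]
    orderings = ("".join(p) for p in permutations(wrapped))
    return "(?:" + "|".join(orderings) + ")"


parts = ["a", "b", "c"]  # stand-ins for three real sub-patterns
regex = re.compile("X" + any_order(parts) + "Y")

for sample in ["XabcY", "XcabY", "XaabY"]:
    print(sample, bool(regex.fullmatch(sample)))
```

The generated pattern has n! branches for n pieces, so this stays practical only for a handful of sub-patterns.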


Try this:

X((A|B){2})Y
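
One thing to check before relying on this shorter form: it does not require A and B to each appear once. A quick comparison in Python against the original X(AB|BA)Y, using the literal letters A and B from the question:

```python
import re

loose = re.compile(r"X((A|B){2})Y")   # the shorter form suggested above
strict = re.compile(r"X(AB|BA)Y")     # the pattern from the question

for sample in ["XABY", "XBAY", "XAAY", "XBBY"]:
    print(sample, bool(loose.fullmatch(sample)), bool(strict.fullmatch(sample)))
```

The shorter form accepts all four samples, while the original accepts only XABY and XBAY.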


If there are several strings, with any kind of characters in there, you'll be better off with:

X(.)+Y

Only numbers:

X([0-9])+Y

Only letters:

X([a-zA-Z])+Y

Letters and numbers:

X([a-zA-Z0-9])+Y