This RegEx captures wrong number of groups

Source: https://www.devze.com, 2022-12-20 19:14 (reposted from the web)
I have to parse a string and capture some values:

FREQ=WEEKLY;WKST=MO;BYDAY=2TU,2WE

I want to capture 2 groups:

grp 1: 2, 2
grp 2: TU, WE

The numbers represent intervals; TU and WE represent weekdays. I need both.

I'm using this code:

private final static java.util.regex.Pattern regBYDAY = java.util.regex.Pattern.compile(".*;BYDAY=(?:([+-]?[0-9]*)([A-Z]{2}),?)*.*");

String rrule = "FREQ=WEEKLY;WKST=MO;BYDAY=2TU,2WE";
java.util.regex.Matcher result = regBYDAY.matcher(rrule);
if (result.matches())
{
    int grpCount = result.groupCount();
    for (int i = 1; i <= grpCount; i++)   // groups are numbered 1..groupCount()
    {
        String g = result.group(i);
        ...
    }
}

grpCount == 2. Why? If I read the (little bit of) Java documentation correctly, I should get 5: 0 = the whole expression, and 1, 2, 3, 4 = my captures 2, 2, TU and WE.

result.group(1) == "2";

I'm a C# programmer with very little Java experience, so I tested the regex in the "Regular Expression Workbench", a great C# program for testing regular expressions. There my regex works fine.

https://code.msdn.microsoft.com/RegexWorkbench

RegExWB:

.*;BYDAY=(?:([+-]?[0-9]*)([A-Z]{2}),?)*.*

Matching:
FREQ=WEEKLY;WKST=MO;BYDAY=22TU,-2WE,+223FR
  1 => 22
  1 => -2
  1 => +223
  2 => TU
  2 => WE
  2 => FR


You may also split the work into two simpler patterns. This is easier to read and, up to a point, independent of the regex implementation, because it relies only on a widely supported subset of regex features:

final Pattern re1 = Pattern.compile(".*;BYDAY=(.*)");
final Pattern re2 = Pattern.compile("(?:([+-]?[0-9]*)([A-Z]{2}),?)");

final Matcher matcher1 = re1.matcher(rrule);
if ( matcher1.matches() ) {
    final String group1 = matcher1.group(1);
    Matcher matcher2 = re2.matcher(group1);
    while(matcher2.find()) {
        System.out.println("group: " + matcher2.group(1) + " " +
                    matcher2.group(2));
    }
}


Your regex works the same in Java as it does in C#; it's just that in Java you can only access the final capture for each group. In fact, .NET is one of only two regex flavors I know of that let you retrieve intermediate captures (Perl 6 being the other).
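To see this concretely, here is a minimal sketch (the class name GroupCountDemo and the helper describe are made up for illustration; only Pattern and Matcher are real JDK classes). groupCount() returns the number of capturing groups defined in the pattern, so it stays at 2 no matter how many BYDAY entries the input has, and each group holds only the text from its last repetition:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the pattern is the one from the question, unchanged.
class GroupCountDemo {
    static final Pattern REG_BYDAY =
        Pattern.compile(".*;BYDAY=(?:([+-]?[0-9]*)([A-Z]{2}),?)*.*");

    // Returns "groupCount|group(1)|group(2)", or "no match".
    static String describe(String rrule) {
        Matcher m = REG_BYDAY.matcher(rrule);
        if (!m.matches()) {
            return "no match";
        }
        // groupCount() counts capturing groups in the pattern (group 0 excluded),
        // not how many times those groups matched.
        return m.groupCount() + "|" + m.group(1) + "|" + m.group(2);
    }

    public static void main(String[] args) {
        System.out.println(describe("FREQ=WEEKLY;WKST=MO;BYDAY=2TU,2WE"));
    }
}
```

For the string from the question this prints 2|2|WE: two groups, each holding the values from the last entry (2WE).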

This is probably the simplest way to do what you want in Java:

String s= "FREQ=WEEKLY;WKST=MO;BYDAY=22TU,-2WE,+223FR";
Pattern p = Pattern.compile("(?:;BYDAY=|,)([+-]?[0-9]+)([A-Z]{2})");
Matcher m = p.matcher(s);
while (m.find())
{
  System.out.printf("Interval: %5s, Day of Week: %s%n",
                    m.group(1), m.group(2));
}

Here's the equivalent C# code, in case you're interested:

string s = "FREQ=WEEKLY;WKST=MO;BYDAY=22TU,-2WE,+223FR";
Regex r = new Regex(@"(?:;BYDAY=|,)([+-]?[0-9]+)([A-Z]{2})");
foreach (Match m in r.Matches(s))
{
  Console.WriteLine("Interval: {0,5}, Day of Week: {1}",
                    m.Groups[1], m.Groups[2]);
}
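If you want the two lists from the question (all intervals in one, all weekdays in the other), the same find() loop can collect them as it goes. This is only a sketch: the class ByDayLists and the method collect are hypothetical names, and it assumes every BYDAY entry carries a number, since the pattern requires [0-9]+.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class ByDayLists {
    // Same pattern as in the find() loop above: one match per BYDAY entry.
    static final Pattern ENTRY =
        Pattern.compile("(?:;BYDAY=|,)([+-]?[0-9]+)([A-Z]{2})");

    // Appends each interval and each weekday to the two lists, in input order.
    static void collect(String rrule, List<String> intervals, List<String> days) {
        Matcher m = ENTRY.matcher(rrule);
        while (m.find()) {
            intervals.add(m.group(1));
            days.add(m.group(2));
        }
    }

    public static void main(String[] args) {
        List<String> intervals = new ArrayList<>();
        List<String> days = new ArrayList<>();
        collect("FREQ=WEEKLY;WKST=MO;BYDAY=22TU,-2WE,+223FR", intervals, days);
        System.out.println("grp 1: " + intervals);
        System.out.println("grp 2: " + days);
    }
}
```

The two lists stay aligned by index, so intervals.get(i) belongs to days.get(i).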


I'm a bit rusty, but I'll propose two caveats. First of all, regular expressions come in various dialects. There is a fantastic O'Reilly book about this, and there is a chance that your C# utility applies slightly different rules.

As an example, I used a similar (but different) tool and discovered that it parsed things differently.

First, it rejected your regexp (maybe a typo?): the initial "*" does not make sense unless you put a dot (.) in front of it, like this:

.*;BYDAY=(?:([+-]?[0-9]*)([A-Z]{2}),?)*.*

Now it was accepted, but it "matched" only the 2/WE part and "skipped" the 2/TU pair.

(I suggest you read about greedy and non-greedy matching to understand this a bit better.)

Therefore I updated your pattern as follows:

.*;BYDAY=(?:([+-]?[0-9]*)([A-Z]{2}),?),(?:([+-]?[0-9]*)([A-Z]{2}),?)*.*

Now it works and correctly captures 2, TU, 2 and WE.

Maybe this helps?
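A quick way to check the updated pattern in Java, as a sketch (the class FixedPairDemo and the method groups are hypothetical names used only for this illustration). One thing it shows: the pattern now defines four groups, which fits exactly two BYDAY entries. With a third entry, groups 3 and 4 keep only the last pair, and with a single entry the mandatory comma prevents a match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the four-group pattern above.
class FixedPairDemo {
    static final Pattern TWO_ENTRIES = Pattern.compile(
        ".*;BYDAY=(?:([+-]?[0-9]*)([A-Z]{2}),?),(?:([+-]?[0-9]*)([A-Z]{2}),?)*.*");

    // Returns groups 1..groupCount() separated by spaces, or "no match".
    static String groups(String rrule) {
        Matcher m = TWO_ENTRIES.matcher(rrule);
        if (!m.matches()) {
            return "no match";
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i <= m.groupCount(); i++) {
            if (i > 1) {
                sb.append(' ');
            }
            sb.append(m.group(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(groups("FREQ=WEEKLY;WKST=MO;BYDAY=2TU,2WE"));
        System.out.println(groups("FREQ=WEEKLY;WKST=MO;BYDAY=22TU,-2WE,+223FR"));
        System.out.println(groups("FREQ=WEEKLY;WKST=MO;BYDAY=2TU"));
    }
}
```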
