Need fresh eyes for Java regular expression, which is too greedy

Developer https://www.devze.com 2023-03-21 19:25 Source: web
I have a string of the form:

canonical_class_name[key1="value1",key2="value2",key3="value3",...] 

The purpose is to capture the canonical_class_name in a group and then alternating key=value groups. Currently it does not match a test string (in the following program, testString).

There must be at least one key/value pair, but there may be many such pairs.

Question: Currently the regex grabs the canonical class name and the first key correctly, but then it gobbles up everything until the last double quote. How do I make it grab the key/value pairs lazily?

Here is the regular expression which the following program puts together:

(\S+)\[\s*(\S+)\s*=\s*"(.*)"\s*(?:\s*,\s*(\S+)\s*=\s*"(.*)"\s*)*\]

Depending on your preference, you may find the program's version easier to read.

If my program is passed the String:

org.myobject[key1=\"value1\", key2=\"value2\", key3=\"value3\"]

...these are the groups I get:

Group1 contains: org.myobject
Group2 contains: key1
Group3 contains: value1", key2="value2", key3="value3

One more note: using String.split() I could simplify the expression, but I'm using this as a learning experience to improve my understanding of regex, so I don't want to take that shortcut.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BasicORMParser {
    String regex =
            "canonicalName\\[ map (?: , map )*\\]"
            .replace("canonicalName", "(\\S+)")
            .replace("map", "key = \"value\"")
            .replace("key", "(\\S+)")
            .replace("value", "(.*)")
            .replace(" ", "\\s*"); 

    List<String> getGroups(String ormString){
        List<String> values = new ArrayList<>();
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(ormString);
        if (!matcher.matches()) {
            String msg = String.format("String failed regex validation. Required: %s , found: %s", regex, ormString);
            throw new RuntimeException(msg);
        }
        if(matcher.groupCount() < 2){
            String msg = String.format("Did not find Class and at least one key value.");
            throw new RuntimeException(msg);
        }
        // Groups are numbered 1..groupCount() inclusive, so use <= to include the last one.
        for (int i = 1; i <= matcher.groupCount(); i++) {
            values.add(matcher.group(i));
        }
        return values;
    }
}


You practically answered the question yourself: make them lazy. That is, use lazy (a.k.a. non-greedy or reluctant) quantifiers. Just change each (\S+) to (\S+?), and each (.*) to (.*?). But if it were me, I'd change those subexpressions so they can never match too much, regardless of greediness. For example, you could use ([^\s\[]+) for the class name, ([^\s=]+) for the key, and "([^"]*)" for the value.
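To see the difference between the three choices for the value subexpression, here is a minimal sketch. The class name QuantifierDemo, the helper firstValue, and the shortened two-pair input are made up for illustration; only the three subexpressions come from the discussion above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative, not from the question.
class QuantifierDemo {
    // Returns capture group 1 of the first match, or null if nothing matches.
    static String firstValue(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String s = "key1=\"value1\", key2=\"value2\"";
        // Greedy: runs to the LAST double quote in the input.
        System.out.println(firstValue("\"(.*)\"", s));
        // Reluctant: stops at the first closing quote it can.
        System.out.println(firstValue("\"(.*?)\"", s));
        // Negated class: cannot cross a quote at all, whatever the greediness.
        System.out.println(firstValue("\"([^\"]*)\"", s));
    }
}
```

The reluctant and negated-class versions both capture value1 here, but only the negated class makes it impossible for the group to ever contain a double quote.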

I don't think that's going to solve your real problem, though. Once you've got it so it correctly matches all the key/value pairs, you'll find that it only captures the first pair (groups #2 and #3) and the last pair (groups #4 and #5). That's because, each time (?:\s*,\s*(\S+)\s*=\s*"(.*)"\s*)* gets repeated, those two groups get their contents overwritten, and whatever they captured on the previous iteration is lost. There's no getting around it: this is at least a two-step operation. For example, you could match all of the key/value pairs as a block, then break out the individual pairs.
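One way the two-step idea might look is sketched below. The class TwoStepParser, its method names, and both patterns are assumptions chosen for illustration, not code from the question. Note that the second step only picks out well-formed pairs; it does not validate whatever sits between them.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical two-step parser; names and patterns are illustrative assumptions.
class TwoStepParser {
    // Step 1: the class name, then everything between the brackets as one block.
    private static final Pattern OUTER = Pattern.compile("([^\\s\\[]+)\\[(.*)\\]");
    // Step 2: a single key="value" pair inside that block.
    private static final Pattern PAIR = Pattern.compile("([^\\s=,]+)\\s*=\\s*\"([^\"]*)\"");

    private static Matcher outer(String input) {
        Matcher m = OUTER.matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not in name[key=\"value\",...] form: " + input);
        }
        return m;
    }

    static String className(String input) {
        return outer(input).group(1);
    }

    static Map<String, String> pairs(String input) {
        Map<String, String> result = new LinkedHashMap<>(); // keeps insertion order
        Matcher pair = PAIR.matcher(outer(input).group(2));
        while (pair.find()) {            // one find() per pair, so nothing is overwritten
            result.put(pair.group(1), pair.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "org.myobject[key1=\"value1\", key2=\"value2\", key3=\"value3\"]";
        System.out.println(className(s));
        System.out.println(pairs(s));
    }
}
```

Because each pair is produced by its own call to find(), every key and value is available in turn instead of only the first and last.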

One more thing. This line:

if(matcher.groupCount() < 2){

...probably isn't doing what you think it does. groupCount() is a fixed property of the pattern the Matcher was created from, not of any particular match; it tells you how many capturing groups there are in the regex. Whether the match succeeds or fails, groupCount() will always return the same value, in this case five. If the match succeeds, some of the capture groups may be null (indicating that they didn't participate in the match), but there will always be five of them.
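A quick way to check this for yourself is sketched below, using the regex from the question unchanged. The class GroupCountDemo, its helper methods, and the non-matching input are hypothetical and exist only for the demonstration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex itself is taken from the question.
class GroupCountDemo {
    static final Pattern P = Pattern.compile(
        "(\\S+)\\[\\s*(\\S+)\\s*=\\s*\"(.*)\"\\s*(?:\\s*,\\s*(\\S+)\\s*=\\s*\"(.*)\"\\s*)*\\]");

    // Attempts a match, ignores the outcome, and reports groupCount().
    static int groupCountAfterMatching(String input) {
        Matcher m = P.matcher(input);
        m.matches(); // result deliberately ignored
        return m.groupCount();
    }

    // Returns the given group of a successful match (may be null).
    static String groupOf(String input, int index) {
        Matcher m = P.matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("no match: " + input);
        }
        return m.group(index);
    }

    public static void main(String[] args) {
        String good = "org.myobject[key1=\"value1\", key2=\"value2\", key3=\"value3\"]";
        String bad = "no brackets here";
        System.out.println(groupCountAfterMatching(good)); // same number either way
        System.out.println(groupCountAfterMatching(bad));
        System.out.println(groupOf(good, 4)); // group 4 did not participate
    }
}
```

To test whether a particular group took part in the match, compare matcher.group(n) with null rather than looking at groupCount().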


EDIT: I suspect this is what you were trying for initially:

Pattern p = Pattern.compile(
    "(?:([^\\s\\[]+)\\[|\\G)([^\\s=]+)=\"([^\"]*)\"[,\\s]*");

String s = "org.myobject[key1=\"value1\", key2=\"value2\", key3=\"value3\"]";
Matcher m = p.matcher(s);
while (m.find())
{
  if (m.group(1) != null)
  {
    System.out.printf("class : %s%n", m.group(1));
  }
  System.out.printf("key : %s, value : %s%n", m.group(2), m.group(3));
}

output:

class : org.myobject
key : key1, value : value1
key : key2, value : value2
key : key3, value : value3

The key to understanding the regex is this part: (?:([^\s\[]+)\[|\G). On the first pass it matches the class name and the opening square bracket. After that, \G takes over, anchoring the next match to the position where the previous match ended.
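To illustrate what that anchoring buys you, here is a small sketch that feeds both the \G pattern and a plain unanchored pair pattern a deliberately malformed string. The class ContiguousMatchDemo, the UNANCHORED pattern, and the input containing the stray word "junk" are assumptions made up for this comparison.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison; only the ANCHORED regex comes from the answer above.
class ContiguousMatchDemo {
    static final Pattern ANCHORED = Pattern.compile(
        "(?:([^\\s\\[]+)\\[|\\G)([^\\s=]+)=\"([^\"]*)\"[,\\s]*");
    // Same pair syntax, but free to match anywhere in the input.
    static final Pattern UNANCHORED = Pattern.compile(
        "([^\\s=\\[]+)=\"([^\"]*)\"");

    // Collects the key group from every match that find() produces.
    static List<String> keys(Pattern p, int keyGroup, String input) {
        List<String> keys = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            keys.add(m.group(keyGroup));
        }
        return keys;
    }

    public static void main(String[] args) {
        // Malformed on purpose: "junk" sits between the two pairs.
        String s = "org.myobject[key1=\"value1\" junk key2=\"value2\"]";
        System.out.println(keys(ANCHORED, 2, s));   // stops at the gap
        System.out.println(keys(UNANCHORED, 1, s)); // skips over the gap
    }
}
```

With \G, matching stops at the first thing that isn't a pair, so key2 is never reported; the unanchored pattern silently skips the junk and reports both keys.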


For non-greedy matching, append a ? to the quantifier. For example, .*? matches as few characters as possible while still letting the rest of the pattern match.
