Regex to split nested coordinate strings

Developer https://www.devze.com 2022-12-19 07:57 Source: Internet
I have a String of the format "[(1, 2), (2, 3), (3, 4)]", with an arbitrary number of elements. I'm trying to split it on the commas separating the coordinates, that is, to retrieve (1, 2), (2, 3), and (3, 4).

Can I do it in Java regex? I'm a complete noob but hoping Java regex is powerful enough for it. If it isn't, could you suggest an alternative?


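From Java 5 onward you can use java.util.Scanner:

Scanner sc = new Scanner(coords); // coords is the input string; Scanner has no no-arg constructor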
sc.useDelimiter("\\D+"); // skip everything that is not a digit
List<Coord> result = new ArrayList<Coord>();
while (sc.hasNextInt()) {
    result.add(new Coord(sc.nextInt(), sc.nextInt()));
}
return result;

EDIT: This works even though we don't know how many coordinates are passed in the string coords.
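
For anyone who wants to run the Scanner idea as-is, here is a minimal self-contained sketch. The Coord class is not shown in the answer, so the one below is a hypothetical stand-in, and the class and method names are mine. It assumes non-negative integers only, since the \\D+ delimiter also swallows a leading minus sign.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerCoords {
    // Hypothetical value type; the answer's Coord class is not shown.
    static final class Coord {
        final int x;
        final int y;
        Coord(int x, int y) { this.x = x; this.y = y; }
        @Override public String toString() { return "(" + x + ", " + y + ")"; }
    }

    static List<Coord> parse(String coords) {
        List<Coord> result = new ArrayList<>();
        try (Scanner sc = new Scanner(coords)) {
            sc.useDelimiter("\\D+"); // every run of non-digits is a separator
            while (sc.hasNextInt()) {
                int x = sc.nextInt();
                // Guard against an odd number of values instead of letting nextInt() fail
                if (!sc.hasNextInt()) {
                    throw new IllegalArgumentException("Odd number of values in: " + coords);
                }
                result.add(new Coord(x, sc.nextInt()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("[(1, 2), (2, 3), (3, 4)]"));
    }
}
```

Note that this accepts any text containing an even number of integers, so it does not validate the brackets or parentheses at all.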


You can use String#split() for this.

String string = "[(1, 2), (2, 3), (3, 4)]";
string = string.substring(1, string.length() - 1); // Get rid of the square brackets.
String[] parts = string.split("(?<=\\))(,\\s*)(?=\\()");
for (String part : parts) {
    part = part.substring(1, part.length() - 1); // Get rid of parentheses.
    String[] coords = part.split(",\\s*");
    int x = Integer.parseInt(coords[0]);
    int y = Integer.parseInt(coords[1]);
    System.out.printf("x=%d, y=%d\n", x, y);
}

The (?<=\\)) positive lookbehind means that the match must be preceded by ). The (?=\\() positive lookahead means that it must be followed by (. The (,\\s*) means that the string is split on the , and any whitespace after it. The \\ are there only to escape regex-specific characters (doubled because the pattern is written as a Java string literal).

That said, this particular String is recognizable as the output of List#toString(). Are you sure you're doing things the right way? ;)

Update: as per the comments, you can indeed also go the other way round and get rid of the non-digits:

String string = "[(1, 2), (2, 3), (3, 4)]";
String[] parts = string.split("\\D.");
for (int i = 1; i < parts.length; i += 3) {
    int x = Integer.parseInt(parts[i]);
    int y = Integer.parseInt(parts[i + 1]);
    System.out.printf("x=%d, y=%d\n", x, y);
}

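Here the \\D means that the string is split on any non-digit (\\d stands for a digit). The . after it consumes one more character, so a two-character separator such as ", " is matched in one step instead of leaving an empty string between its two characters. This only works because of the exact layout of this input: the leading "[(" still leaves an empty first element, and the four-character run "), (" is consumed as two adjacent matches with an empty element between them, which is why the loop starts at index 1 and advances by 3. I must however admit that I'm not sure how to eliminate those blank matches before the digits. I'm not a trained regex guru yet. Hey, Bart K, can you do it better?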
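
One way to sidestep the blank matches altogether, offered as a sketch rather than as an answer to the split() puzzle itself, is to stop splitting and instead search for the digit runs with Matcher#find(). The class and method names are hypothetical, and negative numbers are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DigitRuns {
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    // Collects every run of digits in order; no empty strings can appear
    // because find() only ever returns actual matches.
    static List<Integer> numbers(String input) {
        List<Integer> out = new ArrayList<>();
        Matcher m = NUMBER.matcher(input);
        while (m.find()) {
            out.add(Integer.parseInt(m.group()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> n = numbers("[(1, 2), (2, 3), (3, 4)]");
        // Consecutive values form the pairs
        for (int i = 0; i + 1 < n.size(); i += 2) {
            System.out.printf("x=%d, y=%d%n", n.get(i), n.get(i + 1));
        }
    }
}
```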

After all, it's ultimately better to use a parser for this. See Hubert's answer on this topic.


If you do not require the expression to validate the syntax around the coordinates, this should do:

\(\d+,\s\d+\)

This expression will return several matches (three with the input from your example).
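
As a rough illustration of what "several matches" means in Java, the sketch below collects each match of that expression as a string. The class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PairMatches {
    // Same expression as above, with backslashes doubled for a Java string literal
    private static final Pattern PAIR = Pattern.compile("\\(\\d+,\\s\\d+\\)");

    static List<String> pairs(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            out.add(m.group()); // the whole match, parentheses included
        }
        return out;
    }

    public static void main(String[] args) {
        for (String pair : pairs("[(1, 2), (2, 3), (3, 4)]")) {
            System.out.println(pair);
        }
    }
}
```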

In your question, you state that you want to "retrieve (1, 2), (2, 3), and (3, 4)". If you actually need the pair of values associated with each coordinate, you can drop the parentheses and modify the regex to do some captures:

(\d+),\s(\d+)

The Java code will look something like this:

import java.util.regex.*;

public class Test {
    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("(\\d+),\\s(\\d+)");
        Matcher matcher = pattern.matcher("[(1, 2), (2, 3), (3, 4)]");

        while (matcher.find()) {
            int x = Integer.parseInt(matcher.group(1));
            int y = Integer.parseInt(matcher.group(2));
            System.out.printf("x=%d, y=%d\n", x, y);
        }
    }
}


If you use regex, you are going to get lousy error reporting, and things will get exponentially more complicated if your requirements change (for instance, if you have to parse the sets in different square brackets into different groups).

I recommend you just write the parser by hand; it's about 10 lines of code and shouldn't be very brittle. Track everything you are doing: open parens, close parens, open brackets and close brackets. It's like a switch statement with 5 options (and a default), really not that bad.

For a minimal approach, open parens and open brackets can be ignored, so there are really only 3 cases.


This would be the bare minimum.

// Java sketch; input is the string to parse, output is your own result collector
int valueA = 0;
String lastValue = "";
// true: the delimiters themselves are returned as tokens too
StringTokenizer tokens = new StringTokenizer(input, "[](),", true);

while (tokens.hasMoreTokens()) {
    String token = tokens.nextToken().trim(); // " 2" becomes "2"

    // The token before the ) is the second int of the pair, and the first should
    // already be stored
    if (token.equals(")"))
        output.addResult(valueA, Integer.parseInt(lastValue));

    // The token before the comma is the first int of the pair
    else if (token.equals(","))
        valueA = Integer.parseInt(lastValue);

    // Just store off this token and deal with it when we hit the proper delim
    else
        lastValue = token;
}

This is no better than a minimal regex based solution EXCEPT that it will be MUCH easier to maintain and enhance. (add error checking, add a stack for paren & square brace matching and checking for misplaced commas and other invalid syntax)

As an example of expandability, if you were to have to place different sets of square-bracket delimited groups into different output sets, then the addition is something as simple as:

    // When we close the square bracket, start a new output group.
    else if(token.equals("]"))
        output.startNewGroup();
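
Put together, a runnable version of this idea might look like the sketch below. The output object above is never defined, so this version swaps in plain nested lists; the class name, the int[] pair representation and the handling of blank tokens are my assumptions. Malformed input such as "()" still fails with a NumberFormatException.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class GroupedCoordParser {
    // Returns one list of {x, y} pairs per square-bracket group.
    static List<List<int[]>> parse(String input) {
        List<List<int[]>> groups = new ArrayList<>();
        List<int[]> current = new ArrayList<>();
        int valueA = 0;
        String lastValue = "";
        StringTokenizer tokens = new StringTokenizer(input, "[](),", true);
        while (tokens.hasMoreTokens()) {
            String token = tokens.nextToken().trim();
            if (token.isEmpty() || token.equals("[") || token.equals("(")) {
                continue; // whitespace and opening delimiters carry no information here
            } else if (token.equals(")")) {
                // Second int of the pair; the first is already in valueA
                current.add(new int[] { valueA, Integer.parseInt(lastValue) });
                lastValue = "";
            } else if (token.equals(",")) {
                // Only a comma inside a pair has a number right before it
                if (!lastValue.isEmpty()) {
                    valueA = Integer.parseInt(lastValue);
                }
            } else if (token.equals("]")) {
                // Closing the square bracket starts a new output group
                groups.add(current);
                current = new ArrayList<>();
            } else {
                lastValue = token;
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        for (List<int[]> group : parse("[(1, 2), (2, 3)] [(5, 6)]")) {
            System.out.println("group of " + group.size() + " pair(s)");
        }
    }
}
```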

And checking for parens is as easy as creating a stack of chars and pushing each [ or ( onto the stack, then when you get a ] or ), pop the stack and assert that it matches. Also, when you are done, make sure your stack.size() == 0.
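
The stack check described above could be sketched roughly like this. It only looks at [ ] ( ) and ignores every other character; the class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class BracketCheck {
    static boolean balanced(String input) {
        Deque<Character> stack = new ArrayDeque<>();
        for (char c : input.toCharArray()) {
            if (c == '[' || c == '(') {
                stack.push(c); // remember every opener
            } else if (c == ']' || c == ')') {
                if (stack.isEmpty()) {
                    return false; // closer without an opener
                }
                char open = stack.pop();
                if ((c == ']' && open != '[') || (c == ')' && open != '(')) {
                    return false; // closer does not match the most recent opener
                }
            }
        }
        return stack.isEmpty(); // anything left over was never closed
    }

    public static void main(String[] args) {
        System.out.println(balanced("[(1, 2), (2, 3), (3, 4)]"));
        System.out.println(balanced("[(1, 2), (2, 3]"));
    }
}
```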


Will there always be 3 groups of coordinates that need to be analyzed?

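You could try the following, which allows multi-digit numbers and the space after the comma inside each pair:

\[(\(\d+,\s*\d+\)), (\(\d+,\s*\d+\)), (\(\d+,\s*\d+\))\]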
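
If the input really does always hold exactly three pairs, a fixed pattern with three capture groups can be used from Java roughly as follows. This sketch assumes a variant of the pattern that allows multi-digit numbers and optional whitespace after the inner comma, so that it matches the "(1, 2)" layout from the question; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ThreePairs {
    private static final String PAIR = "(\\(\\d+,\\s*\\d+\\))";
    private static final Pattern THREE =
            Pattern.compile("\\[" + PAIR + ", " + PAIR + ", " + PAIR + "\\]");

    // Returns the three pairs, or an empty array if the input has any other shape.
    static String[] split(String input) {
        Matcher m = THREE.matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        for (String pair : split("[(1, 2), (2, 3), (3, 4)]")) {
            System.out.println(pair);
        }
    }
}
```

The obvious limitation is that any other number of pairs makes the whole match fail, which is why the other answers iterate instead.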


With a regex, you can split on the comma that follows (?<=\)), which uses a positive lookbehind (in Java source the backslashes must be doubled):

String[] subs = str.replaceAll("\\[", "").replaceAll("\\]", "").split("(?<=\\)),\\s*");

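With simple string functions, you can drop the [ and ], split on the literal text "),", and then put the ) back on every piece except the last. Note that Java's String#split takes a regex, so the ) has to be escaped, for example string.split("\\),") or string.split(Pattern.quote("),")).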
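
A regex-free variant of the same idea, sketched here with indexOf and substring rather than split so that no ) needs to be restored afterwards. It assumes well-formed input that starts with [ and ends with ]; the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;

class PlainSplit {
    static List<String> pairs(String input) {
        String body = input.substring(1, input.length() - 1); // drop [ and ]
        List<String> out = new ArrayList<>();
        int start = 0;
        while (start < body.length()) {
            int close = body.indexOf(')', start);
            if (close < 0) {
                break; // no further pair
            }
            // Keep everything up to and including the ), minus surrounding spaces
            out.add(body.substring(start, close + 1).trim());
            start = close + 1;
            // Skip the comma that separates this pair from the next one
            if (start < body.length() && body.charAt(start) == ',') {
                start++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String pair : pairs("[(1, 2), (2, 3), (3, 4)]")) {
            System.out.println(pair);
        }
    }
}
```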
