Capture text, including tags, from a string, and then reorder the tags with the text

Developer https://www.devze.com 2023-01-03 16:37 Source: Internet
I have the following text:

abcabcabcabc<2007-01-12><name1><2007-01-12>abcabcabcabc<name2><2007-01-11>abcabcabcabc<name3><2007-02-12>abcabcabcabc<name4>abcabcabcabc<2007-03-12><name5><date>abcabcabcabc<name6>

I need to use regular expressions in order to clean the above text:

The basic extraction rule is:

<2007-01-12>abcabcabcabc<name2>

I have no problem extracting this pattern. My issue is that the text contains malformed sequences: if a sequence doesn't start with a date and end with a name, my extraction fails. For example, the text above may have several malformed sequences, such as:

abcabcabcabc<2007-01-12><name1>

Should be:

<2007-01-12>abcabcabcabc<name1>

Is it possible to have a regular expression that would clean the above prior to extracting my consistent pattern? In short, I need to find all malformed patterns, and then take the date tag and put it in front of the text, as in the example above.

Thanks.


Do you need something like this perhaps?

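public class Extract {
    public static void main(String[] args) {
        // Sample input: the first and fifth records are malformed (text before the date tag).
        String text =
            "abcabcabcabc<2007-01-12><name1>" +
            "<2007-01-12>abcabcabcxxx<name2>" +
            "<2007-01-11>abcabcabcyyy<name3>" +
            "<2007-02-12>abcabcabczzz<name4>" +
            "abcabcabc123<2007-03-12><name5>" +
            "<date>abcabcabc456<name6>";
        System.out.println(
            text.replaceAll(
                // Template expands to ([^<]*)<([^<]*)>([^<]*)<([^<]*)>
                "(text)<(text)>(text)<(text)>"
                    .replace("text", "[^<]*"),
                // Emit "text - first tag - second tag", one record per line.
                "$1$3 - $2 - $4\n"
            )
        );
    }
}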

This prints:

abcabcabcabc - 2007-01-12 - name1
abcabcabcxxx - 2007-01-12 - name2
abcabcabcyyy - 2007-01-11 - name3
abcabcabczzz - 2007-02-12 - name4
abcabcabc123 - 2007-03-12 - name5
abcabcabc456 - date - name6

Essentially, there are 3 parts:

  • The naked text is captured by \1 and \3 -- one of these should be an empty string
  • The date is \2
  • The name is \4
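The same four groups can also be rearranged into the `<date>text<name>` layout the question asks for. The following is only a sketch under assumptions: the class name `TagReorder` and the method `normalize` are made up for illustration, and the pattern does not check that the first tag really is a date or that only one of the two text groups is non-empty.

```java
import java.util.regex.Pattern;

final class TagReorder {
    // One record: optional text, <first tag>, optional text, <second tag>.
    private static final Pattern RECORD =
        Pattern.compile("([^<]*)<([^<]*)>([^<]*)<([^<]*)>");

    /** Rewrites every record as <first tag>text<second tag>. */
    static String normalize(String text) {
        // $1 and $3 are concatenated; for the inputs shown, one of them is empty.
        return RECORD.matcher(text).replaceAll("<$2>$1$3<$4>");
    }

    public static void main(String[] args) {
        // The first record is malformed, the second is already well formed.
        System.out.println(normalize(
            "abcabcabcabc<2007-01-12><name1><2007-01-12>abcabcabcxxx<name2>"));
        // prints <2007-01-12>abcabcabcabc<name1><2007-01-12>abcabcabcxxx<name2>
    }
}
```

Well-formed records pass through unchanged, so the cleaned string can then be fed to the existing extraction pattern.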

You can of course use a Matcher and extract the individual groups too.
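A minimal sketch of the Matcher approach, assuming the same pattern as above; the class name `RecordExtractor`, the method `extract`, and the choice of a `String[]` triple per record are illustrative, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RecordExtractor {
    private static final Pattern RECORD =
        Pattern.compile("([^<]*)<([^<]*)>([^<]*)<([^<]*)>");

    /** Returns one {first tag, text, second tag} triple per record found. */
    static List<String[]> extract(String input) {
        List<String[]> records = new ArrayList<>();
        Matcher m = RECORD.matcher(input);
        while (m.find()) {
            // The naked text sits in group 1 or group 3; the other is empty.
            String body = m.group(1) + m.group(3);
            records.add(new String[] { m.group(2), body, m.group(4) });
        }
        return records;
    }

    public static void main(String[] args) {
        String input = "abc123<2007-03-12><name5><date>abc456<name6>";
        for (String[] r : extract(input)) {
            System.out.println(r[0] + " | " + r[1] + " | " + r[2]);
        }
    }
}
```

Looping with `find()` gives access to each group separately, which makes it easier to validate a record (for example, to check that the first tag looks like a date) before using it.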

References

  • regular-expressions.info/Grouping