I have the following text:
abcabcabcabc<2007-01-12><name1><2007-01-12>abcabcabcabc<name2><2007-01-11>abcabcabcabc<name3><2007-02-12>abcabcabcabc<name4>abcabcabcabc<2007-03-12><name5><date>abcabcabcabc<name6>
I need to use regular expressions in order to clean the above text:
The basic extraction rule is:
<2007-01-12>abcabcabcabc<name2>
I have no problem extracting this pattern. My issue is that within the text I have malformed sequences: if a sequence doesn't start with a date and end with a name, my extraction fails. For example, the text above may have several malformed sequences, such as:
abcabcabcabc<2007-01-12><name1>
Should be:
<2007-01-12>abcabcabcabc<name1>
Is it possible to have a regular expression that would clean the above prior to extracting my consistent pattern? In short, I need to find all malformed patterns, and then take the date tag and put it in front of the text, as in the example above.
Thanks.
Do you need something like this perhaps?
public class Extract {
    public static void main(String[] args) {
        String text =
            "abcabcabcabc<2007-01-12><name1>" +
            "<2007-01-12>abcabcabcxxx<name2>" +
            "<2007-01-11>abcabcabcyyy<name3>" +
            "<2007-02-12>abcabcabczzz<name4>" +
            "abcabcabc123<2007-03-12><name5>" +
            "<date>abcabcabc456<name6>";
        System.out.println(
            text.replaceAll(
                "(text)<(text)>(text)<(text)>"
                    .replace("text", "[^<]*"),
                "$1$3 - $2 - $4\n"
            )
        );
    }
}
This prints:
abcabcabcabc - 2007-01-12 - name1
abcabcabcxxx - 2007-01-12 - name2
abcabcabcyyy - 2007-01-11 - name3
abcabcabczzz - 2007-02-12 - name4
abcabcabc123 - 2007-03-12 - name5
abcabcabc456 - date - name6
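If the goal is to rewrite the text into the consistent <date>text<name> form rather than print it, the same pattern can be reused with a different replacement string. This is only a sketch: the class name NormalizeTags and the method normalize are made up for illustration, and it assumes every record has exactly one date tag followed (possibly after the text) by one name tag, as in the sample.

```java
import java.util.regex.Pattern;

class NormalizeTags {
    // Same shape as the pattern above: text, <date>, text, <name>.
    private static final Pattern RECORD =
        Pattern.compile("([^<]*)<([^<]*)>([^<]*)<([^<]*)>");

    // Hypothetical helper: rewrites every record as <date>text<name>.
    // Whichever of groups 1 and 3 is empty contributes nothing, so
    // records that are already well formed come out unchanged.
    static String normalize(String text) {
        return RECORD.matcher(text).replaceAll("<$2>$1$3<$4>");
    }

    public static void main(String[] args) {
        System.out.println(normalize(
            "abcabcabcabc<2007-01-12><name1><2007-01-11>abcabcabcyyy<name3>"));
    }
}
```

Note that if a record had text both before and after the date, the two pieces would simply be concatenated, so this only suits input shaped like the sample.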
Essentially, there are 3 parts:
- The naked text is captured by \1 and \3 (written $1 and $3 in the replacement string) -- one of these should be an empty string
- The date is \2
- The name is \4
You can of course use a Matcher and extract the individual groups too.
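A rough sketch of the Matcher approach might look like the following. The class name ExtractGroups, the method extract, and the choice of a String[] holding {date, text, name} are all illustrative assumptions, not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExtractGroups {
    // Same four-group pattern as in the replaceAll example.
    private static final Pattern RECORD =
        Pattern.compile("([^<]*)<([^<]*)>([^<]*)<([^<]*)>");

    // Hypothetical helper: returns one {date, text, name} array per record.
    static List<String[]> extract(String text) {
        List<String[]> records = new ArrayList<>();
        Matcher m = RECORD.matcher(text);
        while (m.find()) {
            // One of groups 1 and 3 is expected to be empty,
            // so concatenating them yields the naked text.
            String body = m.group(1) + m.group(3);
            records.add(new String[] { m.group(2), body, m.group(4) });
        }
        return records;
    }

    public static void main(String[] args) {
        String sample = "abcabcabc123<2007-03-12><name5><date>abcabcabc456<name6>";
        for (String[] r : extract(sample)) {
            System.out.println("date=" + r[0] + " text=" + r[1] + " name=" + r[2]);
        }
    }
}
```

Working with a Matcher this way makes it easier to validate each piece (for example, checking that the date group really looks like a date) before deciding what to do with the record.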
References
- regular-expressions.info/Grouping