It's best to start with an example and what I've gotten so far.
Sample Data:
FOO foo@acme.com 5545
<Data><Name>tester</Name><Foo>bar</Foo></Data>
Current regex:
/FOO\s(.{1,20}@[^\s]+)\s.{0,20}\s{1,2}(<Data>.{0,100}<Name>(.{0,20})<\/Name>.{0,100}<\/Data>)?/m
Matches from regex:
- foo@acme.com
- <Data><Name>tester</Name><Foo>bar</Foo></Data>
- tester
I've wrapped the <Data> section in parentheses followed by a ? because the entire data section may or may not exist. However, the <Name> section is also optional; it may or may not exist. So I tried putting parentheses around <Name> with a question mark as well, but then I don't get the matches:
/FOO\s(.{1,20}@[^\s]+)\s.{0,20}\s{1,2}(<Data>.{0,100}(<Name>(.{0,20})<\/Name>)?.{0,100}<\/Data>)?/m
I've posted my regex and sample data on a regex site to make it easier to test/validate what I'm trying to do: http://www.rubular.com/r/ZhQzlNp1vv
In the <Data> section there is <Name> and even <Foo>. The point is, there may be many different elements in <Data> and I only care about extracting data from some of them. I need to use regex for my particular situation, so please don't suggest using some XML parsing library (thanks!).
Thanks in advance.
/FOO\s(\S+@\S+).*?\n(?:<Data>.{0,100}<Name>(.{0,20})<\/Name>.{0,100}<\/Data>)?/m
http://www.rubular.com/r/IhisH7HYJR
To capture an optional group, use a non-capturing group to indicate the optionality inside a capturing group, i.e.:
((?:content)?)
The outer parentheses form the capturing group; if the optional group doesn't match, you get an empty string. The (?:...) is the non-capturing group, which allows you to group the content (so it can all be made optional) without capturing it.
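To make the difference concrete, here is a minimal sketch in Python (my choice for illustration; the thread itself uses Ruby on rubular). The pattern names and the cut-down FOO/<Name> input are made up for the example. In Python an optional capturing group that does not take part in the match comes back as None, while the wrapped form comes back as an empty string:

```python
import re

# Optional capturing group: when it does not participate, the group is None.
plain = re.compile(r"FOO(<Name>\w+</Name>)?")

# Capturing group wrapped around an optional non-capturing group:
# the capture always participates, so it is "" when the content is absent.
wrapped = re.compile(r"FOO((?:<Name>\w+</Name>)?)")

with_name = "FOO<Name>tester</Name>"
without_name = "FOO"

plain_hit = plain.match(with_name).group(1)
plain_miss = plain.match(without_name).group(1)
wrapped_hit = wrapped.match(with_name).group(1)
wrapped_miss = wrapped.match(without_name).group(1)

print(repr(plain_miss), repr(wrapped_miss))
```

Both forms capture the same text when the content is present; they only differ in what you get back when it is missing.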
Update:
Whenever you have a complex regex, use free-spacing comment mode (flag=x) to make it readable (and thus far easier to figure out what's going on), like this:
FOO\s(.{1,20}@[^\s]+)\s.{0,20}\s{1,2}
((?:<Data>
# up to 200 chars, excluding captured tags or end tag (repeated below)
(?:(?!<Name>|<Foo>|<Bob>|<\/Data>).){0,200}
# Capture 3:
((?:<Name>.{0,20}<\/Name>)?)
(?:(?!<Name>|<Foo>|<Bob>|<\/Data>).){0,200}
# Capture 4:
((?:<Foo>.{0,20}<\/Foo>)?)
(?:(?!<Name>|<Foo>|<Bob>|<\/Data>).){0,200}
# Capture 5:
((?:<Bob>.{0,20}<\/Bob>)?)
(?:(?!<Name>|<Foo>|<Bob>|<\/Data>).){0,200}
<\/Data>)?)
Which at rubular results in:
1. foo@acme.com
2. <Data><Name>tester</Name><Foo>bar</Foo></Data>
3. <Name>tester</Name>
4. <Foo>bar</Foo>
5.
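If you want to try the same expression outside rubular, here is a rough Python port. It assumes re.VERBOSE and re.DOTALL as stand-ins for Ruby's x and m flags (Ruby's m makes the dot match newlines), and the variable names are just for illustration. Python accepts the plain # comments in verbose mode, so no (?#...) workaround is needed there:

```python
import re

# Assumption: re.VERBOSE and re.DOTALL play the role of Ruby's x and m flags.
PATTERN = re.compile(r"""
    FOO\s(.{1,20}@[^\s]+)\s.{0,20}\s{1,2}
    ((?:<Data>
        # up to 200 chars, excluding captured tags or end tag (repeated below)
        (?:(?!<Name>|<Foo>|<Bob>|</Data>).){0,200}
        # Capture 3:
        ((?:<Name>.{0,20}</Name>)?)
        (?:(?!<Name>|<Foo>|<Bob>|</Data>).){0,200}
        # Capture 4:
        ((?:<Foo>.{0,20}</Foo>)?)
        (?:(?!<Name>|<Foo>|<Bob>|</Data>).){0,200}
        # Capture 5:
        ((?:<Bob>.{0,20}</Bob>)?)
        (?:(?!<Name>|<Foo>|<Bob>|</Data>).){0,200}
    </Data>)?)
""", re.VERBOSE | re.DOTALL)

sample = "FOO foo@acme.com 5545\n<Data><Name>tester</Name><Foo>bar</Foo></Data>"
groups = PATTERN.search(sample).groups()

# Same record without the optional <Name> element.
no_name = "FOO foo@acme.com 5545\n<Data><Foo>bar</Foo></Data>"
groups_no_name = PATTERN.search(no_name).groups()

for number, value in enumerate(groups, start=1):
    print(number, repr(value))
```

With the <Name> element removed, capture 3 is an empty string and capture 4 still holds the <Foo> element, which is the behaviour the question was after.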
Annoyingly, rubular doesn't seem to provide a multi-line editor when x is turned on, which sucks, and it also doesn't support standard comment syntax, so I had to change those #... comments to (?#...), which is less readable. Oh well.
If you need the values without the tags, you'll need a separate expression to strip those.
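As a sketch of what that second step might look like (again in Python, with a hypothetical strip_tags helper, and assuming the captured fragments only contain simple tags without attributes):

```python
import re

def strip_tags(fragment):
    """Remove simple <Tag> and </Tag> markers, keeping the text between them."""
    return re.sub(r"</?\w+>", "", fragment)

name_value = strip_tags("<Name>tester</Name>")
foo_value = strip_tags("<Foo>bar</Foo>")
empty_value = strip_tags("")  # an optional element that was absent

print(name_value, foo_value, repr(empty_value))
```

An empty capture simply stays empty, so the helper can be applied to every captured group without checking first.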
( Or, y'know, use a tool actually designed for the job. ;) )