
Problem capturing data inside of a capture that is optional

Developer https://www.devze.com 2023-03-31 21:23 Source: Internet

It's best to start with an example and what I've gotten so far.

Sample Data:

FOO foo@acme.com 5545
<Data><Name>tester</Name><Foo>bar</Foo></Data>

Current regex:

/FOO\s(.{1,20}@[^\s]+)\s.{0,20}\s{1,2}(<Data>.{0,100}<Name>(.{0,20})<\/Name>.{0,100}<\/Data>)?/m

Matches from regex:

  1. foo@acme.com
  2. <Data><Name>tester</Name><Foo>bar</Foo></Data>
  3. tester

I've wrapped the <Data> section in parentheses followed by a ? because the entire data section may or may not exist. However, the <Name> section is also optional. So I tried putting parentheses around <Name> with a question mark as well, but then I don't get the matches:

/FOO\s(.{1,20}@[^\s]+)\s.{0,20}\s{1,2}(<Data>.{0,100}(<Name>(.{0,20})<\/Name>)?.{0,100}<\/Data>)?/m
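
The difference between the two attempts can be reproduced outside Rubular. The following is a minimal Ruby sketch (Rubular runs Ruby regexes); the constant and method names are made up for illustration. It points at a likely cause: the greedy .{0,100} in front of the optional group is free to swallow <Name>...</Name> itself, so the optional group matches nothing and its captures stay nil.

```ruby
# Sample data from the question, with a newline between the two lines.
SAMPLE_INPUT = "FOO foo@acme.com 5545\n<Data><Name>tester</Name><Foo>bar</Foo></Data>"

# First attempt: <Name> is required inside <Data>.
REQUIRED_NAME = /FOO\s(.{1,20}@[^\s]+)\s.{0,20}\s{1,2}(<Data>.{0,100}<Name>(.{0,20})<\/Name>.{0,100}<\/Data>)?/m

# Second attempt: <Name> wrapped in an optional capturing group.
OPTIONAL_NAME = /FOO\s(.{1,20}@[^\s]+)\s.{0,20}\s{1,2}(<Data>.{0,100}(<Name>(.{0,20})<\/Name>)?.{0,100}<\/Data>)?/m

# Illustrative helper: returns the array of captures, or nil if nothing matched.
def captures_for(regex, text)
  m = regex.match(text)
  m ? m.captures : nil
end

p captures_for(REQUIRED_NAME, SAMPLE_INPUT)
# => ["foo@acme.com", "<Data><Name>tester</Name><Foo>bar</Foo></Data>", "tester"]

p captures_for(OPTIONAL_NAME, SAMPLE_INPUT)
# => ["foo@acme.com", "<Data><Name>tester</Name><Foo>bar</Foo></Data>", nil, nil]
```

The overall match still succeeds in the second case; only the groups around <Name> come back empty.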

I've posted my regex and sample data on a regex site to make it easier to test/validate what I'm trying to do: http://www.rubular.com/r/ZhQzlNp1vv

In the <Data> section there is <Name> and even <Foo>. The point is, there may be many different elements in <Data> and I only care about extracting data from some of them. I need to use regex for my particular situation so please don't suggest using some XML parsing library (thanks!).

Thanks in advance.


/FOO\s(\S+@\S+).*?\n(?:<Data>.{0,100}<Name>(.{0,20})<\/Name>.{0,100}<\/Data>)?/m

http://www.rubular.com/r/IhisH7HYJR


To capture an optional group, use a non-capturing group to indicate the optionality inside a capturing group.

i.e.

((?:content)?)

The outer parentheses form the capturing group - if the optional group doesn't match you get an empty string. The (?:...) is the non-capturing group, which allows you to group the content (so it can all be made optional) without capturing it.
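
As a small illustration of that difference, here is a Ruby sketch comparing the two grouping styles. The two patterns and the helper name are simplified stand-ins invented for this example, not the full expression from the question.

```ruby
# Style A: the ? sits directly on the capturing group.
PLAIN = /<Data>(<Name>.{0,20}<\/Name>)?<\/Data>/

# Style B: the ? sits on an inner non-capturing group, so the outer
# capturing group always takes part in the match.
WRAPPED = /<Data>((?:<Name>.{0,20}<\/Name>)?)<\/Data>/

# Illustrative helper: returns capture 1, or nil if it did not participate.
def first_capture(regex, text)
  text[regex, 1]
end

p first_capture(PLAIN,   "<Data><Name>tester</Name></Data>")  # => "<Name>tester</Name>"
p first_capture(PLAIN,   "<Data></Data>")                     # => nil
p first_capture(WRAPPED, "<Data><Name>tester</Name></Data>")  # => "<Name>tester</Name>"
p first_capture(WRAPPED, "<Data></Data>")                     # => ""
```

Which of the two is preferable depends on whether the calling code would rather test for nil or for an empty string.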

Update:
Whenever you have a complex regex, use free-spacing comment mode (flag=x) to make it readable (and thus far easier to figure out what's going on), like this:

FOO\s(.{1,20}@[^\s]+)\s.{0,20}\s{1,2}

((?:<Data>
    # up to 200 chars, excluding captured tags or end tag (repeated below)
    (?:(?!<Name>|<Foo>|<Bob>|<\/Data>).){0,200}

    # Capture 3:
    ((?:<Name>.{0,20}<\/Name>)?)

    (?:(?!<Name>|<Foo>|<Bob>|<\/Data>).){0,200}

    # Capture 4:
    ((?:<Foo>.{0,20}<\/Foo>)?)

    (?:(?!<Name>|<Foo>|<Bob>|<\/Data>).){0,200}

    # Capture 5:
    ((?:<Bob>.{0,20}<\/Bob>)?)

    (?:(?!<Name>|<Foo>|<Bob>|<\/Data>).){0,200}
<\/Data>)?)

Which at rubular results in:

1. foo@acme.com
2. <Data><Name>tester</Name><Foo>bar</Foo></Data>
3. <Name>tester</Name>
4. <Foo>bar</Foo>
5. 
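
Outside Rubular, the same free-spacing expression can be written as a multi-line Ruby literal with ordinary # comments. This is a sketch under assumptions: the m flag is carried over from the question, %r{...} is used so the slashes need no escaping, and the constant and method names are made up for the example.

```ruby
# Sample data from the question, with a newline between the two lines.
SAMPLE = "FOO foo@acme.com 5545\n<Data><Name>tester</Name><Foo>bar</Foo></Data>"

PATTERN = %r{
  FOO\s(.{1,20}@[^\s]+)\s.{0,20}\s{1,2}   # capture 1: the address

  ((?:<Data>                              # capture 2: the whole Data block
    # filler: up to 200 chars, stopping before any tag of interest
    (?:(?!<Name>|<Foo>|<Bob>|</Data>).){0,200}

    ((?:<Name>.{0,20}</Name>)?)           # capture 3
    (?:(?!<Name>|<Foo>|<Bob>|</Data>).){0,200}

    ((?:<Foo>.{0,20}</Foo>)?)             # capture 4
    (?:(?!<Name>|<Foo>|<Bob>|</Data>).){0,200}

    ((?:<Bob>.{0,20}</Bob>)?)             # capture 5
    (?:(?!<Name>|<Foo>|<Bob>|</Data>).){0,200}
  </Data>)?)
}mx

# Illustrative helper: returns the array of captures, or nil if nothing matched.
def extract(text)
  m = PATTERN.match(text)
  m && m.captures
end

p extract(SAMPLE)
# => ["foo@acme.com", "<Data><Name>tester</Name><Foo>bar</Foo></Data>",
#     "<Name>tester</Name>", "<Foo>bar</Foo>", ""]
```

When the <Data> block is missing altogether, capture 2 comes back as an empty string while captures 3 to 5 are nil, because they sit inside the group that was skipped.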

Annoyingly, Rubular doesn't seem to provide a multi-line editor when x is turned on, which sucks, and it also doesn't support standard comment syntax, so I had to change those #... to (?#...), which is less readable. Oh well.

If you need the values without the tags, you'll need a separate expression to strip those.
( Or, y'know, use a tool actually designed for the job. ;) )
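
For the tag-stripping step, one possible separate expression is sketched below in Ruby. The helper name inner_text is hypothetical, and the sketch assumes each captured element is a single tag pair without attributes, as in the sample data.

```ruby
# Hypothetical helper: returns the text between the opening and closing tag of
# a captured element such as "<Name>tester</Name>", or nil when the capture is
# nil, empty, or not a single matching tag pair.
def inner_text(element)
  return nil if element.nil? || element.empty?
  # \1 forces the closing tag to carry the same name as the opening tag.
  element[/\A<(\w+)>(.*)<\/\1>\z/m, 2]
end

p inner_text("<Name>tester</Name>")  # => "tester"
p inner_text("<Foo>bar</Foo>")       # => "bar"
p inner_text("")                     # => nil
```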
