Python findall and regular expressions

Developer https://www.devze.com 2023-02-17 23:40 Source: web

I am parsing an xml file (called xml below) that has lines of two varying types:

1. <line a="a1" b="b1" c="c1">
2. <line a="a2" c="c2">

I am trying to pull a2 and c2 only from the second type; however, this regular expression also captures the first type:

>>> matches = re.findall('<line a="(.*)" c="(.*)">', xml)
>>> print(matches)
[('a1" b="b1', 'c1'), ('a2', 'c2')]

How would I capture just the second type?


This makes much more sense with a proper XML parsing library like ElementTree, instead of resorting to regex. For instance:

>>> xmlstr = """\
... <root>
...   <line a="a1" b="b1" c="c1"></line>
...   <line a="a2" c="c2"></line>
... </root>
... """
>>> import xml.etree.ElementTree as ET
>>> root = ET.XML(xmlstr)
>>> root.findall('./line')
[<Element 'line' at 0x226db70>, <Element 'line' at 0x226de48>]
>>> filtered = [line for line in root.findall('./line') if line.get('b') is None]
>>> for line in filtered:
...     print(ET.tostring(line, encoding='unicode'))
...
<line a="a2" c="c2" />

>>>
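Since the question asks for the values a2 and c2 rather than the serialized element, the same filter can return the attribute values directly. This is a minimal sketch assuming the sample document from the answer above; the variable name `pairs` is only illustrative.

```python
import xml.etree.ElementTree as ET

# Sample document taken from the answer above.
xmlstr = """\
<root>
  <line a="a1" b="b1" c="c1"></line>
  <line a="a2" c="c2"></line>
</root>
"""

root = ET.fromstring(xmlstr)

# Keep only <line> elements that have no "b" attribute,
# and collect their "a" and "c" attribute values as tuples.
pairs = [
    (line.get('a'), line.get('c'))
    for line in root.findall('./line')
    if 'b' not in line.attrib
]

print(pairs)
```

This produces the same shape of result as the original `re.findall` call (a list of tuples), but it does not depend on attribute order or spacing in the source file.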


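The * operator is greedy by default, and . also matches the quote character, so (.*) can run across attribute boundaries. Try ([^"]*) instead of (.*); it cannot match past a closing quote.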
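To illustrate the suggestion, here is a small sketch comparing three patterns on the two sample lines from the question. The variable names (`greedy`, `lazy`, `strict`) are made up for the example. Note that simply switching to the non-greedy `.*?` would not be enough here, because `.` can still consume quote characters until `" c="` is found.

```python
import re

# The two line types from the question.
xml = '''<line a="a1" b="b1" c="c1">
<line a="a2" c="c2">'''

# Original pattern: "." matches quotes, so the first group can
# swallow the whole b="..." attribute of the first line.
greedy = re.findall(r'<line a="(.*)" c="(.*)">', xml)

# Non-greedy variant: still matches the first line, because the
# group keeps expanding until it reaches '" c="'.
lazy = re.findall(r'<line a="(.*?)" c="(.*?)">', xml)

# Negated character class: a group can never cross a closing quote,
# so only lines where c directly follows a will match.
strict = re.findall(r'<line a="([^"]*)" c="([^"]*)">', xml)

print(greedy)  # [('a1" b="b1', 'c1'), ('a2', 'c2')]
print(lazy)    # [('a1" b="b1', 'c1'), ('a2', 'c2')]
print(strict)  # [('a2', 'c2')]
```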
