Parsing ODF in Python with lxml

Developer https://www.devze.com 2023-04-05 11:15 Source: Internet

I'm trying to parse the content.xml inside an ODF file. I've read the file into a string and I've got a tree object with lxml.etree:

tree = etree.XML(string)

But now I need to find every subelement that is text:a OR text:h. I was told in a previous question that I could use XPath. I've tried, but I get stuck every single time. I can't even find one of those elements.

If I try:

elem = tree.xpath('//text:p')
I just get a
XPathEvalError: Undefined namespace prefix

So how do I get a list with BOTH of those subelements in the right order so I can iterate over them?


That's because text is a namespace prefix, and lxml's xpath() does not pick up the prefixes declared in the document; you have to pass the prefix-to-URI mapping yourself. Try

tree.xpath('//text:a | //text:h',
           namespaces={'text': 'urn:oasis:names:tc:opendocument:xmlns:text:1.0'})

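| is the XPath union operator; the matching elements come back as a single list in document order, so you can iterate over them directly. See also the lxml docs.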
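
To make that concrete, here is a minimal sketch of the whole round trip. The SAMPLE document is a made-up stand-in for a real content.xml (its headings and link are invented for illustration), and the names TEXT_NS, NSMAP and found are just local helpers, not part of lxml:

```python
from lxml import etree

# Namespace URI bound to the "text" prefix in ODF documents
# (the same URI used in the xpath() call above).
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
NSMAP = {"text": TEXT_NS}

# Hypothetical, heavily trimmed stand-in for a real content.xml.
# Passed as bytes because the XML declaration names an encoding.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<office:document-content
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
    xmlns:xlink="http://www.w3.org/1999/xlink">
  <office:body>
    <office:text>
      <text:h text:outline-level="1">First heading</text:h>
      <text:p>See <text:a xlink:href="http://example.com/">a link</text:a>.</text:p>
      <text:h text:outline-level="2">Second heading</text:h>
    </office:text>
  </office:body>
</office:document-content>"""

tree = etree.XML(SAMPLE)

# The union of both node sets, with the prefix mapping supplied explicitly.
elems = tree.xpath("//text:a | //text:h", namespaces=NSMAP)

found = []
for elem in elems:
    # QName strips the "{namespace}" part of the tag, leaving "a" or "h".
    local = etree.QName(elem).localname
    # itertext() also collects text from nested child elements.
    text = "".join(elem.itertext())
    found.append((local, text))
    print(local, text)
```

The prefix used as the dictionary key does not have to match the one in the file; only the URI matters, so {'t': TEXT_NS} with '//t:a | //t:h' should select the same elements.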
