Editing tree in place while iterating in lxml

Developer https://www.devze.com 2023-03-09 05:14 Source: Internet
I am using lxml to parse HTML and edit it to produce a new document. Essentially, I'm trying to use it somewhat like the JavaScript DOM - I know this is not really the intended use, but much of it works well so far.

Currently, I use iterdescendants() to get an iterator over the elements and then deal with each in turn.

However, if an element is dropped during the iteration, its children are still considered, since the dropping does not affect the iteration, as you would expect. In order to get the results I want, this hack works:

from lxml.html import fromstring, tostring
import re

html = '''
<html>
<head>
</head>

<body>
    <div>
        <p class="unwanted">This content should go</p>
        <p class="fine">This content should stay</p>
    </div>

    <div id = "second" class="unwanted">
        <p class = "alreadydead">This content should not be looked at</p>
        <p class = "alreadydead">Nor should this</p>
        <div class="alreadydead">
            <p class="alreadydead">Still dead</p>
        </div>
    </div>

    <div>
        <p class="yeswanted">This content should also stay</p>
    </div>
</body>
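</html>
'''

page = fromstring(html)
# iterdescendants() returns an iterator, so it can be advanced by hand below
allElements = page.iterdescendants()

for element in allElements:
    s = "%s%s" % (element.get('class', ''), element.get('id', ''))
    if re.search('unwanted', s):
        # skip the descendants of the doomed element before dropping it
        for _ in range(len(element.findall('.//*'))):
            next(allElements)
        element.drop_tree()

print(tostring(page.body, encoding='unicode'))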

This outputs:

<body>
    <div>

        <p class="fine">This content should stay</p>
    </div>



    <div>
        <p class="yeswanted">This content should also stay</p>
    </div>
</body>

This feels like a nasty hack - is there a more sensible way to achieve this using the library?


To simplify things you can use lxml's support for regular expressions within an XPath to find and kill the unwanted nodes without needing to iterate over all descendants.

This produces the same result as your script:

import lxml.html

EXSLT_NS = 'http://exslt.org/regular-expressions'
XPATH = r"//*[re:test(@class, '\bunwanted\b') or re:test(@id, '\bunwanted\b')]"

tree = lxml.html.fromstring(html)
for node in tree.xpath(XPATH, namespaces={'re': EXSLT_NS}):
    node.drop_tree()
print(lxml.html.tostring(tree.body, encoding='unicode'))
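
If you would rather keep the matching logic in Python than in an XPath expression, a variation on the same idea is to collect the matching elements into a list first and only then drop them. Because the list is a snapshot, changing the tree afterwards cannot disturb the loop, and no manual skipping is needed. This is only a sketch, assuming Python 3 and a shortened copy of the HTML from the question; the names is_unwanted and drop_unwanted are made up for illustration and are not part of lxml.

```python
import re
from lxml.html import fromstring, tostring

# Shortened copy of the sample document from the question (an assumption).
HTML = """
<html>
<body>
    <div>
        <p class="unwanted">This content should go</p>
        <p class="fine">This content should stay</p>
    </div>
    <div id="second" class="unwanted">
        <p class="alreadydead">This content should not be looked at</p>
        <div class="alreadydead"><p class="alreadydead">Still dead</p></div>
    </div>
    <div>
        <p class="yeswanted">This content should also stay</p>
    </div>
</body>
</html>
"""

UNWANTED = re.compile(r"\bunwanted\b")


def is_unwanted(element):
    """Hypothetical helper: True if class or id contains the word 'unwanted'."""
    return any(UNWANTED.search(element.get(attr, "")) for attr in ("class", "id"))


def drop_unwanted(root):
    """Drop every matching descendant of root; return how many were dropped."""
    # Build the full list before touching the tree. Comments and processing
    # instructions have a non-string tag, so they are filtered out here.
    doomed = [
        el for el in root.iterdescendants()
        if isinstance(el.tag, str) and is_unwanted(el)
    ]
    for el in doomed:
        # drop_tree() removes the element and its children but keeps its tail text.
        el.drop_tree()
    return len(doomed)


page = fromstring(HTML)
dropped = drop_unwanted(page.body)
result = tostring(page.body, encoding="unicode")
print(dropped)
print(result)
```

A matching element nested inside another matching element is simply dropped a second time from the already detached subtree, which is harmless, so there is no need to track which subtrees are already gone.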