I am attempting to manipulate a DOM tree using lxml's etree module. One task I haven't figured out yet is how to test whether a particular node is still part of a parsed tree. Since the behavior of etree is mostly undefined if you remove nodes during _ElementTree.iter(), I do the manipulation in two phases.
First, I iterate through the parsed tree and mark some nodes for removal and certain other nodes for further processing by placing them in respective lists. The second phase consists of iterating through the list of nodes to remove and removing them from the tree. At this point, I have a list of nodes to process further and a tree that has been pruned substantially since it was first parsed.
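The two phases described above could be sketched roughly as follows. This is only an illustration: the function name prune and the two predicates should_remove and should_process are hypothetical stand-ins for whatever criteria you actually use.

```python
from lxml import etree


def prune(tree, should_remove, should_process):
    """Two-phase pruning: collect first, then remove.

    should_remove and should_process are hypothetical caller-supplied
    predicates taking an element and returning a bool.
    """
    to_remove, to_process = [], []

    # Phase 1: only collect; the tree is not modified while iterating.
    for elem in tree.iter():
        if should_remove(elem):
            to_remove.append(elem)
        elif should_process(elem):
            to_process.append(elem)

    # Phase 2: detach the collected nodes from their parents.
    for elem in to_remove:
        parent = elem.getparent()
        if parent is not None:  # the root element has no parent
            parent.remove(elem)

    # Some of these may now live in a detached subtree.
    return to_process
```

Note that to_process can still contain descendants of removed nodes, which is exactly the situation the question is about.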
What I lack is a way to test if a particular node in my nodes-to-process list still lives in the parse tree. If it isn't part of the tree, that means it's a descendant of one of the nodes I removed earlier and I want to discard it. The problem is that there isn't an obvious way to do this test cheaply. Even after a node has been removed from the _ElementTree, calling getroottree() on that node returns the original tree.
I could call iterancestors() on each node-to-process and check for the root element I would expect for an in-tree node, but this is O(n) in the depth of the node and won't scale well for deep DOM trees.
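For reference, the linear-time check mentioned here might look like the sketch below. The helper name is_in_tree is made up for illustration; it walks the parent chain with getparent() and relies on lxml returning the same Python proxy object for the root while a reference to it is held.

```python
from lxml import etree


def is_in_tree(elem, tree):
    """Return True if elem is still reachable from tree's root.

    Hypothetical helper: cost is proportional to the depth of elem.
    """
    root = tree.getroot()
    node = elem
    while node is not None:
        if node is root:
            return True
        node = node.getparent()  # None once we pass a detached subtree's top
    return False
```

Usage would be something like to_process = [e for e in to_process if is_in_tree(e, tree)] after the removal phase.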
Does anyone know of a constant time operation, given an Element and an _ElementTree, to test whether the former is part of the latter?
I realize that traversing a node's parent chain upward might be the only way to do this test, and any faster way would require some bookkeeping to be implemented by the library.
Step 0: parse the xml into a tree.
Step 1: iterate over the tree, deleting nodes that need to be deleted.
Step 2: iterate over the remaining nodes, processing those that need it.
If you own step 0, you can use iterparse() with end events to avoid building a large tree only to remove many nodes later, which also makes step 1 much simpler:

    from lxml import etree

    delete_todo = []
    for event, elem in etree.iterparse(input_xml):  # "end" events are the default
        if needs_deleting(elem):  # needs_deleting is a placeholder for your own test
            elem.clear()  # remove text, tail, attributes, and descendant elements
            delete_todo.append(elem)
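A self-contained version of the same idea, as a rough sketch only: clear_matching and needs_deleting are hypothetical names, and the input is parsed from an in-memory bytes object rather than a file.

```python
import io

from lxml import etree


def clear_matching(xml_bytes, needs_deleting):
    """Parse incrementally and empty out elements as soon as they end.

    needs_deleting is a hypothetical predicate supplied by the caller.
    Returns the root element and the list of cleared elements.
    """
    delete_todo = []
    context = etree.iterparse(io.BytesIO(xml_bytes), events=("end",))
    for event, elem in context:
        if needs_deleting(elem):
            elem.clear()  # drops text, tail, attributes, and children
            delete_todo.append(elem)
    return context.root, delete_todo
```

The cleared elements are still attached to their parents as empty shells, so they would need to be removed in a later pass if they should disappear from the tree entirely.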