Multiple tag names in lxml's iterparse?

Posted on https://www.devze.com, 2023-01-12 13:00. Source: the web.
Is there a way to get multiple tag names from lxml's lxml.etree.iterparse? I have a file-like object with an expensive read operation and many tags, so getting all tags or doing two passes is suboptimal.

Edit: It would be something like Beautiful Soup's find(['tag-1', 'tag-2']), except as an argument to iterparse. Imagine parsing an HTML page for both <td> and <div> tags.


I know I'm late to the game, but maybe someone else needs help with the same issue. The tag argument of iterparse accepts a sequence of names, so this code will generate events for both Tag1 and Tag2 tags:

etree.iterparse(io.BytesIO(xml), events=('end',), tag=('Tag1', 'Tag2'))
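
For anyone who wants to try this end to end, here is a minimal sketch. The helper name collect_tags and the sample document are made up for illustration; only the iterparse call itself comes from the line above.

```python
import io
from lxml import etree

def collect_tags(xml_bytes, names):
    """Return (tag, text) pairs for every element whose tag is in names."""
    found = []
    # tag accepts a sequence of names; only matching elements produce events
    for _event, elem in etree.iterparse(io.BytesIO(xml_bytes),
                                        events=('end',), tag=names):
        found.append((elem.tag, elem.text))
    return found

# Hypothetical sample input, not from the original question
sample = b"<root><Tag1>a</Tag1><Other>x</Other><Tag2>b</Tag2><Tag1>c</Tag1></root>"
print(collect_tags(sample, ('Tag1', 'Tag2')))
```

The Other element is still parsed, since the parser has to read through it, but no event is generated for it.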


I'm not 100% sure what you mean here by "getting all tags", but perhaps this is what you're looking for:

from lxml.etree import iterparse

# file_like_object and exit_condition are placeholders for your own source and stop test
for event, elem in iterparse(file_like_object):
    if elem.tag in ('td', 'div'):
        # reached the end of an interesting tag
        print('found:', elem.tag)
        # possibly quit early to prevent further parsing
        if exit_condition:
            break

iterparse generates events on the fly during parsing, so you're only reading as much data as is required. However, there's no way you can skip reading elements during parsing, as you wouldn't know how far to skip. In the above, we just ignore tags that we're not interested in.

As you may already know: don't use XML parsers for HTML. Edit: it turns out that lxml supports HTML parsing, but you should check the docs to see to what extent.
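
For the <td>/<div> case from the question, a rough sketch using iterparse's html keyword argument might look like the following. The helper name html_tags and the sample page are assumptions for illustration, and how well malformed markup is handled is worth checking against the lxml docs.

```python
import io
from lxml import etree

def html_tags(html_bytes, names=('td', 'div')):
    """Return the tag names of matching elements, in the order they close."""
    tags = []
    # html=True asks iterparse to use the HTML parser instead of the XML one
    for _event, elem in etree.iterparse(io.BytesIO(html_bytes),
                                        events=('end',), tag=names, html=True):
        tags.append(elem.tag)
    return tags

# Hypothetical sample page, not from the original question
page = b"<html><body><div>a</div><table><tr><td>b</td></tr></table><p>c</p></body></html>"
print(html_tags(page))
```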
