
Fixing tostring() in Python's lxml

Source: https://www.devze.com, 2023-02-02 11:27 (reposted from the web)

lxml's tostring() function seems quite broken when printing only parts of documents. Witness:

from lxml.html import fragment_fromstring, tostring
frag = fragment_fromstring('<p>This stuff is <em>really</em> great!')
em = frag.cssselect('em').pop(0)
print(tostring(em, encoding='unicode'))  # encoding='unicode' returns str rather than bytes

I expect <em>really</em>, but instead it prints <em>really</em> great!, which is wrong: the ' great!' is not part of the selected em element. It's not only wrong, it's a nuisance, at least for processing document-structured XML, where such trailing text is common.

As I understand it, lxml stores any free text that comes after an element in that element's .tail attribute. A scan of the code for tostring() brings me to ElementTree.py's _write() function, which clearly always prints the tail. That's correct behavior for whole trees, but not for the top-level element when rendering a subtree, yet it makes no distinction.
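
A minimal sketch of the text/tail model described above, assuming Python 3 and parsing with lxml.etree (the variable names are illustrative only). It shows where each piece of text in the example ends up:

```python
from lxml import etree

# Parse the example paragraph as XML (closing tag added so it is well-formed).
p = etree.fromstring("<p>This stuff is <em>really</em> great!</p>")
em = p[0]  # the <em> child

# Text before the first child belongs to the parent's .text.
print(repr(p.text))
# Text inside <em> is its own .text.
print(repr(em.text))
# Text after </em> is stored on <em> itself, as .tail.
print(repr(em.tail))
```

Because ' great!' lives on the em element rather than on its parent, any serializer that writes the tail unconditionally will emit it along with the element.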

To get a proper tail-free rendering of the selected XML, I tried writing a toxml() function from scratch to use in its place. It basically worked, but there are many special cases in handling comments, processing instructions, namespaces, encodings, yadda yadda. So I changed gears and now just piggyback tostring(), post-processing its output to remove the offending .tail text:

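def toxml(e):
    """ Replacement for lxml's tostring() function that doesn't add spurious
    tail text. """

    from lxml.etree import tostring
    xml = tostring(e)
    if e.tail:
        # Slice off as many trailing characters as the tail is long.
        # Note: this assumes the serialized tail is exactly as long as e.tail.
        xml = xml[:-len(e.tail)]
    return xml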

A basic series of tests shows this works nicely.
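
One caveat with slicing by len(e.tail): the serialized tail can be longer than e.tail when it contains characters that get escaped, such as &. Below is a minimal sketch of that failure, assuming Python 3 and lxml.etree. The toxml here is a copy of the function above with encoding="unicode" added, so that it returns str:

```python
from lxml import etree

def toxml(e):
    """Copy of the slicing approach, returning str instead of bytes."""
    xml = etree.tostring(e, encoding="unicode")
    if e.tail:
        xml = xml[:-len(e.tail)]
    return xml

root = etree.fromstring("<p>This stuff is <em>really</em> fun &amp; great!</p>")
em = root[0]

# The parsed tail holds a bare "&", but the serializer writes "&amp;",
# so the serialized tail is four characters longer than em.tail.
sliced = toxml(em)
print(repr(em.tail))
print(sliced)  # part of the tail is still attached
```

The same kind of mismatch should occur with the default bytes output when the tail contains non-ASCII characters, since those are written as numeric character references.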

Critiques and/or suggestions?


How about xml = lxml.etree.tostring(e, with_tail=False)?

from lxml.html import fragment_fromstring
from lxml.etree import tostring
frag = fragment_fromstring('<p>This stuff is <em>really</em> great!')
em = frag.cssselect('em').pop(0)
print(tostring(em, with_tail=False, encoding='unicode'))  # encoding='unicode' returns str rather than bytes

Looks like with_tail was added in v2.0; do you have an older version?
