I am using python ElementTree to read a开发者_Python百科nd modify some content of my html files. When I am done with changes and use ElementTree.write function,
1) it adds extra html: infront of all the tags. How should I avoid that?
2) It also adds & where I have special characters. How should i avoid that?
Thank you, Divya.
You can't. ElementTree works by loading the XML, parsing it, and only storing an abstract representation. It writes that out to a string by walking the abstract representation, but it doesn't remember things like which characters were escaped as entities, or whether an element was stored as <foo/>
or <foo></foo>
(HTML: <foo>
or <foo></foo>
)
Now, since ElementTree only works with XML (not HTML), I'm guessing you're working with lxml.html -- in this case, it in fact automatically corrects certain forms of erroneous HTML, because otherwise it wouldn't be able to store it correctly.
The right way to handle HTML whose data you want to be completely preserved except how you alter it, is to grab it in tokens that remember their original representation. I've done this using sgmllib, but this is imperfect -- e.g. there's a get_starttag_text
method for getting the exact content of a start tag, but no corresponding method for end tags. It might be good enough anyway.
For example, to write out HTML where all the paragraphs are removed, one might write the function like this:
from cStringIO import StringIO
class SGMLModifier(sgmllib.SGMLParser):
def __init__(self, *args, **kwargs):
sgmllib.SGMLParser.__init__(self, *args, **kwargs)
self._file = StringIO()
def getvalue(self):
return self._file.getvalue()
def start_b(self, attributes):
# skip it
pass
def end_b(self):
# skip it
pass
def unknown_starttag(self, tag, attributes):
self._file.write(self.get_starttag_text())
def unknown_endtag(self, tag):
# we can't get this verbatim.
self._file.write('</%s>' % tag)
def handle_comment(self, comment):
# no verbatim here either.
self._file.write('<!-- %s -->' % comment)
def handle_data(self, data):
self._file.write(data)
def convert_entityref(self, ref):
return '&' + ref + ';'
def remove_bold(html):
parser = SGMLModifier()
parser.feed(html)
return parser.getvalue()
This might need a bit more work to not mangle the input. Check the documentation for details on everything.
精彩评论