A server I can't influence sends very broken XML.
Specifically, a Unicode WHITE STAR would get encoded as UTF-8 (E2 98 86) and then translated using a Latin-1 to HTML entity table. What I get is &acirc;\x98\x86 (9 bytes) in a file that's declared as UTF-8 with no DTD.
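To make the mangling concrete, here is a small reproduction sketch (my own illustration, Python 2 to match the code further down, and it assumes the server really does push each UTF-8 byte through a Latin-1-to-entity table):

import htmlentitydefs

star = u'\u2606'                          # WHITE STAR
mangled = []
for byte in star.encode('utf-8'):         # '\xe2\x98\x86'
    name = htmlentitydefs.codepoint2name.get(ord(byte))
    # 0xE2 has a named entity (acirc); 0x98 and 0x86 do not, so they
    # pass through as raw bytes
    mangled.append('&%s;' % name if name else byte)
print repr(''.join(mangled))              # '&acirc;\x98\x86' -- 9 bytes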
I couldn't configure W3C tidy in a way that doesn't garble this irreversibly. I only found how to make lxml skip it silently (see the sketch below). SAX uses Expat, which cannot recover after encountering this. I'd like to avoid BeautifulSoup for speed reasons.
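For reference, roughly what I mean by lxml skipping it silently; recover mode is my guess at the relevant option, and it just drops the bad entity without a peep:

from lxml import etree

# recover=True lets lxml parse past the undefined &acirc; entity, but the
# star is simply dropped with no indication that anything was lost
parser = etree.XMLParser(recover=True)
tree = etree.fromstring(badxml, parser)   # badxml: the broken document as a byte string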
What else is there?
BeautifulSoup is your best bet in this case. I suggest profiling before ruling out BeautifulSoup altogether.
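A rough way to measure it, assuming BeautifulSoup 3's BeautifulStoneSoup and a representative sample.xml on disk (the file name and iteration count are placeholders):

import timeit

setup = (
    "from BeautifulSoup import BeautifulStoneSoup\n"
    "badxml = open('sample.xml').read()\n"
)
stmt = ("BeautifulStoneSoup(badxml, "
        "convertEntities=BeautifulStoneSoup.HTML_ENTITIES)")
# seconds for 100 parses; let real numbers decide whether it is too slow
print timeit.timeit(stmt, setup=setup, number=100)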
Maybe something like:
import re
import htmlentitydefs as ents    # Python 2 module; use html.entities on Python 3
from lxml import etree           # or maybe 'html', if the input is still more broken

def repl_ent(m):
    # '&acirc;' -> entitydefs['acirc'], which on Python 2 is the Latin-1 byte
    # '\xe2', so together with the stray \x98 \x86 bytes the UTF-8 star is restored
    return ents.entitydefs[m.group()[1:-1]]

goodxml = re.sub(r'&\w+;', repl_ent, badxml)
etree.fromstring(goodxml)
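One caveat with the blanket substitution (my addition, not part of the snippet above): entitydefs also maps the predefined XML entities amp, lt, gt and quot, so those would get un-escaped and could break otherwise well-formed markup. A variant that leaves them alone might look like:

XML_PREDEFINED = set(['amp', 'lt', 'gt', 'quot', 'apos'])   # 'apos' included defensively

def repl_ent_safe(m):
    name = m.group()[1:-1]
    if name in XML_PREDEFINED:
        return m.group()                          # keep XML's own escaping intact
    return ents.entitydefs.get(name, m.group())   # unknown name: leave untouched

goodxml = re.sub(r'&\w+;', repl_ent_safe, badxml)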