A server I can't influence sends very broken XML.
Specifically, a Unicode WHITE STAR would get encoded as UTF-8 (E2 98 86) and then translated using a Latin-1 to HTML entity table. What I get is &acirc;\x98\x86 (9 bytes) in a file that's declared as UTF-8 with no DTD.
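To make the mangling concrete, here is a small reproduction sketch (my own illustration, Python 2 to match the code further down, and it assumes the server really does push each UTF-8 byte through a Latin-1-to-entity table):

import htmlentitydefs

star = u'\u2606'                          # WHITE STAR
mangled = []
for byte in star.encode('utf-8'):         # '\xe2\x98\x86'
    name = htmlentitydefs.codepoint2name.get(ord(byte))
    # 0xE2 has a named entity (acirc); 0x98 and 0x86 do not, so they
    # pass through as raw bytes
    mangled.append('&%s;' % name if name else byte)
print repr(''.join(mangled))              # '&acirc;\x98\x86' -- 9 bytes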
I couldn't configure W3C tidy in a way that doesn't garble this irreversibly. I only found how to make lxml skip it silently (see the sketch below). SAX uses Expat, which cannot recover after encountering this. I'd like to avoid BeautifulSoup for speed reasons.
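For reference, roughly what I mean by lxml skipping it silently; recover mode is my guess at the relevant option, and it just drops the bad entity without a peep:

from lxml import etree

# recover=True lets lxml parse past the undefined &acirc; entity, but the
# star is simply dropped with no indication that anything was lost
parser = etree.XMLParser(recover=True)
tree = etree.fromstring(badxml, parser)   # badxml: the broken document as a byte string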
What else is there?
BeautifulSoup is your best bet in this case. I suggest profiling before ruling out BeautifulSoup altogether.
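A rough way to measure it, assuming BeautifulSoup 3's BeautifulStoneSoup and a representative sample.xml on disk (the file name and iteration count are placeholders):

import timeit

setup = (
    "from BeautifulSoup import BeautifulStoneSoup\n"
    "badxml = open('sample.xml').read()\n"
)
stmt = ("BeautifulStoneSoup(badxml, "
        "convertEntities=BeautifulStoneSoup.HTML_ENTITIES)")
# seconds for 100 parses; let real numbers decide whether it is too slow
print timeit.timeit(stmt, setup=setup, number=100)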
Maybe something like:
import re
import htmlentitydefs as ents    # Python 2 module; use html.entities on Python 3
from lxml import etree           # or maybe 'html', if the input is still more broken

def repl_ent(m):
    # '&acirc;' -> entitydefs['acirc'], which on Python 2 is the Latin-1 byte
    # '\xe2', so together with the stray \x98 \x86 bytes the UTF-8 star is restored
    return ents.entitydefs[m.group()[1:-1]]

goodxml = re.sub(r'&\w+;', repl_ent, badxml)
etree.fromstring(goodxml)
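One caveat with the blanket substitution (my addition, not part of the snippet above): entitydefs also maps the predefined XML entities amp, lt, gt and quot, so those would get un-escaped and could break otherwise well-formed markup. A variant that leaves them alone might look like:

XML_PREDEFINED = set(['amp', 'lt', 'gt', 'quot', 'apos'])   # 'apos' included defensively

def repl_ent_safe(m):
    name = m.group()[1:-1]
    if name in XML_PREDEFINED:
        return m.group()                          # keep XML's own escaping intact
    return ents.entitydefs.get(name, m.group())   # unknown name: leave untouched

goodxml = re.sub(r'&\w+;', repl_ent_safe, badxml)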