
Getting expat to use a .dtd for entity replacement in Python

Developer https://www.devze.com 2023-01-01 02:17 Source: Internet

I'm trying to read in an xml file which looks like this

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<incollection>
<author>Jos&eacute; A. Blakeley</author>
</incollection>
</dblp>

The problem is caused by the

Jos&eacute; A. Blakeley

part: the parser calls its character data handler twice, once with "Jos" and once with " A. Blakeley". I understand this may be the correct behaviour if the parser doesn't know the eacute entity. However, the entity is defined in dblp.dtd, which I have. I can't seem to convince expat to use this file, though. All I have so far is

import xml.parsers.expat

p = xml.parsers.expat.ParserCreate()
# tried with and without the following line
p.SetParamEntityParsing(xml.parsers.expat.XML_PARAM_ENTITY_PARSING_ALWAYS)
p.UseForeignDTD(True)
# binary mode: Python 3's ParseFile needs bytes, and expat decodes them itself
f = open(dblp_file, "rb")
p.ParseFile(f)

but expat still doesn't recognize my entity. Why is there no way to tell expat which DTD to use? I've tried:

  • putting the file into the same directory as the XML
  • putting the file into the program's working directory
  • replacing the reference in the xml file by an absolute path

What am I missing? Thx.


As I understand it, if you're using pyexpat directly, then you have to provide your own ExternalEntityRefHandler to fetch the external DTD and feed it to expat.

See, e.g., xml.sax.expatreader for example code (the external_entity_ref method, line 374 in Python 2.6).
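
A minimal sketch of that approach with pyexpat directly, assuming Python 3. The function name parse_text_with_dtd and the load_dtd callback are made up for illustration, and the one-line DTD in the demo is only a stand-in for the real dblp.dtd:

```python
import xml.parsers.expat as expat


def parse_text_with_dtd(xml_bytes, load_dtd):
    """Return the document's character data with DTD entities expanded.

    load_dtd is a hypothetical callback: it receives the system identifier
    from the DOCTYPE (e.g. "dblp.dtd") and returns the DTD content as bytes.
    """
    chunks = []
    parser = expat.ParserCreate()
    # Ask expat to process the external DTD subset; it then calls the
    # ExternalEntityRefHandler instead of silently skipping the DTD.
    parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)

    def on_external_entity(context, base, system_id, public_id):
        # A child parser shares the entity declarations with its parent.
        child = parser.ExternalEntityParserCreate(context)
        child.Parse(load_dtd(system_id), True)
        return 1  # non-zero tells expat to keep going

    parser.ExternalEntityRefHandler = on_external_entity
    parser.CharacterDataHandler = chunks.append
    parser.Parse(xml_bytes, True)
    # The handler may still fire several times, so join the pieces.
    return "".join(chunks)


# In-memory demo; a real load_dtd would open the file named by system_id.
doc = (b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'
       b'<!DOCTYPE dblp SYSTEM "dblp.dtd">\n'
       b'<dblp><incollection><author>Jos&eacute; A. Blakeley</author>'
       b'</incollection></dblp>')
stand_in_dtd = b'<!ENTITY eacute "&#233;">'
text = parse_text_with_dtd(doc, lambda system_id: stand_in_dtd)
print(ascii(text))  # ascii() keeps the output safe on any terminal
```

Joining the chunks sidesteps the question of how often the character data handler fires; setting parser.buffer_text = True is another way to get fewer, larger calls.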

It would probably be better to use a higher-level interface such as SAX (via expatreader) if you can.
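
A rough sketch of the SAX route, assuming Python 3.7.1 or later, where external general entities are off by default and have to be enabled with feature_external_ges. The class and function names (TextCollector, FixedDTDResolver, sax_text) are invented for this example, and the resolver serves a stand-in DTD from memory rather than the real dblp.dtd:

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource


class TextCollector(ContentHandler):
    """Collects all character data reported by the parser."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


class FixedDTDResolver(EntityResolver):
    """Hypothetical resolver: answers every request with the same DTD bytes."""

    def __init__(self, dtd_bytes):
        self.dtd_bytes = dtd_bytes

    def resolveEntity(self, publicId, systemId):
        source = InputSource(systemId)
        source.setByteStream(io.BytesIO(self.dtd_bytes))
        return source


def sax_text(xml_bytes, dtd_bytes):
    collector = TextCollector()
    reader = xml.sax.make_parser()
    # Let the expat-based reader fetch the DTD named in the DOCTYPE.
    reader.setFeature(feature_external_ges, True)
    reader.setContentHandler(collector)
    reader.setEntityResolver(FixedDTDResolver(dtd_bytes))
    reader.parse(io.BytesIO(xml_bytes))
    return "".join(collector.chunks)


sample = (b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'
          b'<!DOCTYPE dblp SYSTEM "dblp.dtd">\n'
          b'<dblp><incollection><author>Jos&eacute; A. Blakeley</author>'
          b'</incollection></dblp>')
sax_result = sax_text(sample, b'<!ENTITY eacute "&#233;">')
print(ascii(sax_result))
```

Without a custom resolver, the reader tries to open the system identifier itself, so parsing the XML by file name with dblp.dtd next to it may be all that is needed. Enabling external entities on untrusted input is risky, so this is only appropriate for files you trust.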


By the way, as a temporary workaround I can copy the relevant parts of the .dtd into the XML file itself, as in

<!DOCTYPE dblp [
    <!ENTITY Agrave  "&#192;" >
]>

but that doesn't really solve the problem in a general way.
