Parsing XML containing a character reference

Developer https://www.devze.com 2022-12-27 08:12 Source: Internet
The XML I'm trying to parse contains a control character 0x2 inside CDATA. I tried to replace it with a character reference, which led to the CDATA looking like:

CDATA section----charcter reference----CDATA section

Now if I try to parse it, I get an error message saying: org.xml.sax.SAXParseException: Content is not allowed in prolog.

The original XML looked like:

<?xml version="1.1" encoding="UTF-16"?><CELL><![CDATA[ABCD&#2;EFGH]]></CELL>

I modified it to:

<?xml version="1.1" encoding="UTF-16"?><CELL><![CDATA[ABCD]]>&#2;<![CDATA[EFGH]]></CELL>


Character references are not resolved inside CDATA sections, which is why your original example does not work. That the modified example also fails looks like a SAX parser bug to me. Maybe the parser does not accept an invisible byte order mark (BOM) before the XML prolog, which starts with <?, although it should.
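To illustrate the first point, here is a minimal sketch using the JDK's built-in DOM parser. The class name CdataDemo and the helper rootText are made up for this example, and the second input (with &#67;) is only there for contrast; it is not from the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CdataDemo {
    // Hypothetical helper: parses an XML string and returns the text of the root element.
    static String rootText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Inside CDATA the reference is kept as the literal characters & # 2 ;
        System.out.println(rootText("<CELL><![CDATA[ABCD&#2;EFGH]]></CELL>"));
        // Outside CDATA a character reference is resolved (&#67; is the letter C)
        System.out.println(rootText("<CELL>AB&#67;D</CELL>"));
    }
}
```

The first call returns the text ABCD&#2;EFGH unchanged, while the second returns ABCD.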

As a workaround, you could consume the BOM yourself before feeding the stream to the parser. A markable stream would work for this: mark the stream, read the BOM, and reset the stream to its mark if there was no BOM. I haven't tried this; it's just a guess.
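A rough sketch of that mark/read/reset idea could look like the following. It is untested against the asker's file and makes some assumptions: the class BomSkipper and method skipUtf8Bom are hypothetical names, and only the UTF-8 BOM (EF BB BF) is handled. A UTF-16 document would need different handling, since the parser may rely on that BOM to detect the byte order.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

final class BomSkipper {
    // Assumption: only the UTF-8 BOM is considered in this sketch.
    private static final int[] UTF8_BOM = {0xEF, 0xBB, 0xBF};

    // Returns a stream positioned just after a leading UTF-8 BOM, or at the start if there is none.
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        // Wrap the stream if it cannot be marked.
        InputStream s = in.markSupported() ? in : new BufferedInputStream(in);
        s.mark(UTF8_BOM.length);
        for (int expected : UTF8_BOM) {
            if (s.read() != expected) {
                s.reset(); // no BOM: rewind to the mark
                return s;
            }
        }
        return s; // BOM consumed
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        InputStream s = skipUtf8Bom(new ByteArrayInputStream(data));
        System.out.println((char) s.read()); // first byte after the BOM
    }
}
```

The returned stream can then be handed to the parser, for example wrapped in an org.xml.sax.InputSource.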

BTW: Your question would be better received if you fixed the typo in the intro: write "character reference" instead of "charcter reference". At first I thought the missing "a" was related to your question.
