XML parsing with SAX | how to handle special characters?

Developer https://www.devze.com 2022-12-23 21:07 Source: web
We have a Java application that pulls data from SAP, parses it, and renders it to the users. The data is pulled using the JCo connector.

Recently we were thrown an exception:

org.xml.sax.SAXParseException: Character reference "&#00" is an invalid XML character.

So, we are planning to write a new level of indirection where ALL special/illegal characters are replaced BEFORE parsing the XML.

My questions here are :

  1. Is there an existing (open source) utility that replaces illegal characters in XML?
  2. If I had to write such a utility myself, how should I handle them?
  3. Why is the above exception thrown?

Thank You.


From my point of view, the source (SAP) should do the replacement. Otherwise, what it transmits to your program may look like XML, but it is not.

While replacing '&' with '&amp;' can be done with a simple String.replaceAll(...) on the string returned by the toXML() call, other characters are harder to replace ('<' and '>', for example), because they also occur as legitimate markup.
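To illustrate why the order of replacements matters, here is a minimal sketch of escaping a piece of text content before it is embedded in XML. The class name XmlTextEscaper and the method escapeText are hypothetical, and the sketch assumes you have access to the raw text values rather than the finished document. Applying it to a whole XML string would also escape the real tags and double-escape existing entities.

```java
// Hypothetical helper: escapes text content only, not a complete XML document.
final class XmlTextEscaper {
    static String escapeText(String text) {
        // '&' must be replaced first; otherwise the '&' in "&lt;" and "&gt;"
        // produced by the later steps would be escaped a second time.
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }
}
```

For example, XmlTextEscaper.escapeText("a < b && c > d") returns "a &lt; b &amp;&amp; c &gt; d".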

regards Guillaume


It sounds like a bug in their escaping. Depending on context you might be best off just writing your own version of their XMLWriter class that uses a real XML library rather than trying to write your own XML utilities like the SAP developers did.

Alternatively, looking at the character code, &#00, you might be able to get away with a replace all on it with the empty string:

String goodXml = badXml.replaceAll("&#00;", "");
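If other illegal references may show up besides this one, the same idea can be generalised. The following is only a sketch: the class XmlCharRefCleaner and its methods are hypothetical names, and it assumes the bad data arrives as numeric character references (decimal or hexadecimal) terminated by a semicolon. It removes every reference whose code point falls outside the character range permitted by XML 1.0 and leaves valid references untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JCo or any XML library.
final class XmlCharRefCleaner {
    // Matches decimal (&#0;) and hexadecimal (&#x0;) character references.
    private static final Pattern CHAR_REF =
            Pattern.compile("&#(x[0-9a-fA-F]+|[0-9]+);");

    // True if the code point is allowed by the XML 1.0 "Char" production.
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String stripInvalidCharRefs(String xml) {
        Matcher m = CHAR_REF.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            int cp;
            try {
                cp = body.charAt(0) == 'x'
                        ? Integer.parseInt(body.substring(1), 16)
                        : Integer.parseInt(body, 10);
            } catch (NumberFormatException e) {
                cp = -1; // too large to be a code point: treat as invalid
            }
            // Keep valid references as they are, drop invalid ones.
            String replacement = isValidXml10(cp) ? m.group() : "";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Calling XmlCharRefCleaner.stripInvalidCharRefs("a&#00;b&#x41;c") yields "ab&#x41;c". Note that this works on the raw string, so it would also alter text inside CDATA sections, where such sequences are literal characters rather than references.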


I've had a related, but opposite problem, where I was trying to insert character 1 into the output of an XSLT transformation. I considered post-processing to replace a marker with the zero, but instead chose to use an xsl:param.

If I were in your situation, I'd either come up with a bespoke encoding that replaces the characters which are invalid in XML and handle them as special cases in your parsing, or, if possible, replace them with whitespace.
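The whitespace option could look roughly like the sketch below. It assumes the invalid characters are present as raw characters in a string (for example, a field value read from the source system) rather than as character references, and the names XmlTextSanitizer and replaceInvalidChars are hypothetical.

```java
// Hypothetical helper: replaces characters that XML 1.0 does not allow.
final class XmlTextSanitizer {
    static String replaceInvalidChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // Read a full code point so surrogate pairs stay together.
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            boolean valid = cp == 0x9 || cp == 0xA || cp == 0xD
                    || (cp >= 0x20 && cp <= 0xD7FF)
                    || (cp >= 0xE000 && cp <= 0xFFFD)
                    || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (valid) {
                out.appendCodePoint(cp);
            } else {
                out.append(' '); // invalid character becomes a single space
            }
        }
        return out.toString();
    }
}
```

Tabs, line feeds, and carriage returns are kept because XML 1.0 allows them; other control characters such as NUL are replaced. This loses the original values, so a bespoke encoding is the better choice if the receiver needs to recover them.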

I don't have experience with JCO, so can't advise on how or where I'd replace the invalid characters.


You can encode/decode non-ASCII characters in XML by using the escapeXml method of the Apache Commons Lang class StringEscapeUtils. See:

http://commons.apache.org/lang/api-2.4/index.html

To read about how XML character references work, search for "numeric character references" on Wikipedia.
