Java XML Parsing: Avoid entity reference resolution

Developer (https://www.devze.com) 2023-03-31 12:28 Source: web
I am currently parsing XHTML documents with a DOM parser, like:

final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setValidating(false);

final DocumentBuilder db = dbf.newDocumentBuilder();
db.setEntityResolver(MY_ENTITY_RESOLVER);
db.setErrorHandler(MY_ERROR_HANDLER);
...
final Document doc = db.parse(inputSource);

And my problem is that when my document contains an entity reference like, for example:

<p>&euro;</p>

My parser creates a Text node for that content containing "€" instead of "&euro;". That is, it resolves the entity exactly the way it is supposed to (the XHTML 1.0 Strict DTD links to the ENTITIES Latin1 DTD, which in turn establishes the equivalence of "&euro;" with "€").

The problem is, I don't want the parser to do that. I would like to keep the "&euro;" text unmodified.

I've already tried:

final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setExpandEntityReferences(false);

But:

  1. I don't like this because I fear it might stop some parser implementations from navigating from the XHTML 1.0 Strict DTD to the ENTITIES Latin1 DTD, so that they no longer treat "&euro;" as a declared entity.

  2. When I do this, it weirdly creates two nodes: a "euro" Entity node, and a Text node with the "€" symbol after it.
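
To make point 2 easier to reproduce, here is a minimal sketch that parses a one-line document with setExpandEntityReferences(false) and dumps the resulting tree. It assumes an inline internal subset that declares "euro" itself, so nothing is fetched from the network; the class and method names (EntityReferenceDump, parseWithoutExpansion, dump) are made up for illustration. The exact node layout can differ between parser implementations and JDK versions, so no expected output is shown.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for inspecting what the parser builds; not part of any library.
final class EntityReferenceDump {

    // Parses the given XML with entity expansion switched off in the DOM builder.
    static Document parseWithoutExpansion(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setValidating(false);
            dbf.setExpandEntityReferences(false);
            return dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    // One-line description of a node, enough to tell the node kinds apart.
    static String describe(Node n) {
        switch (n.getNodeType()) {
            case Node.ELEMENT_NODE:
                return "Element <" + n.getNodeName() + ">";
            case Node.TEXT_NODE:
                return "Text \"" + n.getNodeValue() + "\"";
            case Node.ENTITY_REFERENCE_NODE:
                return "EntityReference &" + n.getNodeName() + ";";
            default:
                return "Node type " + n.getNodeType() + " (" + n.getNodeName() + ")";
        }
    }

    // Prints the subtree rooted at the given node, indenting one level per depth.
    static void dump(Node node, String indent) {
        System.out.println(indent + describe(node));
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            dump(c, indent + "  ");
        }
    }

    public static void main(String[] args) {
        // Assumption: the entity is declared inline instead of via the XHTML DTD.
        String xml = "<!DOCTYPE p [<!ENTITY euro \"&#8364;\">]><p>&euro;</p>";
        dump(parseWithoutExpansion(xml).getDocumentElement(), "");
    }
}
```

Running it against the real XHTML DTD would also need an EntityResolver that serves the DTD files locally, as in the snippet at the top.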

Any ideas? Is it possible to configure this in a DOM parser without resorting to preprocessing the XHTML and replacing every "&" symbol with something else?

Solutions could be for a DOM parser or also a SAX one, I wouldn't mind using SAX parsing and then creating my DOM using a transformation...

Also, I cannot switch to a non-standard XML parsing library. No jdom, no jsoup, no HtmlCleaner, etc.

Thanks a lot.


The approach I took was to replace any entities with a unique marker that is treated as plain text by Xerces. Once converted into a Document object, the markers are replaced with Entity Reference objects.

See the convertStringToDocument() function in http://sourceforge.net/p/commonclasses/code/14/tree/trunk/src/com/redhat/ecs/commonutils/XMLUtilities.java
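
For readers who do not want to dig through the linked file, the idea can be sketched roughly as follows. This is not the code from XMLUtilities.java, just a minimal illustration using only the standard JAXP/DOM API. The marker syntax ({{entity:name}}) and all class and method names are assumptions made up for this example, and the marker must never occur in the real input.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

// Hypothetical sketch of the marker approach; names and marker syntax are assumptions.
final class EntityMarkerSketch {
    // Named references except the five predefined ones; numeric references (&#...;) never match.
    private static final Pattern ENTITY =
            Pattern.compile("&(?!(?:amp|lt|gt|apos|quot);)([A-Za-z_][A-Za-z0-9._-]*);");
    private static final Pattern MARKER =
            Pattern.compile("\\{\\{entity:([A-Za-z_][A-Za-z0-9._-]*)\\}\\}");

    // Step 1: swap each named reference for a plain-text marker the parser leaves alone.
    static String protect(String xml) {
        return ENTITY.matcher(xml).replaceAll("{{entity:$1}}");
    }

    // Step 2: parse the protected text, then turn the markers back into EntityReference nodes.
    static Document parseKeepingEntities(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setValidating(false);
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(protect(xml))));
            restore(doc);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document", e);
        }
    }

    // Walks the tree; the next sibling is captured first because restoreText edits the child list.
    static void restore(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling();
            if (child.getNodeType() == Node.TEXT_NODE) {
                restoreText((Text) child);
            } else {
                restore(child);
            }
            child = next;
        }
    }

    // Splits one Text node into Text / EntityReference / Text pieces around each marker.
    private static void restoreText(Text text) {
        String data = text.getData();
        Matcher m = MARKER.matcher(data);
        if (!m.find()) {
            return;
        }
        Document doc = text.getOwnerDocument();
        Node parent = text.getParentNode();
        int pos = 0;
        do {
            if (m.start() > pos) {
                parent.insertBefore(doc.createTextNode(data.substring(pos, m.start())), text);
            }
            parent.insertBefore(doc.createEntityReference(m.group(1)), text);
            pos = m.end();
        } while (m.find());
        if (pos < data.length()) {
            parent.insertBefore(doc.createTextNode(data.substring(pos)), text);
        }
        parent.removeChild(text);
    }

    public static void main(String[] args) {
        Document doc = parseKeepingEntities("<p>Price: 5 &euro; only</p>");
        for (Node c = doc.getDocumentElement().getFirstChild(); c != null; c = c.getNextSibling()) {
            System.out.println(c.getNodeType() + " " + c.getNodeName());
        }
    }
}
```

Two caveats with this sketch: the replacement is purely textual, so references inside attribute values, comments and CDATA sections are turned into markers too and are not restored, and how the EntityReference nodes are written back out depends on the serializer in use, so that is worth checking separately.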
