
How do I replace HTML escapes in an input stream before parsing it to XML?

Developer (https://www.devze.com), 2023-01-16 18:20. Source: Internet

I have an input stream which is parsed as XML and then read. When I get down to some text elements in the XML, they are truncated. I believe the parser is dropping everything after escaped HTML such as &amp;. Here is the code that gets the input stream and then reads the text element.

String hvurl = "https://www.mysite.com/api/a/" + answerId;
in = OpenHttpConnection(hvurl); 

Document doc = null;
DocumentBuilderFactory dbf = 
    DocumentBuilderFactory.newInstance();
DocumentBuilder db;

try {
    db = dbf.newDocumentBuilder();
    doc = db.parse(in);

} catch (ParserConfigurationException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
} catch (SAXException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}     

...
//Now when I get the text element, it's truncated
//---get the <varietaltitle> elements under the <varietal> 
// element---
NodeList varietalTitleNodes = 
    (varietalElement).getElementsByTagName("varietaltitle");

//---convert a Node into an Element---
Element varietalTitleElement = (Element) varietalTitleNodes.item(0);

//---get all the child nodes under the <varietaltitle> element---
NodeList varietalTitleTextNodes = 
    ((Node) varietalTitleElement).getChildNodes();

//---retrieve the text of the <varietaltitle> element---
strVarietalTitle = ((Node) varietalTitleTextNodes.item(0)).getNodeValue();


I can't tell exactly where the problem occurs. My guess is that you need the normalize() method, as below.

Try this:

 varietalTitleElement.normalize(); // normalize() is a Node method, not a String method
 strVarietalTitle = varietalTitleElement.getFirstChild().getNodeValue();

From the documentation for Node.normalize():

Puts all Text nodes in the full depth of the sub-tree underneath this Node, including attribute nodes, into a "normal" form where only structure (e.g., elements, comments, processing instructions, CDATA sections, and entity references) separates Text nodes, i.e., there are neither adjacent Text nodes nor empty Text nodes. This can be used to ensure that the DOM view of a document is the same as if it were saved and re-loaded, and is useful when operations (such as XPointer [XPointer] lookups) that depend on a particular document tree structure are to be used. If the parameter "normalize-characters" of the DOMConfiguration object attached to the Node.ownerDocument is true, this method will also fully normalize the characters of the Text nodes. Note: In cases where the document contains CDATASections, the normalize operation alone may not be sufficient, since XPointers do not differentiate between Text nodes and CDATASection nodes.
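
To make the effect of normalize() concrete, here is a minimal sketch. It assumes the truncation happens because the element's text ends up in several adjacent Text nodes (which a parser may produce around an entity reference); the class name NormalizeSketch and the sample text are made up for illustration, and the split is built by hand rather than taken from the real feed.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class NormalizeSketch {
    // Builds a <varietaltitle> element whose text is split across two
    // adjacent Text nodes (hand-made stand-in for a parser that splits text).
    static Element buildSplitTitle() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            Element title = doc.createElement("varietaltitle");
            title.appendChild(doc.createTextNode("Cabernet "));
            title.appendChild(doc.createTextNode("& Merlot"));
            doc.appendChild(title);
            return title;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no DOM implementation available", e);
        }
    }

    public static void main(String[] args) {
        Element title = buildSplitTitle();

        // Reading only the first child reproduces the "truncated" symptom.
        System.out.println(title.getFirstChild().getNodeValue());

        // getTextContent() concatenates every Text node under the element.
        System.out.println(title.getTextContent());

        // normalize() merges the adjacent Text nodes into one.
        title.normalize();
        System.out.println(title.getFirstChild().getNodeValue());
    }
}
```

If this is indeed the cause, either calling normalize() on the element first or reading getTextContent() instead of the first child's value should return the full text.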


An XML parser should cope with character entities such as "&amp;" ... assuming that's what you are talking about.
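
As a quick check of that claim, the following sketch parses a small hand-written document containing &amp; and reads the element text with getTextContent(). The class name EntityParseSketch and the sample XML are assumptions for illustration, not the poster's actual data.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class EntityParseSketch {
    // Parses the given XML string and returns the full text of the first
    // <varietaltitle> element, however many Text nodes it is stored in.
    static String titleText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element title =
                    (Element) doc.getElementsByTagName("varietaltitle").item(0);
            return title.getTextContent();
        } catch (IOException | SAXException | ParserConfigurationException e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<varietal><varietaltitle>Cabernet &amp; Merlot"
                + "</varietaltitle></varietal>";
        // The predefined entity &amp; is decoded to a literal '&'.
        System.out.println(titleText(xml));
    }
}
```

The five predefined XML entities (&amp;, &lt;, &gt;, &quot;, &apos;) need no declaration, so text after them should not be lost by the parser itself.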

One possibility is that your input contains particular named entities that the XML parser doesn't know about.
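
If that is the case (for example HTML-only names such as &nbsp;), one option is to rewrite them as numeric character references before parsing. The sketch below is only an illustration under several assumptions: the stream is UTF-8, InputStream.readAllBytes() is available (Java 9 or later), and the three-entry table, the class name HtmlEntitySketch, and the sample text are all hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class HtmlEntitySketch {
    // Hypothetical, deliberately tiny table; extend it with whichever HTML
    // entities actually appear in the feed. XML's own five are left alone.
    private static final Map<String, String> HTML_TO_NUMERIC = new LinkedHashMap<>();
    static {
        HTML_TO_NUMERIC.put("&nbsp;", "&#160;");
        HTML_TO_NUMERIC.put("&copy;", "&#169;");
        HTML_TO_NUMERIC.put("&eacute;", "&#233;");
    }

    // Replaces known HTML named entities with numeric character references.
    static String replaceHtmlEntities(String xml) {
        String out = xml;
        for (Map.Entry<String, String> e : HTML_TO_NUMERIC.entrySet()) {
            out = out.replace(e.getKey(), e.getValue());
        }
        return out;
    }

    // Reads the whole stream (assumed UTF-8), fixes the entities, then parses.
    static Document parse(InputStream in)
            throws IOException, SAXException, ParserConfigurationException {
        String raw = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(replaceHtmlEntities(raw))));
    }

    // Convenience wrapper for the demo: returns the root element's text.
    static String rootText(String xml) {
        try {
            InputStream in =
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
            return parse(in).getDocumentElement().getTextContent();
        } catch (IOException | SAXException | ParserConfigurationException e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<varietaltitle>Pinot&nbsp;Noir &amp; Ros&eacute;</varietaltitle>";
        System.out.println(rootText(xml));
    }
}
```

A plain string replacement like this also rewrites matches inside CDATA sections, and it buffers the whole response in memory, so it is best suited to small, well-understood feeds.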
