I have a Java program that makes a request to a web service that I cannot modify. The response to one of the requests can be extremely large, to the point where the heap runs out of memory if I try to parse it into a Document object. To work around this, I'm reading the response into a byte[] buffer chunk by chunk and writing it to disk (a sketch of that step follows the loop below). I then planned to scan the file line by line and build a Document object out of each element that I find (these are the only elements I need out of the response):
StringBuilder sb = null;
String line;
while ((line = reader.readLine()) != null) {
    if (line.trim().equals("<bond>")) {
        sb = new StringBuilder(line);
    } else if (line.trim().equals("</bond>")) {
        sb.append(line); // include the closing tag so the fragment is well-formed
        // parse(String) treats its argument as a URI, so wrap the buffered XML
        // (org.xml.sax.InputSource, java.io.StringReader) instead
        Document doc = builder.parse(new InputSource(new StringReader(sb.toString())));
        // Process doc
    } else {
        sb.append(line);
    }
}
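For reference, the download step mentioned above is roughly this (a sketch: the stream in comes from the service connection, and the buffer size and file name are just examples):

// in: java.io.InputStream holding the web-service response.
// Copy to disk in 8 KB chunks so the full body never sits in the heap.
byte[] buffer = new byte[8192];
OutputStream out = new FileOutputStream("response.xml");
try {
    int read;
    while ((read = in.read(buffer)) != -1) {
        out.write(buffer, 0, read);
    }
} finally {
    out.close();
    in.close();
}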
Unfortunately, it seems that the newlines were converted to spaces in the response, so everything is on one huge line. One solution I'm considering is using SAX to handle the parsing and build my Document pieces in the same manner. Does anyone have another solution, or is this my best bet?
Thanks, Jared
There are different APIs for parsing XML documents in Java. There is the DOM API, which you seem to be using. It reads the whole XML document and converts it to a tree of nodes; you get a Document object which contains all these nodes. The advantage of the DOM API is that it is fairly easy to use, but the disadvantage is that all those nodes can take up a lot of memory if the XML is large, as you have noticed.
There is also the SAX API, which works differently: it is callback-based. You tell the XML parser that you want to be notified whenever it encounters an opening tag, a closing tag, or character data in the XML file. You then decide in your callback methods what to do, and you store only the data that you actually need. The advantage is that this scales to large documents, because the whole XML tree never needs to reside in memory. The disadvantage is that this API is lower-level and more cumbersome to use.
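As an illustration, a minimal handler for the <bond> case in your question could look like this (the file name is an assumption, and this simple version flattens the text of any child elements; a real handler would track nesting):

import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Collects the character data of each <bond> element as the parser streams
// through the file; only one element's worth of text is held at a time.
public class BondHandler extends DefaultHandler {
    private StringBuilder current;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("bond".equals(qName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("bond".equals(qName)) {
            process(current.toString()); // handle one bond, then discard it
            current = null;
        }
    }

    private void process(String bondText) {
        // application-specific handling goes here
    }

    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new File("response.xml"), new BondHandler());
    }
}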
There is also StAX, which was designed to be something in between the DOM and SAX APIs: a pull parser that lets your code ask for the next event instead of being called back.
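StAX combines nicely with your original plan of building one small Document per element: the JAXP identity transformer can copy a single element subtree from the stream into a fresh DOM tree. A sketch, assuming the file written to disk is named response.xml:

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.stax.StAXSource;
import org.w3c.dom.Document;

public class BondReader {
    public static void main(String[] args) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream("response.xml"));
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        while (reader.hasNext()) {
            // Advance the cursor until it sits on a <bond> start tag
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "bond".equals(reader.getLocalName())) {
                // The identity transform consumes exactly this element's subtree
                DOMResult fragment = new DOMResult();
                identity.transform(new StAXSource(reader), fragment);
                Document doc = (Document) fragment.getNode();
                // Process doc; only one <bond> tree is in memory at a time
            }
        }
        reader.close();
    }
}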
If you need to process large XML documents, it would probably be better to use the SAX or StAX API instead of the DOM API.
Between the SAX and DOM parsers, the SAX parser is probably your best bet: it doesn't store the XML in memory, so it will be able to handle larger XML files.
If the response is very large then yes, a SAX parser would be suitable; otherwise you will run out of memory again when building the DOM structure.
I can also recommend the Smooks framework for transforming the XML into other forms. It is well suited to very large data sets and comes with a lot of functionality pre-built (http://www.smooks.org). Smooks lets you specify which parts of the XML structure to use to produce new Java objects, XML, or other output.
I think using a JDOM SAXBuilder and XPath may be better than a while loop. Something along these lines:
// JDOM: org.jdom.input.SAXBuilder, org.jdom.xpath.XPath, java.io.StringReader
Document doc = new SAXBuilder().build(new StringReader(xmlStr));
XPath xPath = XPath.newInstance("/*/YourElement");
Element ele = (Element) xPath.selectSingleNode(doc); // returns Object, so cast
ele.getChild("ChildElement");
You could look at a library such as Nux which would enable you to combine XML streaming with XPath to extract just the values you want. It might be worth looking into rather than trying to write something custom.
If the heap size is a problem, you can try to increase it with the following options:
java -Xms64m -Xmx256m
This gives you an initial heap size of 64 MB and a maximum of 256 MB; you can use other values. It has the advantage of not requiring any code change.