XML library optimized for big XML with memory constraints

Developer (https://www.devze.com), 2023-03-31 19:38. Source: Internet
I need to handle big XML files, but I only want to make a relatively small set of changes to them. I also want the program to adhere to strict memory constraints: it must never use more than, say, 300 MB of RAM.

Is there a library that lets me avoid keeping the whole DOM in memory and instead parses the XML on the go as I traverse the DOM?

I know you can do that with a callback-based approach, but I don't want that. I want to have my cake and eat it too: I want to use the DOM API but have each element parsed lazily, so that existing code that uses the DOM API won't have to change.

There are two possible approaches I thought of for this problem:

  1. Parse the XML lazily: each call to getChildren() parses the next bit of XML.
  2. Parse the entire XML tree, but cache on disk whatever isn't being used right now.

Either approach is acceptable. Is there an existing solution?

I'm looking for a native solution, but I'd also be interested in hearing about libraries in other languages.


It sounds like what you want is something similar to the Streaming API for XML (StAX).

While it does not use the standard DOM API, it is similar in principle to your "getChildren()" approach. It does not have the memory overheads of the DOM approach, nor the complexity of the callback (SAX) approach.
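As a rough illustration of the pull style, here is a minimal Python sketch. Python's standard library does not ship StAX itself, so this swaps in `xml.dom.pulldom`, which is similar in spirit: the program pulls events one at a time and asks for a real DOM subtree only for the elements it cares about. The sample document, the `item`/`name`/`price` element names, and the `text_of` and `expensive_item_names` helpers are all made up for the example.

```python
import io
from xml.dom import pulldom

# Hypothetical sample data; a real input would be a large file opened in binary mode.
SAMPLE = b"""<catalog>
  <item id="1"><name>alpha</name><price>10</price></item>
  <item id="2"><name>beta</name><price>250</price></item>
  <item id="3"><name>gamma</name><price>999</price></item>
</catalog>"""


def text_of(element):
    """Concatenate the direct text children of a DOM element."""
    return "".join(
        child.data
        for child in element.childNodes
        if child.nodeType == child.TEXT_NODE
    )


def expensive_item_names(stream, threshold):
    """Pull events one by one; build a DOM subtree only for each <item>."""
    names = []
    events = pulldom.parse(stream)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "item":
            # Reads ahead to the matching end tag and attaches the children,
            # so `node` can be used through the ordinary DOM API.
            events.expandNode(node)
            price = int(text_of(node.getElementsByTagName("price")[0]))
            if price > threshold:
                names.append(text_of(node.getElementsByTagName("name")[0]))
    return names


result = expensive_item_names(io.BytesIO(SAMPLE), 100)
print(result)
```

Only one `<item>` subtree is expanded at a time, so memory use should track the largest item rather than the whole file, as long as the code does not hold on to earlier nodes.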

There are a number of implementations linked on the Wikipedia page for StAX, most of which are for Java, but there are a couple for C++ too: Ambiera irrXML and Llamagraphics LlamaXML.


edit: Since you mention "small changes" to the document, if you don't need to use the document contents for anything else, you might also consider Streaming Transformations for XML (STX) (described in this XML.com introduction to STX). STX is to XSLT something like what SAX/StAX is to DOM.


"I want to use the DOM API, but to parse each element lazily, so that existing code that uses the DOM API won't have to change."

You want a streaming DOM-style API? Such a thing generally does not exist, and for good reason: it would be difficult if not impossible to make it actually work.

XML is generally intended to be read one-way: from front to back. What you're suggesting would require being able to random-access an XML file.

I suppose you could build a table of elements, with file offsets pointing to where each element is in the file. But at that point, you've more or less already read and parsed the file. Unless most of your data is in text elements (which is entirely possible), you may as well be using a DOM.
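The offset-table idea can be sketched in Python with the standard `xml.parsers.expat` module, whose parser object exposes a `CurrentByteIndex` attribute. The sample data, the `rec` element name, and the `index_elements` helper are assumptions for illustration. Building the index still takes one full pass over the file, which is exactly the cost described above.

```python
import io
import xml.parsers.expat

# Hypothetical sample data standing in for a large file.
SAMPLE = b'<root><rec id="a">one</rec><rec id="b">two</rec></root>'


def index_elements(stream, tag, chunk_size=64 * 1024):
    """Return the byte offset at which each <tag ...> start tag begins."""
    offsets = []
    parser = xml.parsers.expat.ParserCreate()

    def on_start(name, attrs):
        if name == tag:
            # Inside a handler, CurrentByteIndex is the position in the
            # input of the markup that triggered the callback.
            offsets.append(parser.CurrentByteIndex)

    parser.StartElementHandler = on_start

    # Feed the input in chunks so the whole file is never held in memory.
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            parser.Parse(b"", True)  # signal end of input
            break
        parser.Parse(chunk, False)
    return offsets


offsets = index_elements(io.BytesIO(SAMPLE), "rec")
print(offsets)
```

With such a table, later code could `seek()` to an offset and parse just that element, but only the table itself is small; the text and attributes still live in the file.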

Really, you would be much better off just rewriting your existing code to use an xmlReader or SAX-style API.


How to do streaming transformations is a big, open, unsolved problem. There are numerous partial solutions, depending on what restrictions you are prepared to accept. Current releases of Saxon-EE, for example, have the capability to do some XSLT transformations in a streaming fashion: see http://www.saxonica.com/html/documentation/sourcedocs/streaming.html. Also, as already mentioned, there is STX (though implementations are not especially mature).

Your title suggests you want to write the transformation in C++. That's severely limiting, because it pretty well means the programmer has to cope with the complexities rather than leaving it to the transformation engine. You can of course hand-code streaming transformations using SAX-like or StAX-like parser APIs, but both are hard work, and each case will need to be approached from scratch.
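To give an idea of what hand-coding looks like, here is a minimal sketch of a streaming edit built on a SAX-style API. It uses Python's standard `xml.sax` module rather than C++, only to keep the example short and dependency-free. The `SetAttribute` class, the `transform` helper, and the `item`/`status` names are invented for the example; it is one possible shape for this kind of code, not a general solution.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl


class SetAttribute(XMLGenerator):
    """Copy input events to the output, forcing one attribute on matching elements."""

    def __init__(self, out, tag, attr, value):
        super().__init__(out, encoding="utf-8")
        self._tag = tag
        self._attr = attr
        self._value = value

    def startElement(self, name, attrs):
        if name == self._tag:
            # Copy the incoming attributes and override (or add) one of them.
            updated = dict(attrs.items())
            updated[self._attr] = self._value
            attrs = AttributesImpl(updated)
        super().startElement(name, attrs)


def transform(src, dst):
    """Stream src to dst, setting status="checked" on every <item>."""
    xml.sax.parse(src, SetAttribute(dst, "item", "status", "checked"))


# Hypothetical sample data; real use would pass open file objects instead.
SAMPLE = b'<list><item status="new">a</item><item>b</item></list>'
out = io.StringIO()
transform(io.BytesIO(SAMPLE), out)
result = out.getvalue()
print(result)
```

The parser reads the input incrementally and each event is written out as soon as it arrives, so nothing resembling a full tree is built. The catch is the one noted above: every new kind of change needs its own handler logic, and this simple handler does not carry over comments or a DOCTYPE.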

Google for "streaming XML transformation".
