How to transform huge xml files in java?_问答_开发者

As the title says it, I have a huge xml file (GBs)

<root>  
<keep>  
   <stuff>  ...  </stuff>  
   <morestuff> ... 开发者_如何学JAVA</morestuff>  
</keep>  
<discard>  
   <stuff>  ...  </stuff>  
   <morestuff> ... </morestuff>
</discard>  
</root>

and I'd like to transform it into a much smaller one which retains only a few of the elements.

My parser should do the following:

1. Parse through the file until a relevant element starts.

2. Copy the whole relevant element (with children) to the output file. go to 1.

step 1 is easy with SAX and impossible for DOM-parsers.

step 2 is annoying with SAX, but easy with the DOM-Parser or XSLT.

so what? - is there a neat way to combine SAX and DOM-Parser to do the task?

StAX would seem to be one obvious solution: it's a pull parser rather than either the "push" of SAX or the "buffer the whole thing" approach of DOM. Can't say I've used it though. A "StAX tutorial" search may come in handy :)

Yes, just write a SAX content handler, and when it encounters a certain element, you build a dom tree on that element. I've done this with very large files, and it works very well.

It's actually very easy: As soon as you encounter the start of the element you want, you set a flag in your content handler, and from there on, you forward everything to the DOM builder. When you encounter the end of the element, you set the flag to false, and write out the result.

(For more complex cases with nested elements of the same element name, you'll need to create a stack or a counter, but that's still quite easy to do.)

I made good experiences with STX (Streaming Transformations for XML). Basically, it is a streamed version of XSLT, well suited to parsing huge amounts of data with minimal memory footprint. It has an implementation in Java named Joost.

It should be easy to come up with a STX transform that ignores all elements until the element matches a given XPath, copies that element and all its children (using an identity template within a template group), and continues to ignore elements until the next match.

UPDATE

I hacked together a STX transform that does what I understand you want. It mostly depends on STX-only features like template groups and configurable default templates.

<stx:transform xmlns:stx="http://stx.sourceforge.net/2002/ns"
    version="1.0" pass-through="none" output-method="xml">
    <stx:template match="element/child">
        <stx:process-self group="copy" />
    </stx:template>
    <stx:group name="copy" pass-through="all">
    </stx:group>
</stx:transform>

The pass-through="none" at the stx:transform configures the default templates (for nodes, attributes etc.) to produce no output, but process child elements. Then the stx:template matches the XPath element/child (this is the place where you put your match expression), it "processes self" in the "copy" group, meaning that the matching template from the group name="copy" is invoked on the current element. That group has pass-though="all", so the default templates copy their input and process child elements. When the element/child element is ended, control is passed back to the template that invoked process-self, and the following elements are ignored again. Until the template matches again.

The following is an example input file:

<root>
    <child attribute="no-parent, so no copy">
    </child>
    <element id="id1">
        <child attribute="value1">
            text1<b>bold</b>
        </child>
    </element>
    <element id="id2">
        <child attribute="value2">
            text2
            <x:childX xmlns:x="http://x.example.com/x">
            <!-- comment -->
                yet more<b i="i" x:i="x-i" ></b>
            </x:childX>
        </child>
    </element>
</root>

This is the corresponding output file:

<?xml version="1.0" encoding="UTF-8"?>
<child attribute="value1">
            text1<b>bold</b>
        </child><child attribute="value2">
            text2
            <x:childX xmlns:x="http://x.example.com/x">
            <!-- comment -->
                yet more<b i="i" x:i="x-i" />
            </x:childX>
        </child>

The unusual formatting is a result of skipping the text nodes containing newlines outside the child elements.

Since you're talking about GB's, I would rather prioritize the memory usage in the consideration. SAX needs about 2 times of memory as the document big is, while DOM needs it to be at least 5 times. So if your XML file is 1GB big, then DOM would require a minimum of 5GB of free memory. That's not funny anymore. So SAX (or any variant on it, like StAX) is the best option here.

If you want the most memory efficient approach, look at VTD-XML. It requires only a little more memory than the file big is.

Have a look at StAX, this might be what you need. There's a good introduction on IBM Developer Works.

For such a large XML document something with a streaming architecture, like Omnimark, would be ideal.

It wouldn't have to be anything complex either. An Omnimark script like what's below could give you what you need:

process

submit #main-input

macro upto (arg string) is
    ((lookahead not string) any)*
macro-end

find (("<keep") upto ("</keep>") "</keep>")=>keep
    output keep

find any

You can do this quite easily with an XMLEventReader and several XMLEventWriters from the javax.xml.stream package.