Doing XML extracts with XSLT without having to read the whole DOM tree into memory?

I have a situation where I want to extract some information from some very large but regular XML files (I just had to do it with a 500 MB file), and where XSLT would be perfect.

Unfortunately, the XSLT implementations I am aware of (except the most expensive version of Saxon) do not support reading in only the necessary part of the DOM; they read in the whole tree. This causes the computer to swap to death.

The XPath in question is

//m/e[contains(.,'foobar')]

so it is essentially just a grep.

Is there an XSLT implementation which can do this? Or an XSLT implementation which given suitable "advice" can do this trick of pruning away the parts in memory which will not be needed again?

I'd prefer a Java implementation but both Windows and Linux are viable native platforms.

EDIT: The input XML looks like:

<log>
<!-- Fri Jun 26 12:09:27 CEST 2009 -->
<e h='12:09:27,284' l='org.apache.catalina.session.ManagerBase' z='1246010967284' t='ContainerBackgroundProcessor[StandardEngine[Catalina]]' v='10000'>
<m>Registering Catalina:type=Manager,path=/axsWHSweb-20090626,host=localhost</m></e>
<e h='12:09:27,284' l='org.apache.catalina.session.ManagerBase' z='1246010967284' t='ContainerBackgroundProcessor[StandardEngine[Catalina]]' v='10000'>
<m>Force random number initialization starting</m></e>
<e h='12:09:27,284' l='org.apache.catalina.session.ManagerBase' z='1246010967284' t='ContainerBackgroundProcessor[StandardEngine[Catalina]]' v='10000'>
<m>Getting message digest component for algorithm MD5</m></e>
<e h='12:09:27,284' l='org.apache.catalina.session.ManagerBase' z='1246010967284' t='ContainerBackgroundProcessor[StandardEngine[Catalina]]' v='10000'>
<m>Completed getting message digest component</m></e>
<e h='12:09:27,284' l='org.apache.catalina.session.ManagerBase' z='1246010967284' t='ContainerBackgroundProcessor[StandardEngine[Catalina]]' v='10000'>
<m>getDigest() 0</m></e>
......
</log>

Essentially I want to select some m-nodes (and I know the XPath is wrong for that, it was just a quick hack), but maintain the XML layout.

EDIT: It appears that STX may be what I am looking for (I can live with another transformation language), and that Joost is an implementation of it. Any experiences?

EDIT: I found that Saxon 6.5.4 with -Xmx1500m could load my XML, so this allowed me to use my XPaths right now. This is just a lucky stroke so I'd still like to solve this generically - this means scriptable which in turn means no handcrafted Java filtering first.

EDIT: Oh, by the way. This is a log file very similar to what is generated by the log4j XMLLayout. The reason for XML is to be able to do exactly this, namely do queries on the log. This is the initial try, hence the simple question. Later I'd like to be able to ask more complex questions - therefore I'd like the query language to be able to handle the input file.

Consider VTD-XML. It is much more memory efficient. You can find an API here and benchmarks here.


Note that, according to the last benchmark graph, DOM needs at least 5x as much memory as the size of the XML file. That is really astonishing, isn't it?

As a bonus, VTD-XML is also faster at parsing and at XPath evaluation than DOM and the JDK (benchmark source: sourceforge.net).

You should be able to implement this without a full scan of the document. The '//' operator means "find an element at any level of the tree". It is pretty expensive to run, especially on a document of your size. If you optimize your XPath query or consider setting up match templates, the XSLT transformer may not need to load the entire document into memory.

Based on your XML sample, you are looking to match /log/e/m[ ... predicate ...]. Some XSLT processors should be able to optimize that so they do not scan the full document, which they could not do for //.

Since your XML document is pretty simple, it might be easier to not use XSLT at all. StAX is a great streaming API for handling large XML documents. Dom4j also has good support for XPath-like queries against large documents. Info on using dom4j for large documents is here: http://dom4j.sourceforge.net/dom4j-1.6.1/faq.html#large-doc

Sample from the above source:

SAXReader reader = new SAXReader();
reader.addHandler( "/ROWSET/ROW", 
    new ElementHandler() {
        public void onStart(ElementPath path) {
            // do nothing here...    
        }
        public void onEnd(ElementPath path) {
            // process a ROW element
            Element row = path.getCurrent();
            Element rowSet = row.getParent();
            Document document = row.getDocument();
            ...
            // prune the tree
            row.detach();
        }
    }
);

Document document = reader.read(url);

// The document will now be complete but all the ROW elements
// will have been pruned.
// We may want to do some final processing now
...

The Enterprise Edition of the Saxon XSLT Processor supports streaming of large documents for exactly this type of problem.

I had the same problem and did not want to write any Java code. I managed to solve this with STX via Joost.

As per spec:

an STX process may split a large XML document into smaller fragments, pass each of these fragments to an external filter (for example an XSLT processor), and combine the results into a large XML result document.

That's exactly what I needed. The largest XML file I have is 1.5 GB, and I had an XSLT template to process it. With the free edition of Saxon, processing consumed over 3 GB of memory. With Joost it took less than 90 MB.

My XML file contains a large list of products and each of them has a complex XML structure. So I did not want to re-implement my XSLT in STX, but wanted just to split processing per products, while using the same XSLT for each product.

Here are code details, hope it will be helpful for somebody.

Original XSLT file (it was the first XSLT I implemented, so sorry for bad usage of for-each statements):

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:fn="http://www.w3.org/2005/xpath-functions">
  <xsl:template match="/">
    <xsl:for-each select="Products/Product">
      <!-- Some XSL statements relative to "Product" element -->
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>

I converted it into following STX:

<?xml version="1.0" encoding="UTF-8"?>

<stx:transform version="1.0"
    output-method="text"
    output-encoding="UTF-8"
    xmlns:stx="http://stx.sourceforge.net/2002/ns">

  <stx:buffer name="xslt-product">

    <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:fn="http://www.w3.org/2005/xpath-functions">
      <xsl:template match="Product">
        <!-- The same XSL statements relative to "Product" element -->
      </xsl:template>
    </xsl:stylesheet>

  </stx:buffer>

  <stx:template match="/">
    <stx:process-children />
  </stx:template>

  <stx:template match="Product">
    <stx:process-self filter-method="http://www.w3.org/1999/XSL/Transform"
                      filter-src="buffer(xslt-product)" />
  </stx:template>

</stx:transform>

When running Joost I still had to add Saxon libraries, as I use functions in my XSLT, so I needed XSLT 2.0 support. In the end the command to run transformation was like this:

java -Djavax.xml.transform.TransformerFactory=net.sf.saxon.TransformerFactoryImpl -cp joost.jar:commons-discovery-0.5.jar:commons-logging-1.1.1.jar:saxon9he.jar net.sf.joost.Main my-source.xml my-convert.stx

The bottom line is that I can now run the transformation on low-memory servers without having to write any Java code or re-implement the original XSLT rules!

This is a stab in the dark, and maybe you'll laugh me out of the house.

Nothing stops you from connecting a SAX source to the input of your XSLT; and it is at least in theory easy enough to do your grep from a SAX stream without needing a DOM. So... wanna give that a try?
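To make that idea a bit more concrete, here is a rough sketch (untested against a real 500 MB log) of a SAX filter sitting in front of a JAXP Transformer. The class name GrepFilter and the method grep are made up for illustration. It assumes the layout from the question: a root element whose children are the e entries, each with an m child. It buffers one entry at a time and only forwards entries whose m text contains the search string, so the transformer only ever sees the much smaller filtered document. The identity transformer is used to keep the sketch self-contained; a real stylesheet could be passed to newTransformer instead. Whitespace between entries is dropped.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical helper: forwards only the entries whose <m> text contains the needle.
public class GrepFilter extends XMLFilterImpl {
    /** A recorded SAX event that can be replayed later. */
    private interface Event { void replay() throws SAXException; }

    private final String needle;
    private final List<Event> buffer = new ArrayList<>();
    private final StringBuilder message = new StringBuilder();
    private int depth = 0;          // nesting depth: root is 1, each entry is 2
    private boolean inM = false;

    public GrepFilter(XMLReader parent, String needle) {
        super(parent);
        this.needle = needle;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        depth++;
        if (depth == 1) {           // root element passes straight through
            super.startElement(uri, local, qName, atts);
            return;
        }
        if (depth == 2) {           // a new entry starts: reset the buffer
            buffer.clear();
            message.setLength(0);
        }
        if (depth == 3 && "m".equals(local)) inM = true;
        Attributes copy = new AttributesImpl(atts);   // the parser may reuse atts
        buffer.add(() -> super.startElement(uri, local, qName, copy));
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (depth < 2) return;      // drop whitespace between entries
        String s = new String(ch, start, length);
        if (inM) message.append(s);
        buffer.add(() -> super.characters(s.toCharArray(), 0, s.length()));
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (depth == 1) {
            super.endElement(uri, local, qName);
        } else {
            buffer.add(() -> super.endElement(uri, local, qName));
            if (depth == 3) inM = false;
            if (depth == 2) {       // entry complete: replay it only if it matched
                if (message.indexOf(needle) >= 0) {
                    for (Event e : buffer) e.replay();
                }
                buffer.clear();
            }
        }
        depth--;
    }

    /** Runs the filter in front of an identity transformation and returns the result. */
    public static String grep(String xml, String needle) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader filter = new GrepFilter(spf.newSAXParser().getXMLReader(), needle);
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new SAXSource(filter, new InputSource(new StringReader(xml))),
                    new StreamResult(out));
        return out.toString();
    }
}
```

For a real log you would hand the SAXSource an InputSource over the file and write to a StreamResult over an output file rather than going through strings. Comments in the input are not handled by this filter, so treat that part as an open question.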

Try the CAX parser from xponentsoftware. It is a fast XML parser built on Microsoft's XmlReader. It gives the full path as you parse each element, so you could check whether the path equals "m/e" and then check whether the text node contains "foo".

I'm not a Java guy, and I don't know if the tools I'd use to do this in .NET have analogs in the Java world.

To solve this problem in .NET, I'd derive a class from XmlReader, and have it only return the elements that I'm interested in. Then I can use the XmlReader as the input for any XML object, like an XmlDocument or an XslCompiledTransform. The XmlReader subclass basically pre-processes the input stream, making it look like a much, much smaller XML document to whatever class is using it to read from.

It seems like the technique described here is analogous. But I am, as I say, not a Java guy.

STX contains a streamable subset of XPath, called STXPath I believe; I should remember, because I co-wrote the spec :-)

You could definitely pick up Joost and extract the relevant bits, but note that STX didn't get wide industry acceptance, so you need to do some due diligence as to the current stability and support of the tool.

You could do it via STX/Joost as already suggested, but note that many XSLT implementations have a SAX streaming mode and don't need to keep everything in memory. You just need to make sure your XSLT file isn't looking along any of the wrong axes.

However, if I were you and really wanted performance, I'd do it in StAX. It's simple, standard and fast. It comes out of the box in Java 6, although you can also use Woodstox for a slightly better implementation.

For the XPath you listed the implementation is trivial. The downside is that you have more code to maintain, and it's just not as expressive and high-level as XPath, as you would have in Joost or XSLT.
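As a rough idea of what that could look like, here is a minimal sketch using the StAX event API from the JDK. The class name StaxGrep and the method grep are made up for illustration. It assumes the layout shown in the question (e elements that are never nested, each with an m child). Only one e element is held in memory at a time; it is written out if its m text contains the search string, and everything outside the e elements (root element, comments, whitespace) is copied through so the XML layout is kept.

```java
import java.io.Reader;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: streams the log and keeps only matching <e> entries.
public class StaxGrep {
    public static void grep(Reader input, Writer output, String needle)
            throws XMLStreamException {
        XMLEventReader in = XMLInputFactory.newInstance().createXMLEventReader(input);
        XMLEventWriter out = XMLOutputFactory.newInstance().createXMLEventWriter(output);

        List<XMLEvent> buffer = new ArrayList<>();   // events of the current <e>
        StringBuilder text = new StringBuilder();    // text of its <m> child
        boolean inE = false;
        boolean inM = false;

        while (in.hasNext()) {
            XMLEvent ev = in.nextEvent();

            if (ev.isStartElement()) {
                String name = ev.asStartElement().getName().getLocalPart();
                if (name.equals("e")) {              // a new entry starts
                    inE = true;
                    buffer.clear();
                    text.setLength(0);
                } else if (inE && name.equals("m")) {
                    inM = true;
                }
            }

            if (inE) {
                buffer.add(ev);                      // hold back until we know if it matches
                if (inM && ev.isCharacters()) {
                    text.append(ev.asCharacters().getData());
                }
            } else {
                out.add(ev);                         // everything outside <e> is copied as-is
            }

            if (ev.isEndElement()) {
                String name = ev.asEndElement().getName().getLocalPart();
                if (name.equals("m")) {
                    inM = false;
                } else if (name.equals("e")) {       // entry complete: emit it if it matched
                    if (text.indexOf(needle) >= 0) {
                        for (XMLEvent buffered : buffer) {
                            out.add(buffered);
                        }
                    }
                    buffer.clear();
                    inE = false;
                }
            }
        }
        out.flush();
        out.close();
        in.close();
    }
}
```

To run it on a file, pass a reader over the log and a writer over the target file. Memory use should stay proportional to the largest single e element rather than to the file size.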

Write an XSLT that returns, in your preferred XML layout, only the values you need from the large XML files.

However, if you want to process the values further in Java, then either:

convert that simple XML into a POJO and read the values (preferred option)
use a regex to extract the values

Example of using StreamSource to run XML through an XSLT:

Imports used (ClassPathResource comes from Spring):

import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.File;
import java.io.StringReader;
import java.io.StringWriter;
import org.springframework.core.io.ClassPathResource;

Code:

        String xmlStr = "<A><b>value</b><c>value</c></A>";
        File xslt = new ClassPathResource("xslt/Transformer.xslt").getFile();
        Source xsltSource = new StreamSource(xslt);
        Source xmlSource = new StreamSource(new StringReader(xmlStr));
        TransformerFactory transformerFactory = TransformerFactory.newInstance();
        Transformer transformer = transformerFactory.newTransformer(xsltSource);
        StringWriter stringWriter = new StringWriter();
        transformer.transform(xmlSource, new StreamResult(stringWriter));
        String response = stringWriter.toString();