I have a big (1.9 GB) XML file which has data I want to insert into a MySQL database every month. I have made an Ant script for this.
The Ant XSLT task can't handle one file this big, so I have a task that uses xml_split (from xml-twig-tools) to split the 1.9 GB xml file into smaller xml files of roughly 4 MB.
This all goes well.
I use the following Ant xml to run the XSLT task over all these XML files:
<target name="xsltransform" depends="split" description="Transform XML to SQL...">
<xslt basedir="${import.dir}/"
destdir="${import.dir}/sql/"
style="${xsl.filename}" force="true">
<mapper type="glob" from="*.xml" to="*.sql" />
<factory name="net.sf.saxon.TransformerFactoryImpl"/>
</xslt>
</target>
The problem is that, as soon as it starts on the first XML file, I see the 'RES' memory in Linux top grow with every subsequent XML file. Since it is processing multiple (unrelated) XML files, I would expect it to free memory between transformations. It doesn't: after two hundred 4 MB XML files, Java throws an out-of-memory error:
BUILD FAILED
/var/lib/hudson/jobs/EPDB_Rebuild_Monthly/workspace/trunk/buildfiles/buildMonthly.xml:67: java.lang.OutOfMemoryError: Java heap space
at net.sf.saxon.tinytree.TinyTree.ensureNodeCapacity(Unknown Source)
at net.sf.saxon.tinytree.TinyTree.addNode(Unknown Source)
at net.sf.saxon.tinytree.TinyBuilder.startElement(Unknown Source)
at net.sf.saxon.event.Stripper.startElement(Unknown Source)
at net.sf.saxon.event.ReceivingContentHandler.startElement(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.startElement(Unknown Source)
at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
at net.sf.saxon.event.Sender.sendSAXSource(Unknown Source)
at net.sf.saxon.event.Sender.send(Unknown Source)
at net.sf.saxon.event.Sender.send(Unknown Source)
at net.sf.saxon.Controller.transform(Unknown Source)
at org.apache.tools.ant.taskdefs.optional.TraXLiaison.transform(TraXLiaison.java:194)
at org.apache.tools.ant.taskdefs.XSLTProcess.process(XSLTProcess.java:812)
at org.apache.tools.ant.taskdefs.XSLTProcess.execute(XSLTProcess.java:408)
at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
at org.apache.tools.ant.Task.perform(Task.java:348)
at org.apache.tools.ant.Target.execute(Target.java:390)
at org.apache.tools.ant.Target.performTasks(Target.java:411)
at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1360)
at org.apache.tools.ant.Project.executeTarget(Project.java:1329)
Is there something I can do to prevent the XSLT task eating up all my memory? Or should I reconsider my approach?
We can all agree that it should be letting go of the memory, but since it doesn't, you can try breaking up the xslt task into separate calls, e.g. using Ant-Contrib's for task:
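<for param="file">
<!-- only pick up the split XML files, not the generated SQL in the sql/ subdirectory -->
<fileset dir="${import.dir}" includes="*.xml"/>
<sequential>
<!-- derive the output name; the property name is unique per file -->
<basename property="@{file}.base" file="@{file}" suffix="xml"/>
<!-- a single-file transform pairs in= with out= -->
<xslt in="@{file}"
out="${import.dir}/sql/${@{file}.base}.sql"
style="${xsl.filename}" force="true">
<factory name="net.sf.saxon.TransformerFactoryImpl"/>
</xslt>
</sequential>
</for>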
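The for task is not part of core Ant, so it has to be declared before it can be used. A minimal sketch, assuming the Ant-Contrib jar sits at a location held in a hypothetical property named antcontrib.jar (adjust to wherever the jar lives on your build machine):

```xml
<!-- Assumption: ${antcontrib.jar} is a hypothetical property pointing at the Ant-Contrib jar -->
<taskdef resource="net/sf/antcontrib/antlib.xml">
<classpath>
<pathelement location="${antcontrib.jar}"/>
</classpath>
</taskdef>
```

Note that the antlib.xml resource is the one that defines for; the older antcontrib.properties resource does not include it.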
If that doesn't do the trick, then since you are using Saxon, you can call Saxon's command-line class directly in a forked JVM, e.g.:
<java classname="net.sf.saxon.Transform" failonerror="true" fork="true">
<arg value="-s:${import.dir}" />
<arg value="-xsl:${xsl.filename}" />
<arg value="-o:${import.dir}/sql" />
</java>
Or you can combine both, forking a fresh JVM for every file:
<for param="file">
<fileset dir="${import.dir}" includes="*.xml"/>
<sequential>
<basename property="@{file}.base" file="@{file}" suffix="xml"/>
<java classname="net.sf.saxon.Transform" failonerror="true" fork="true">
<arg value="-s:@{file}" />
<arg value="-xsl:${xsl.filename}" />
<arg value="-o:${import.dir}/sql/${@{file}.base}.sql" />
</java>
</sequential>
</for>
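And for bonus points, you could try to speed things up a bit by running the iterations in parallel, which the for task supports through its parallel attribute:
<!-- threadCount="4" is an arbitrary example value; tune it to the cores and memory available -->
<for param="file" parallel="true" threadCount="4">
<fileset dir="${import.dir}" includes="*.xml"/>
<sequential>
<basename property="@{file}.base" file="@{file}" suffix="xml"/>
<java classname="net.sf.saxon.Transform" failonerror="true" fork="true">
<arg value="-s:@{file}" />
<arg value="-xsl:${xsl.filename}" />
<arg value="-o:${import.dir}/sql/${@{file}.base}.sql" />
</java>
</sequential>
</for>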
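One detail the forked variants above leave implicit: the new JVM needs Saxon on its own classpath, and it gets its own heap limit. A hedged sketch of the inner java call, assuming a hypothetical saxon.jar property that points at the Saxon jar, and an arbitrary 512 MB heap that you would tune for your stylesheet:

```xml
<!-- Assumptions: ${saxon.jar} is a hypothetical property; 512m is an example value, not a recommendation -->
<java classname="net.sf.saxon.Transform" failonerror="true" fork="true" maxmemory="512m">
<classpath>
<pathelement location="${saxon.jar}"/>
</classpath>
<arg value="-s:@{file}" />
<arg value="-xsl:${xsl.filename}" />
<arg value="-o:${import.dir}/sql/${@{file}.base}.sql" />
</java>
```

Because each fork is a separate process, whatever memory a transformation holds on to is returned to the operating system when that process exits, regardless of what the XSLT task does inside the main Ant JVM.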