Parallel XML Parsing in Java

https://www.devze.com 2023-01-25 08:54 (source: web)
I'm writing an application which processes a lot of XML files (>1000) with deep node structures. It takes about six seconds with Woodstox (Event API) to parse a file with 22,000 nodes.

The algorithm runs in a process with user interaction, where only a few seconds of response time are acceptable, so I need to improve my strategy for handling the XML files.

  1. My process analyses the xml files (extracts only a few nodes).
  2. Extracted nodes are processed and the new result is written into a new data stream (resulting in a copy of the document with modified nodes).
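
To make step 2 concrete, here is a minimal sketch of the extract-and-rewrite pass using the standard StAX cursor API (javax.xml.stream), which Woodstox implements. The `<name>` element and the uppercase transform are hypothetical placeholders for whatever nodes you actually modify; attributes, namespaces, and the XML declaration are omitted for brevity:

```java
import javax.xml.stream.*;
import java.io.*;

public class StaxRewrite {
    // Streams a copy of the document, modifying the text of <name> elements
    // (a hypothetical target node) on the fly. Simplified: skips attributes,
    // namespaces, comments, and processing instructions.
    static String rewrite(String xml) throws XMLStreamException {
        XMLInputFactory inF = XMLInputFactory.newInstance();
        XMLOutputFactory outF = XMLOutputFactory.newInstance();
        XMLStreamReader r = inF.createXMLStreamReader(new StringReader(xml));
        StringWriter sw = new StringWriter();
        XMLStreamWriter w = outF.createXMLStreamWriter(sw);
        boolean inName = false;
        while (r.hasNext()) {
            switch (r.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    w.writeStartElement(r.getLocalName());
                    inName = "name".equals(r.getLocalName());
                    break;
                case XMLStreamConstants.CHARACTERS:
                    // Apply the (hypothetical) modification only inside <name>
                    w.writeCharacters(inName ? r.getText().toUpperCase() : r.getText());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    w.writeEndElement();
                    inName = false;
                    break;
                default:
                    break; // other event types dropped in this sketch
            }
        }
        w.flush();
        r.close();
        w.close();
        return sw.toString();
    }
}
```

In a real pipeline the `StringReader`/`StringWriter` pair would be replaced by file streams, so memory use stays constant regardless of document size.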

Now I'm thinking about a multithreaded solution (which scales better on 16+ core hardware). I thought about the following strategies:

  1. Creating multiple parsers and running them in parallel on the xml sources.
  2. Rewriting my parsing algorithm to be thread-safe, so it can use only one instance of the parser (factories, ...)
  3. Splitting the XML source into chunks and assigning the chunks to multiple processing threads (map-reduce on serial XML)
  4. Optimizing my algorithm (a better StAX parser than Woodstox?) / using a parser with built-in concurrency

I want to improve both the overall performance and the per-file performance.

Do you have experience with such problems? What is the best way to go?


  1. This one is obvious: just create several parsers and run them in parallel in multiple threads.

  2. Take a look at Woodstox Performance (the site is down at the moment; try the Google cache).

  3. This can be done IF the structure of your XML is predictable: if it has a lot of identical top-level elements. For instance:

    <element>
        <more>more elements</more>
    </element> 
    <element>
        <other>other elements</other>
    </element>
    

    In this case you could create a simple splitter that searches for <element> and feeds each part to a particular parser instance. That's a simplified approach: in real life I'd go with RandomAccessFile to find the start/stop points (<element>) and then create a custom FileInputStream that operates on just a part of the file.

  4. Take a look at Aalto. It's from the same guys who created Woodstox. They are experts in this area - don't reinvent the wheel.
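
The splitter idea from point 3 can be sketched as an offset scan. This is a simplified illustration, assuming an ASCII-compatible encoding and that the tag name never occurs inside comments or CDATA; real code would use RandomAccessFile instead of reading the whole file into memory:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

public class XmlSplitter {
    // Finds the byte offsets of each top-level "<element>" start tag so the
    // file can be handed to parsers as independent [start, end) slices.
    static List<long[]> slices(Path file, String tag) throws IOException {
        byte[] data = Files.readAllBytes(file); // sketch only; use RandomAccessFile for big files
        byte[] open = ("<" + tag + ">").getBytes(StandardCharsets.US_ASCII);
        List<Long> starts = new ArrayList<>();
        for (int i = 0; i + open.length <= data.length; i++) {
            boolean hit = true;
            for (int j = 0; j < open.length; j++) {
                if (data[i + j] != open[j]) { hit = false; break; }
            }
            if (hit) starts.add((long) i);
        }
        // Each slice runs from one start tag to the next (or to end of file).
        List<long[]> out = new ArrayList<>();
        for (int k = 0; k < starts.size(); k++) {
            long end = (k + 1 < starts.size()) ? starts.get(k + 1) : data.length;
            out.add(new long[]{starts.get(k), end});
        }
        return out;
    }
}
```

Each slice is then itself a well-formed XML fragment (one `<element>...</element>` subtree), so a separate parser instance can consume it without any wrapper document.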


I agree with Jim. I think that if you want to improve the performance of the overall processing of 1000 files, your plan is good, except for #3, which is irrelevant in this case. If, however, you want to improve the performance of parsing a single file, you have a problem. I do not know how it is possible to split an XML file without parsing it. Each chunk will be illegal XML and your parser will fail.

I believe that improving the overall time is good enough for you. In this case, read this tutorial: http://download.oracle.com/javase/tutorial/essential/concurrency/index.html, then create a thread pool of, for example, 100 threads and a queue that contains the XML sources. Each thread will then parse only 10 files, which will bring a serious performance benefit in a multi-CPU environment.
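
A minimal sketch of that thread-pool approach, counting elements as a stand-in for your real extraction logic. The pool size is a tunable (the example uses one thread per core, which a later answer also recommends); note that the StAX spec does not guarantee that factories are thread-safe, though Woodstox factories are safe to share once configured:

```java
import javax.xml.stream.*;
import java.io.StringReader;
import java.util.*;
import java.util.concurrent.*;

public class ParallelParse {
    // Counts START_ELEMENT events in one XML source using the cursor API.
    // Stand-in for the real per-file extraction work.
    static int countElements(String xml, XMLInputFactory factory) throws XMLStreamException {
        XMLStreamReader r = factory.createXMLStreamReader(new StringReader(xml));
        int count = 0;
        while (r.hasNext()) {
            if (r.next() == XMLStreamConstants.START_ELEMENT) count++;
        }
        r.close();
        return count;
    }

    // Submits every source to a fixed-size pool and collects the results.
    static List<Integer> parseAll(List<String> sources) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance(); // create once, reuse
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        List<Future<Integer>> futures = new ArrayList<>();
        for (String src : sources) {
            futures.add(pool.submit(() -> countElements(src, factory)));
        }
        List<Integer> results = new ArrayList<>();
        for (Future<Integer> f : futures) results.add(f.get());
        pool.shutdown();
        return results;
    }
}
```

In practice the `String` sources would be file paths or streams read from the queue; the structure stays the same.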


In addition to the existing good suggestions, there is one rather simple thing to do: use the Cursor API (XMLStreamReader), NOT the Event API. The Event API adds 30-50% overhead without (just IMO) making processing significantly easier. In fact, if you want convenience, I would recommend using StaxMate instead; it builds on top of the Cursor API without adding significant overhead (at most 5-10% compared to hand-written code).

Now: I assume you have done basic optimizations with Woodstox; but if not, check out "3 Simple Rules for Fast XML-processing using Stax". Specifically, you absolutely should:

  1. Make sure you only create XMLInputFactory and XMLOutputFactory instances once
  2. Close readers and writers to ensure buffer recycling (and other useful reuse) works as expected.

The reason I mention this is that while these make no functional difference (the code works as expected), they can make a big performance difference; although more so when processing smaller files.
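
For illustration, here is what both rules look like together with the cursor API; the `<id>` element is a hypothetical example of a node you would extract:

```java
import javax.xml.stream.*;
import java.io.Reader;
import java.util.*;

public class CursorExtract {
    // Rule 1: create the factory once and reuse it for every file.
    private static final XMLInputFactory FACTORY = XMLInputFactory.newInstance();

    // Pulls the text of every <id> element (hypothetical target node)
    // with the cursor API - no per-event object allocation.
    static List<String> extractIds(Reader source) throws XMLStreamException {
        List<String> ids = new ArrayList<>();
        XMLStreamReader r = FACTORY.createXMLStreamReader(source);
        try {
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT
                        && "id".equals(r.getLocalName())) {
                    ids.add(r.getElementText()); // leaves reader at the END_ELEMENT
                }
            }
        } finally {
            r.close(); // Rule 2: closing lets Woodstox recycle its buffers
        }
        return ids;
    }
}
```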

Running multiple instances also makes sense, although usually with at most one thread per core. However, you will only get a benefit as long as your storage I/O can support such speeds; if the disk is the bottleneck, this will not help and can in some cases hurt (if disk seeks compete). But it is worth a try.

