I know there are some very good Perl XML parsers like XML::Xerces, XML::Parser::Expat, XML::Simple, XML::RapidXML, XML::LibXML, XML::Liberal, etc.
Which XML parser would you select for parsing large files, and on what criteria would you choose one over another? If the one you would select is not in the list, please suggest it.
If you're parsing files of that size, you'll want to avoid any parser that tries to load the entire document into memory and construct a DOM (Document Object Model).
Instead, look for a SAX-style parser: one that treats the input file as a stream, raising events when elements and attributes are encountered. This approach allows you to process the file gradually, without having to hold the entire thing in memory at once.
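To make the event model concrete, here is a minimal sketch of a streaming handler. It is written with Python's standard `xml.sax` module only so that it runs without installing anything; it is not Perl, and it is not a recommendation of any particular module. The element name `record` and the attribute `size` are made up for the illustration.

```python
import io
import xml.sax


class RecordStats(xml.sax.ContentHandler):
    """Keeps only running totals, so memory use does not grow with file size."""

    def __init__(self):
        super().__init__()
        self.count = 0
        self.total_size = 0

    def startElement(self, name, attrs):
        # Called once per opening tag as the parser streams through the input.
        # "record" and "size" are hypothetical names for this sketch.
        if name == "record":
            self.count += 1
            self.total_size += int(attrs.get("size", "0"))


def scan(source):
    """Stream a filename or file-like object through the handler."""
    handler = RecordStats()
    xml.sax.parse(source, handler)
    return handler


# A tiny in-memory stand-in for the real (huge) file.
sample = io.BytesIO(
    b"<records>"
    b"<record size='10'/><record size='32'/><record/>"
    b"</records>"
)
stats = scan(sample)
print(stats.count, stats.total_size)
```

The point is that the handler never sees the whole document, only one event at a time, so whatever state you keep between events is entirely up to you.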
With a 15 GB file, your parser will have to be SAX-based, because at that size simply being able to process the data at all is your first task.
I recommend you read XML::SAX::Intro.
A SAX parser is one option. Other options that don't involve loading the entire doc into memory are XML::Twig and XML::Rules.
For parsing such files I have always used XML::Parser. It is simple, available everywhere, and works well.
You could also consider using a database with XML extensions (see here for an example). You could do a bulk load of XML data into the database, then you can do SQL queries (or XQueries) on that data.
As you would expect I would suggest XML::Twig, which will let you process the file chunk-by-chunk. This of course assumes that you can process your file this way. It will probably be easier to use than SAX, as you can process the tree for each chunk with DOM-like methods.
An alternative would be to use a pull parser (XML::LibXML offers one through XML::LibXML::Reader), which is a little similar to what XML::Twig offers.
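The chunk-by-chunk idea described above can be sketched roughly as follows. This is not XML::Twig; it is an analogous pattern using `xml.etree.ElementTree.iterparse` from Python's standard library, shown only because it runs as-is. It assumes the repeating chunks are direct children of the root element, and the names `library`, `record` and `title` are invented for the example.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, tag="record"):
    """Yield each completed <record> subtree, then discard it.

    Assumes (for this sketch) that records are direct children of the root.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem    # caller gets a small, fully built tree for one chunk
            root.clear()  # detach finished chunks so the tree never grows


# A tiny in-memory stand-in for the real file.
sample = io.BytesIO(
    b"<library>"
    b"<record><title>A</title></record>"
    b"<record><title>B</title></record>"
    b"</library>"
)
# Each record must be used before asking for the next one.
titles = [rec.findtext("title") for rec in iter_records(sample)]
print(titles)
```

Inside the loop you can use tree-style lookups (`findtext`, `find`, `get`) on each chunk, which is what makes this style more comfortable than raw SAX callbacks.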
I'm going for a mutated version of tster's answer above. Load the bloody thing into a DB (if possible, via direct XML import; if not, by using a SAX parser to parse the file and produce loadable data sets). Then use the DB as the data store. At 15 GB, you are way beyond the size of data that should be manipulated outside of a DB.
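A rough sketch of the "parse the stream and produce loadable data sets" route, assuming a made-up layout of `<record id="..."><name>...</name></record>` and using SQLite purely as a stand-in for whatever database you actually have. It uses Python's `iterparse` rather than a SAX handler to keep it short; the table name, column names and batch size are all hypothetical.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET


def load_records(source, conn, batch_size=1000):
    """Stream <record> elements into a table in batches; returns rows loaded."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS record (id INTEGER PRIMARY KEY, name TEXT)"
    )
    batch, loaded = [], 0
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "record":  # hypothetical element name
            continue
        batch.append((int(elem.get("id")), elem.findtext("name")))
        elem.clear()  # drop the record's children; an empty shell stays behind
        if len(batch) >= batch_size:
            conn.executemany("INSERT INTO record VALUES (?, ?)", batch)
            loaded += len(batch)
            batch.clear()
    if batch:  # flush the final partial batch
        conn.executemany("INSERT INTO record VALUES (?, ?)", batch)
        loaded += len(batch)
    conn.commit()
    return loaded


sample = io.BytesIO(
    b"<records>"
    b"<record id='1'><name>a</name></record>"
    b"<record id='2'><name>b</name></record>"
    b"<record id='3'><name>c</name></record>"
    b"</records>"
)
conn = sqlite3.connect(":memory:")
loaded = load_records(sample, conn, batch_size=2)
row = conn.execute("SELECT name FROM record WHERE id = 2").fetchone()
print(loaded, row)
```

Once the data is in a table, the questions you would otherwise answer by re-scanning 15 GB of XML become ordinary indexed queries.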