We've been using libxml-ruby for a couple of years. It is fantastic on files of 30 MB or less, but it is PLAGUED by seg faults. Nobody at the project really seems to care to fix them, only to blame them on third-party software. That's their prerogative, of course; it's free.
Yet I am still unable to read these large files. I suppose I could write some miserable hack to split them into smaller files, but I would like to avoid that. Does anyone else have experience reading very large XML files in Ruby?
When loading big files, whether they are XML or not, you should consider taking pieces at a time (in this case called streaming) rather than loading the entire file into memory.
I would highly suggest reading this article about pull parsers. Using this technique will let you read the file with much greater ease than loading all of it into memory at once.
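To make the idea concrete, here is a minimal sketch using libxml-ruby, since that's the library in question; the filenames are placeholders:

    require 'libxml'

    # Builds the entire document tree in memory -- fine for small files only:
    doc = LibXML::XML::Document.file('small.xml')
    puts doc.root.name

    # Streams the document, holding only the current node in memory:
    reader = LibXML::XML::Reader.file('huge.xml')
    reader.read  # advances the cursor one node at a time
    reader.close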
Thanks everyone for your excellent input. I was able to solve my problem by looking at Processing large XML file with libxml-ruby chunk by chunk.
The answer was to avoid the use of:
reader.expand
and to instead use:
reader.read
or:
reader.next
in conjunction with:
reader.node
As long as you aren't trying to store the node as is, it works great. You want to operate on that node immediately, because the next call to reader.next will blow it away. A minimal sketch of the pattern follows.
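This sketch assumes a file huge.xml whose interesting elements are named record (both placeholder names):

    require 'libxml'

    reader = LibXML::XML::Reader.file('huge.xml')

    while reader.read
      next unless reader.node_type == LibXML::XML::Reader::TYPE_ELEMENT &&
                  reader.name == 'record'

      node = reader.node  # only valid until the cursor moves again
      # Operate on the node right here; don't keep a reference to it.
      puts node['id']
    end

    reader.close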
To respond to an earlier answer: from what I can understand, libxml-ruby IS a streaming parser. The seg faults arose from garbage-collection issues that were causing memory leaks galore. Once I learned not to use reader.expand, everything came up roses.
UPDATE:
I was NOT able to solve my problem after all. There appears to be NO WAY to get to the subtree without using reader.expand.
And so I guess there is no way to read and parse a large XML file with libxml-ruby? The reader.expand memory-leak bug has been open, without even a response, since 2009? FAIL FAIL FAIL.
I'd recommend looking into a SAX XML parser. SAX parsers are designed to handle huge files. I haven't needed one in a while, but they're pretty easy to use: as the parser reads the XML file, it passes your code various events, which you catch and handle.
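As a rough illustration, a minimal SAX handler with Nokogiri (mentioned below) might look like this; the item element name is a placeholder:

    require 'nokogiri'

    class ItemCounter < Nokogiri::XML::SAX::Document
      attr_reader :count

      def initialize
        @count = 0
      end

      # Called once for every opening tag as the file streams past.
      def start_element(name, attrs = [])
        @count += 1 if name == 'item'
      end
    end

    handler = ItemCounter.new
    Nokogiri::XML::SAX::Parser.new(handler).parse(File.open('huge.xml'))
    puts handler.count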
The Nokogiri site links to SAX Machine, which is built on Nokogiri, so that would be another option. Either way, Nokogiri is very well supported and used by a lot of people, including me for all the HTML and XML I parse. It supports both DOM and SAX parsing, allows CSS and XPath accessors, and uses libxml2 for its parsing, so it's fast and built on a standard parsing library.
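If the SAX Machine route appeals, a rough sketch of its declarative style (the class and element names here are made up for illustration):

    require 'sax-machine'

    # Hypothetical mapping class; 'entry' and 'title' are placeholder names.
    class Entry
      include SAXMachine
      element :title
    end

    entry = Entry.parse('<entry><title>Hello</title></entry>')
    puts entry.title  # => "Hello"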
libxml-ruby indeed has plenty of bugs: not just crashes, but version incompatibilities, memory leaks, and so on.
I highly recommend Nokogiri. The Ruby community has rallied around Nokogiri as the new hotness for fast XML parsing. It has a reader pull parser, a SAX parser, and your standard in-memory DOM-ish parser.
For really large XML files, I'd recommend the Reader, because it's as fast as SAX but easier to program for: you don't have to keep track of as much state manually.
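A minimal sketch of the Reader approach, again with placeholder file and element names:

    require 'nokogiri'

    reader = Nokogiri::XML::Reader(File.open('huge.xml'))

    reader.each do |node|
      next unless node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT &&
                  node.name == 'record'

      # outer_xml extracts just this subtree, which can then be parsed as a
      # small standalone fragment instead of holding the whole document.
      record = Nokogiri::XML(node.outer_xml).root
      puts record['id']
    end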