What are fast XML parsers for Ruby? [closed]

Developer https://www.devze.com 2023-01-22 02:21 Source: Internet
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.

We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.

Closed 7 years ago.

I am using Nokogiri, which works well for small documents. But for a 180KB HTML file I have to increase the process stack size via ulimit -s, and both the parsing and the XPath queries take a long time.

Are there faster methods available using a stock Ruby distribution?

I am getting used to XPath, but the solution does not necessarily need to support XPath.

The criteria are:

  1. Fast to write.
  2. Fast execution.
  3. Robust resulting parser.


Check out the Ox gem. It is faster than LibXML and Nokogiri and supports in-memory parsing as well as SAX callback parsing. Full disclosure: I wrote it.


The performance comparison at http://www.ohler.com/software/thoughts/Blog/Entries/2011/9/21_XML_with_Ruby.html covers both DOM (in-memory) and SAX (callback) parsers.


Nokogiri is based on libxml2, which is one of the fastest XML/HTML parsers in any language. It is written in C, but there are bindings in many languages.

The problem is that the more complex the file, the longer it takes to build a complete DOM structure in memory. Creating a DOM is slower and more memory-hungry than other parsing methods (generally the entire DOM must fit into memory). XPath relies on this DOM.
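To make the DOM-plus-XPath approach concrete, here is a minimal sketch. It uses REXML rather than Nokogiri purely so that it runs without installing a gem (REXML ships with Ruby); the sample markup and variable names are made up for illustration. The whole tree is built in memory first, and only then can the XPath expression be evaluated against it.

```ruby
require 'rexml/document'

# Hypothetical sample input; a real page would be far larger.
html = '<html><body><p>Hello</p><div><p>World</p></div></body></html>'

# Building the Document parses everything into an in-memory tree.
doc = REXML::Document.new(html)

# XPath is evaluated against that finished tree.
paragraphs = REXML::XPath.match(doc, '//p').map(&:text)

puts paragraphs.inspect
```

Note that REXML requires well-formed input, so unlike Nokogiri it is not a fit for messy real-world HTML.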

SAX is often what people turn to for speed or for large documents that don't fit into memory. It is more event driven: it notifies you of a start element, end element, etc, and you write handlers to react to them. It's a bit of a pain because you end up keeping track of state yourself (e.g. which elements you're "inside").
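The state-tracking burden can be seen in a small sketch. It uses REXML's stream listener from the Ruby distribution as a stand-in for a SAX handler (Nokogiri's SAX interface follows the same callback idea); the class name TitleCollector and the sample feed are hypothetical.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical handler: collects the text of every <title> element.
class TitleCollector
  include REXML::StreamListener

  attr_reader :titles

  def initialize
    @stack = []   # names of the elements we are currently inside
    @titles = []
  end

  # Called for each opening tag.
  def tag_start(name, _attrs)
    @stack.push(name)
    @titles << String.new if name == 'title'
  end

  # Called for each run of character data.
  def text(data)
    @titles.last << data if @stack.last == 'title'
  end

  # Called for each closing tag.
  def tag_end(_name)
    @stack.pop
  end
end

xml = '<feed><item><title>First</title></item><item><title>Second</title></item></feed>'
listener = TitleCollector.new
REXML::Document.parse_stream(xml, listener)

puts listener.titles.inspect
```

No tree is ever built here; the only memory held is the element stack and the collected titles.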

There is a middle ground: some parsers have a "pull parsing" capability where you have a cursor-like navigation. You still visit each node sequentially, but you can "fast-forward" to the end of an element you're not interested in. It's got the speed of SAX but a better interface for many uses. I don't know if Nokogiri can do this for HTML, but I'd look into its Reader API if you're interested.
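As a rough illustration of the cursor style, the sketch below uses REXML's pull parser from the Ruby distribution instead of Nokogiri's Reader, so that it runs without extra gems. The helper skip_subtree and the sample markup are assumptions made for this example. Here "fast-forwarding" means draining events until the uninteresting element closes; the parser still tokenizes that subtree, but the calling code does not have to handle it.

```ruby
require 'rexml/parsers/pullparser'

# Hypothetical helper: consume events until the element that was just
# opened has been closed again.
def skip_subtree(parser)
  depth = 1
  while depth > 0 && parser.has_next?
    event = parser.pull
    depth += 1 if event.start_element?
    depth -= 1 if event.end_element?
  end
end

xml = '<page><nav><a>Home</a><a>About</a></nav>' \
      '<main><p>Hello</p><p>World</p></main></page>'

parser = REXML::Parsers::PullParser.new(xml)
paragraphs = []
inside_p = false

while parser.has_next?
  event = parser.pull
  if event.start_element?
    case event[0]            # event[0] is the element name
    when 'nav'
      skip_subtree(parser)   # not interested in navigation links
    when 'p'
      inside_p = true
      paragraphs << String.new
    end
  elsif event.end_element?
    inside_p = false if event[0] == 'p'
  elsif event.text? && inside_p
    paragraphs.last << event[0]  # event[0] is the raw text
  end
end

puts paragraphs.inspect
```

The caller drives the loop, which is what distinguishes this from SAX: control flow stays in ordinary code rather than being split across callbacks.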

Note that Nokogiri is also very lenient with malformed markup (such as real-world HTML) and this alone makes it a very good choice for HTML parsing.


The link to Ox is http://rubygems.org/gems/ox. A discussion of the performance differences: http://www.ohler.com/software/thoughts/Blog/Entries/2011/9/21_XML_with_Ruby.html


You may find that for larger XML documents DOM parsing is not very performant. This is because the parser has to build an in-memory map of the structure of the XML document.

The other approach that generally requires a smaller memory footprint is to use an event-driven SAX parser.

Nokogiri has full support for SAX.


Depending on your environment, Oga may be better suited: it is a fast-enough XML parser for Ruby with a much better interface and a faster installation time.
