Difference between Nokogiri::XML(File.open()) and Nokogiri.parse(open())

Developer https://www.devze.com 2023-01-06 05:59 Source: Internet
I tried opening an XML file both ways, but XPath only worked with the latter.

For example, with doc assigned as in the title, doc.xpath('//feed/xyz') worked only when I opened the file using the parse method.

One thing I noticed: the object I get from Nokogiri::XML is a Nokogiri::XML::Document, while the one I get from Nokogiri.parse is a Nokogiri::HTML::Document.

Any comments?


Nokogiri uses a simple test to determine whether a document is HTML or XML when you call the generic Nokogiri.parse method. I've seen it return the wrong results, and the best solution is to give Nokogiri a bit more help.

Instead of using parse, use Nokogiri::XML('some xml string') or Nokogiri::HTML('some html string') and it will always do the right thing. See Parsing an HTML / XML Document.

XML, by definition, must be well-formed. Nokogiri is helpful and will try to parse invalid XML (otherwise it couldn't parse HTML), but when it encounters bad XML it records each problem in the document's errors array. If you know the source of your document is reliable you can skip the check, but it's so easy that you might as well call doc.errors.any? and react if it's true.

You don't say what type of XML you are trying to parse, but there's XML and then there's wanna-be XML. Your XPath suggests you're trying to parse a feed. I've encountered so many bad XML feeds that I am not surprised you ran into errors. Nokogiri tries to be understanding about real-world conditions, but sometimes that's not enough and you have to tell Nokogiri to be more lenient when parsing. See the parse options for Nokogiri::XML for the available flags.

You also say, in your comment on the selected answer, that the document opens fine in the browser. A browser is not a good measure of whether a document is valid, because browsers do not validate; instead, they do everything they can to present something readable, even if it isn't actually correct. A parser like Nokogiri needs to be a lot more rigid, because there isn't a human brain interpreting the results. Code that extracts data from XML is not as forgiving about errors, nor should it be.


Nokogiri.parse parsed your file as an HTML document, while Nokogiri::XML expects a valid XML document. It seems that when XML parsing fails, no error is raised; instead, an empty XML document is generated. Try puts doc.to_s and you'll probably see "<?xml version=\"1.0\"?>\n".
