Unclosed tags and Nokogiri

Developer https://www.devze.com 2023-01-29 06:43 Source: the web
My test html file is here: http://pastebin.com/L88nYbQY

As you can see, there are some unclosed input tags, and some self-closing ones.

This causes the following code to return everything from the opening #qcbody div to the end of the file, ignoring the closing div tag.

require 'nokogiri'

f = File.open('t.html', 'r')
@doc = Nokogiri::XML(f)
@doc.at_css('#qcbody').to_html

I'm sure people have gotten around this problem in a variety of ways. How would you do it?


Give this a try:

require 'nokogiri'

# Parse with the HTML parser, which tolerates unclosed <input> tags.
# (open-uri is not needed to read a local file.)
@doc = Nokogiri::HTML(File.open('t.html', 'r'))
@doc.at_css('#qcbody').to_html

In IRB:

>> @doc.at_css('#qcbody').to_html
=> "<div id=\"qcbody\">         \r\n    <form method=\"post\" name=\"form\" id=\"form\" action=\"#\">\r\n      <input type=\"hidden\" name=\"Search Engine\" id=\"Search Engine\"><input type=\"hidden\" name=\"Keyword\" id=\"Keyword\"><input type=\"button\" onclick=\"javascript:validate()\" name=\"sendsubmit\" id=\"sendsubmit\" class=\"submit\">\n</form>\r\n    <div class=\"clear\"></div>\r\n  </div>"

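The difference between using Nokogiri::XML and Nokogiri::HTML is the leniency when parsing the document. XML is required to be well-formed, and some XML parsers would reject a file that doesn't meet the standard. An unclosed input tag is fine in HTML, where input never takes a closing tag, but to an XML parser it is an element still waiting to be closed, so the closing tags that follow no longer line up with their openers. Nokogiri allows us to set how picky it is. (And in the case of XML, you can look at the document's errors array after parsing to see if there was a problem.)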

For HTML, Nokogiri relaxes the parser so there's a better chance of handling real-world HTML. I've seen it handle some really ugly markup and keep on going when lesser parsers blew their lunch. If you look at Nokogiri::HTML.parse it has options = XML::ParseOptions::DEFAULT_HTML defined, which are the relaxed settings. You can override that if you want to make sure the HTML conforms.


@doc = Nokogiri::HTML.parse(f)
@doc.at('#qcbody').to_html