Unclosed tags and Nokogiri

Developer https://www.devze.com 2023-01-29 06:43 Source: web

My test HTML file is here: http://pastebin.com/L88nYbQY

As you can see, there are some unclosed input tags and some self-closing ones.

This causes the following code to return everything from the opening #qcbody div to the end of the file, ignoring the closing div tag.

require 'nokogiri'

f = File.open('t.html', 'r')
@doc = Nokogiri::XML(f)
@doc.at_css('#qcbody').to_html

I'm sure people have gotten around this problem in a variety of ways. How would you do it?

Give this a try:

require 'nokogiri'

# Parse with the lenient HTML parser instead of the XML parser.
@doc = Nokogiri::HTML(File.read('t.html'))
@doc.at_css('#qcbody').to_html

In IRB:

>> @doc.at_css('#qcbody').to_html
=> "<div id=\"qcbody\">         \r\n    <form method=\"post\" name=\"form\" id=\"form\" action=\"#\">\r\n      <input type=\"hidden\" name=\"Search Engine\" id=\"Search Engine\"><input type=\"hidden\" name=\"Keyword\" id=\"Keyword\"><input type=\"button\" onclick=\"javascript:validate()\" name=\"sendsubmit\" id=\"sendsubmit\" class=\"submit\">\n</form>\r\n    <div class=\"clear\"></div>\r\n  </div>"

The difference between using Nokogiri::XML and Nokogiri::HTML is how lenient the parser is. XML is required to be well-formed, and some XML parsers reject a document outright if it isn't. In your case the XML parser has no idea that <input> is an empty element in HTML, so each unclosed one stays open and the closing tags after it no longer match up. Nokogiri lets us set how picky it is. (With XML, you can also look at the document's errors array after parsing to see whether there was a problem.)

For HTML, Nokogiri relaxes the parser so there's a better chance of handling real-world HTML. I've seen it handle some really ugly markup and keep on going when lesser parsers blew their lunch. If you look at Nokogiri::HTML.parse, its options parameter defaults to XML::ParseOptions::DEFAULT_HTML, which are the relaxed settings. You can override that if you want to make sure the HTML conforms.

@doc = Nokogiri::HTML.parse(f)
@doc.at('#qcbody').to_html
