My test html file is here: http://pastebin.com/L88nYbQY
As you can see, there are some unclosed input tags and some self-closing ones.
This causes the following code to return everything from the opening #qcbody div to the end of the file, ignoring the closing div tag.
require 'nokogiri'
f = File.open('t.html', 'r')
@doc = Nokogiri::XML(f)
@doc.at_css('#qcbody').to_html
I'm sure people have gotten around this problem in a variety of ways. How would you do it?
Give this a try:
require 'nokogiri'
@doc = Nokogiri::HTML(File.open('t.html', 'r'))
@doc.at_css('#qcbody').to_html
In IRB:
>> @doc.at_css('#qcbody').to_html
=> "<div id="qcbody"> \r\n <form method="post" name="form" id="form" action="#">\r\n <input type="hidden" name="Search Engine" id="Search Engine"><input type="hidden" name="Keyword" id="Keyword"><input type="button" onclick="javascript:validate()" name="sendsubmit" id="sendsubmit" class="submit">\n</form>\r\n <div class="clear"></div>\r\n </div>"
The difference between using Nokogiri::XML and Nokogiri::HTML is the leniency when parsing the document. XML is required to be well-formed, and some XML parsers reject a file outright if it doesn't meet the standard. Nokogiri allows us to set how picky it is. (And in the case of XML, you can look at the document's errors array after parsing to see if there is a problem.)
For HTML, Nokogiri relaxes the parser so there's a better chance of handling real-world HTML. I've seen it handle some really ugly markup and keep on going when lesser parsers blew their lunch. If you look at Nokogiri::HTML.parse, its options parameter defaults to XML::ParseOptions::DEFAULT_HTML, which are the relaxed settings. You can override that if you want to make sure the HTML conforms.
f = File.open('t.html', 'r')
@doc = Nokogiri::HTML.parse(f)
@doc.at('#qcbody').to_html