How does Nokogiri handle unclosed HTML tags like <br>?

开发者 (Developer) https://www.devze.com 2023-03-29 01:49 Source: web

Related topics: nokogiri ruby

When parsing an HTML document, how does Nokogiri handle <br> tags? Suppose we have a document that looks like this:

<div>
   Hi <br>
   How are you? <br>
</div>

Does Nokogiri know that <br> tags are something special, not just regular XML tags, and ignore them when parsing the nodes? I think Nokogiri is that smart, but I want to make sure before I accept this project, which involves scraping a site written in HTML4. You know what I mean: "How are you?" is not the content of the first <br>, as it would be in XML.

Here's how Nokogiri behaves when parsing (malformed) XML:

require 'nokogiri'
doc = Nokogiri::XML("<div>Hello<br>World</div>")
puts doc.root
#=> <div>Hello<br>World</br></div>

Here's how Nokogiri behaves when parsing HTML:

require 'nokogiri'
doc = Nokogiri::HTML("<div>Hello<br>World</div>")
puts doc.root
#=> <html><body><div>Hello<br>World</div></body></html>

p doc.at('div').text
#=> "HelloWorld"

I'm assuming that by "something special" you mean that you want Nokogiri to treat it like a newline in the source text. A <br> is not something special, and so, appropriately, Nokogiri does not treat it differently from any other element.

If you want it to be treated as a newline, you can do this:

doc.css('br').each{ |br| br.replace("\n") }
p doc.at('div').text
#=> "Hello\nWorld"

Similarly, if you wanted a space instead:

doc.css('br').each{ |br| br.replace(" ") }
p doc.at('div').text
#=> "Hello World"

You must parse this fragment with the HTML parser, since it is not valid XML. With the HTML parser, Nokogiri behaves as you'd expect:

require 'nokogiri'

doc = Nokogiri::HTML(<<-EOS
<div>
   Hi <br>
   How are you? <br>
</div>
EOS
)

doc.xpath("//br").each{ |e| puts e }

prints

<br>
<br>

Mechanize, a web scraping library, uses Nokogiri for parsing, so it is quite appropriate for the task.

As far as I can remember from doing some HTML parsing last year, it'll view them as separate.

EDIT: My bad. I've just got someone to send me the code and retested it; we ended up dealing with some things, including <br>, separately.
