Parsing html with rails and nokogiri

Developer https://www.devze.com 2023-04-04 09:22 Source: Internet
I need to parse HTML using Rails and Nokogiri. Here is the HTML:

<body>
  <div id="mama">
    <div class="test1">text</div>
    <div class="test2">text2</div>
  </div>
  <div id="mama">
    <div class="test1">text</div>
    <div class="test2">text2</div>
  </div>
  <div id="mama">
    <div class="test1">text</div>
    <div class="test2">text2</div>
  </div>
</body>

How should I write the loop? I've tried many times but I keep getting errors or bad results...

doc.xpath('//div[@id='mama']/?or what?').each do |node|
  parse_file.puts text1 
  parse_file.puts text2
  parse_file.puts text1 
  parse_file.puts \n
end

The result should look like this:

text from first mama
text2 from first mama
text from first mama

text from second mama
and so on...


First, note that the HTML you posted is invalid: an id value must be unique within a document, so more than one element may not have id="mama". If you have control over your HTML, you should fix this problem (for example, by using a class instead).

Using that same (invalid) HTML, however, Nokogiri still has no trouble:

require 'nokogiri'
# my_html is assumed to be a String containing the markup above
doc = Nokogiri::HTML(my_html)

doc.css('#mama').each_with_index do |div,i|
  puts "#{div.at_css('.test1').text} from mama ##{i}"
  puts "#{div.at_css('.test2').text} from mama ##{i}"
end

#=> text from mama #0
#=> text2 from mama #0
#=> text from mama #1
#=> text2 from mama #1
#=> text from mama #2
#=> text2 from mama #2

If you want to use XPath directly (Nokogiri translates CSS selectors to XPath behind the scenes), you would do this:

doc.xpath("//div[@id='mama']").each_with_index do |div,i|
  puts "#{div.at_xpath("./*[@class='test1']").text} from mama ##{i}"
  puts "#{div.at_xpath("./*[@class='test2']").text} from mama ##{i}"
end


For one thing, your quotes are off: the inner single quotes around mama end the single-quoted Ruby string early. It should be:

doc.xpath('//div[@id="mama"]/?or what?')
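
As a small illustration of the quoting point, here are three ways to write the same XPath as a Ruby string literal. The variable names are made up for this sketch. XPath accepts either quote style around a string value, so any of these can be passed to doc.xpath.

```ruby
# Double quotes inside a single-quoted Ruby string.
double_inside_single = '//div[@id="mama"]'

# Single quotes inside a double-quoted Ruby string.
single_inside_double = "//div[@id='mama']"

# %q(...) behaves like single quotes but needs no escaping of ' or ".
percent_literal = %q(//div[@id='mama'])

# By contrast, '//div[@id='mama']' is read by Ruby as the string
# '//div[@id=' followed by the bare word mama, so it does not parse
# as one string.
puts double_inside_single
puts single_inside_double
puts percent_literal
```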