How do I parse a plain HTML table with Nokogiri?

I'd like to parse an HTML page with Nokogiri. Part of the page contains a table that has no ID of its own. Is it possible to extract something like:

Today,3,455,34
Today,1,1300,3664
Today,10,100000,3444
Yesterday,3454,5656,3
Yesterday,3545,1000,10
Yesterday,3411,36223,15

From this HTML:

<div id="__DailyStat__">
  <table>
    <tr class="blh"><th colspan="3">Today</th><th class="r" colspan="3">Yesterday</th></tr>
    <tr class="blh"><th>Qnty</th><th>Size</th><th>Length</th><th class="r">Length</th><th class="r">Size</th><th class="r">Qnty</th></tr>
    <tr class="blr">
      <td>3</td>
      <td>455</td>
      <td>34</td>
      <td class="r">3454</td>
      <td class="r">5656</td>
      <td class="r">3</td>
    </tr>

    <tr class="bla">
      <td>1</td>
      <td>1300</td>
      <td>3664</td>
      <td class="r">3545</td>
      <td class="r">1000</td>
      <td class="r">10</td>
    </tr>

    <tr class="blr">
      <td>10</td>
      <td>100000</td>
      <td>3444</td>
      <td class="r">3411</td>
      <td class="r">36223</td>
      <td class="r">15</td>
    </tr>
  </table>
</div>

As a quick and dirty first pass I'd do:

html = <<EOT
<div id="__DailyStat__">
  <table>
    <tr class="blh"><th colspan="3">Today</th><th class="r" colspan="3">Yesterday</th></tr>
    <tr class="blh"><th>Qnty</th><th>Size</th><th>Length</th><th class="r">Length</th><th class="r">Size</th><th class="r">Qnty</th></tr>
    <tr class="blr">
      <td>3</td>
      <td>455</td>
      <td>34</td>
      <td class="r">3454</td>
      <td class="r">5656</td>
      <td class="r">3</td>
    </tr>

    <tr class="bla">
      <td>1</td>
      <td>1300</td>
      <td>3664</td>
      <td class="r">3545</td>
      <td class="r">1000</td>
      <td class="r">10</td>
    </tr>

    <tr class="blr">
      <td>10</td>
      <td>100000</td>
      <td>3444</td>
      <td class="r">3411</td>
      <td class="r">36223</td>
      <td class="r">15</td>
    </tr>
  </table>
</div>
EOT

#    Today              Yesterday
#    Qnty Size   Length Length Size  Qnty
#    3    455    34     3454   5656  3
#    1    1300   3664   3545   1000  10
#    10   100000 3444   3411   36223 15


require 'nokogiri'

doc = Nokogiri::HTML(html)

Use CSS to find the start of the table, and define some places to hold the data we're capturing:

table = doc.at('div#__DailyStat__ table')

today_data     = []
yesterday_data = []

Loop over the rows in the table, rejecting the headers:

table.search('tr').each do |tr|

  next if (tr['class'] == 'blh')

Initialize arrays to capture the pertinent data from each row, then push each cell's value into the appropriate array:

  today_td_data     = [ 'Today'     ]
  yesterday_td_data = [ 'Yesterday' ]

  tr.search('td').each do |td|
    if (td['class'] == 'r')
      yesterday_td_data << td.text.to_i
    else
      today_td_data << td.text.to_i
    end
  end

  today_data     << today_td_data
  yesterday_data << yesterday_td_data

end

And output the data:

puts today_data.map{ |a| a.join(',') }
puts yesterday_data.map{ |a| a.join(',') }

> Today,3,455,34
> Today,1,1300,3664
> Today,10,100000,3444
> Yesterday,3454,5656,3
> Yesterday,3545,1000,10
> Yesterday,3411,36223,15

Just to help you visualize what's going on: at the exit from the "tr" loop, today_data and yesterday_data are arrays-of-arrays. today_data, for example, looks like:

[["Today", 3, 455, 34], ["Today", 1, 1300, 3664], ["Today", 10, 100000, 3444]]
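To make that shape easier to work with, here is a minimal sketch that pairs each value with a column name. The column order is read off the table's second header row (Qnty, Size, Length for today; Length, Size, Qnty for yesterday). The names today_keys, yesterday_keys and to_records are made up for this illustration and are not part of the code above.

```ruby
# Literal copies of the two arrays as they look after the "tr" loop.
today_data     = [["Today", 3, 455, 34], ["Today", 1, 1300, 3664], ["Today", 10, 100000, 3444]]
yesterday_data = [["Yesterday", 3454, 5656, 3], ["Yesterday", 3545, 1000, 10], ["Yesterday", 3411, 36223, 15]]

# Column names in the order the header row lists them (hypothetical names).
# Note that the "Yesterday" half of the table is mirrored.
today_keys     = %i[day qnty size length]
yesterday_keys = %i[day length size qnty]

# Pair every value in a row with its column name and build a hash per row.
to_records = ->(rows, keys) { rows.map { |row| keys.zip(row).to_h } }

today_records     = to_records.call(today_data, today_keys)
yesterday_records = to_records.call(yesterday_data, yesterday_keys)

p today_records.first
p yesterday_records.first
```

With hashes, today_records.first[:size] and yesterday_records.first[:size] both refer to the Size column even though it sits at a different position in each half of the row.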

Alternatively, instead of looping over the "td" tags and checking each tag's class, I could have grabbed the text of the "tr", used scan to pull out the numbers, and sliced the resulting array into "today" and "yesterday" arrays:

  tr_data = tr.text.scan(/\d+/).map{ |i| i.to_i }

  today_td_data     = [ 'Today',     *tr_data[0, 3] ]
  yesterday_td_data = [ 'Yesterday', *tr_data[3, 3] ]

In real-world code, such as at work, I'd use that instead of what I first wrote because it's more succinct.

And notice that I didn't use XPath. It's very doable to accomplish this with XPath in Nokogiri, but for simplicity I prefer CSS accessors. XPath would have allowed accessing the contents of individual "td" tags, but it also begins to look like line noise, which we want to avoid when writing code because it hurts maintainability. I could also have used CSS to drill down to the correct "td" tags with a selector like 'tr td.r', but I don't think that would improve the code; it would just be an alternate way of doing it.