How do I use Nokogiri and Ruby to scrape values from HTML with nested tables?

Developer https://www.devze.com 2023-03-06 06:39 Source: web
I am trying to extract the name, ID, Phone, Email, Gender, Ethnicity, DOB, Class, Major, School and GPA from a page I am parsing with Nokogiri.

I have tried several different XPath expressions, but everything I try grabs much more than I want:

<span class="subTitle"><b>Recruit Profile</b></span>
<br><table border="0" width="100%"><tr>
<td>
      <table bgcolor="#afafaf" border="0" cellpadding="0" width="100%">
<tr>
<td>
      <table bgcolor="#cccccc" border="0" cellpadding="2" cellspacing="2" width="100%">
<tr>
<td bgcolor="#dddddd"><b>Name</b></td>
          <td bgcolor="#dddddd">Some Person</td>
        </tr>
<tr>
<td bgcolor="#dddddd"><b>EDU ID</b></td>
          <td bgcolor="#dddddd">A12345678</td>
        </tr>
<tr>
<td bgcolor="#dddddd"><b>Phone</b></td>
          <td bgcolor="#dddddd">123-456-7890</td>
        </tr>
<tr>
<td bgcolor="#dddddd"><b>Address</b></td>
          <td bgcolor="#dddddd">1234 Somewhere Dr.<br>City ST, 12345</td>
        </tr>
<tr>
<td bgcolor="#dddddd"><b>Email</b></td>
          <td bgcolor="#dddddd">someone@email.com</td>
        </tr>
<tr>
<td bgcolor="#dddddd"><b>Gender</b></td>
          <td bgcolor="#dddddd">Female</td>
        </tr>
<tr>
<td bgcolor="#dddddd"><b>Ethnicity</b></td>
          <td bgcolor="#dddddd">Unknown</td>
        </tr>
<tr>
<td bgcolor="#dddddd"><b>Date of Birth</b></td>
          <td bgcolor="#dddddd">Jan 1st, 1901</td>
        </tr>
<tr>
<td bgcolor="#dddddd"><b>Class</b></td>
          <td bgcolor="#dddddd">Sophomore</td>
        </tr>
<tr>
<td bgcolor="#dddddd"><b>Major</b></td>
          <td bgcolor="#dddddd">Biology</td>
        </tr>
<tr>
<td bgcolor="#dddddd"><b>School</b></td>
          <td bgcolor="#dddddd">University of Somewhere</td>
        </tr>
<tr>
<td bgcolor="#dddddd"><b>GPA</b></td>
          <td bgcolor="#dddddd">0.00</td>
        </tr>
<tr>
<td bgcolor="#dddddd" valign="top"><b>Availability</b></td>
          <td bgcolor="#dddddd">
      <table border="0" cellspacing="0" cellpadding="0">
<tr>


I assume that there will be many "Recruit Profile" spans that are followed by tables that wrap up all the details. The following method takes your entire HTML page, finds just those spans, and for each of them it finds the following table and then finds the fields you want anywhere below that table:

require 'nokogiri'

# Pass in or set the array of labels you want to use
# Returns an array of hashes mapping these labels to the values
def recruits_details(html, fields = ['Name', 'EDU ID', 'Phone', 'Email', 'Gender'])
  doc = Nokogiri::HTML(html)
  recruit_labels = doc.xpath('//span[b[text()="Recruit Profile"]]')
  recruit_labels.map do |recruit_label|
    recruit_table = recruit_label.at_xpath('following-sibling::table')
    Hash[ fields.map do |field_label|
      label_td = recruit_table.at_xpath(".//td[b[text()='#{field_label}']]")
      [field_label, label_td.at_xpath('following-sibling::td/text()').text ]
    end ]
  end
end

require 'pp'
pp recruits_details(html_string)
#=> [{"Name"=>"Some Person",
#=>   "EDU ID"=>"A12345678",
#=>   "Phone"=>"123-456-7890",
#=>   "Email"=>"someone@email.com",
#=>   "Gender"=>"Female"}]

An XPath expression like .//foo[bar[text()="jim"]] means:

  • Find a 'foo' element anywhere under the current node
  • ...but only if it has a 'bar' element as a child
  • ...but only if that 'bar' element has the text "jim" as its content

An XPath expression like following-sibling::... means: find any elements that are siblings of the current node, come after it, and match the expression "...".

The XPath expression .../text() selects the Text node; the text method is used to extract the value (actual string) of that text node.

Nokogiri's xpath method returns a NodeSet (an array-like collection) of all nodes matching the expression, while the at_xpath method returns only the first matching node, or nil if nothing matches.
