I am trying to extract the name, ID, Phone, Email, Gender, Ethnicity, DOB, Class, Major, School and GPA from a page I am parsing with Nokogiri.
I tried several different XPath expressions, but everything I try grabs much more than I want:
<span class="subTitle"><b>Recruit Profile</b></span>
<br><table border="0" width="100%"><tr>
<td>
<table bgcolor="#afafaf" border="0" cellpadding="0" width="100%">
<tr>
<td>
<table bgcolor="#cccccc" border="0" cellpadding="2" cellspacing="2" width="100%">
<tr>
<td bgcolor="#dddddd"><b>Name</b></td>
<td bgcolor="#dddddd">Some Person</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>EDU ID</b></td>
<td bgcolor="#dddddd">A12345678</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>Phone</b></td>
<td bgcolor="#dddddd">123-456-7890</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>Address</b></td>
<td bgcolor="#dddddd">1234 Somewhere Dr.<br>City ST, 12345</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>Email</b></td>
<td bgcolor="#dddddd">someone@email.com</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>Gender</b></td>
<td bgcolor="#dddddd">Female</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>Ethnicity</b></td>
<td bgcolor="#dddddd">Unknown</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>Date of Birth</b></td>
<td bgcolor="#dddddd">Jan 1st, 1901</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>Class</b></td>
<td bgcolor="#dddddd">Sophomore</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>Major</b></td>
<td bgcolor="#dddddd">Biology</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>School</b></td>
<td bgcolor="#dddddd">University of Somewhere</td>
</tr>
<tr>
<td bgcolor="#dddddd"><b>GPA</b></td>
<td bgcolor="#dddddd">0.00</td>
</tr>
<tr>
<td bgcolor="#dddddd" valign="top"><b>Availability</b></td>
<td bgcolor="#dddddd">
<table border="0" cellspacing="0" cellpadding="0">
<tr>
I assume that there will be many "Recruit Profile" spans that are followed by tables that wrap up all the details. The following method takes your entire HTML page, finds just those spans, and for each of them it finds the following table and then finds the fields you want anywhere below that table:
require 'nokogiri'
require 'pp'

# Pass in the array of labels you want; the defaults cover only the first few fields.
# Returns an array of hashes (one per profile) mapping these labels to the values.
def recruits_details(html, fields = ['Name', 'EDU ID', 'Phone', 'Email', 'Gender'])
  doc = Nokogiri::HTML(html)
  recruit_labels = doc.xpath('//span[b[text()="Recruit Profile"]]')
  recruit_labels.map do |recruit_label|
    recruit_table = recruit_label.at_xpath('following-sibling::table')
    Hash[fields.map do |field_label|
      # Yield nil (rather than raising NoMethodError) if the table or a label is missing
      label_td = recruit_table&.at_xpath(".//td[b[text()='#{field_label}']]")
      value = label_td&.at_xpath('following-sibling::td/text()')
      [field_label, value&.text]
    end]
  end
end

# html_string holds the source of the page being parsed
pp recruits_details(html_string)
#=> [{"Name"=>"Some Person",
#=> "EDU ID"=>"A12345678",
#=> "Phone"=>"123-456-7890",
#=> "Email"=>"someone@email.com",
#=> "Gender"=>"Female"}]
An XPath expression like .//foo[bar[text()="jim"]] means:
- Find a 'foo' element anywhere under the current node
- ...but only if it has a 'bar' element as a child
- ...but only if that 'bar' element has the text "jim" as its content
An XPath expression like following-sibling::... means: find any elements that are siblings after the current node and that match the expression "...".
The XPath expression .../text() selects the Text node; the text method is used to extract the value (the actual string) of that text node.
Nokogiri's xpath method returns a NodeSet (an array-like collection) of all nodes matching the expression, while the at_xpath method returns only the first matching node, or nil if nothing matches.