
How can I use Ruby's scan method to parse an HTML table?

Developer https://www.devze.com 2023-02-12 16:29 Source: Internet
I am trying to take an HTML table and make an array of arrays, each array being a row, and each element in the array being one cell. Assuming I can break the whole table into its rows, I want to split each row up by the <td> tags. I have the following:

def get_cells(one_row)
  cells = one_row.scan(/<td>.+?<\/td>/)
  for c in cells
    puts c
  end
end

This is the HTML I am acting on, as a string called one_row:

<tr>
<td>1990</td>
<td>1991</td>
<td><a href="/wiki/Gulf_War">Gulf War</a></td>
<td><span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/a/aa/Flag_of_Kuwait.svg/22px-Flag_of_Kuwait.svg.png" width="22" height="11" class="thumbborder" />&#160;</span><a href="/wiki/Kuwait">Kuwait</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/a/a4/Flag_of_the_United_States.svg/22px-Flag_of_the_United_States.svg.png" width="22" height="12" class="thumbborder" />&#160;</span><a href="/wiki/United_States">United States</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Flag_of_Saudi_Arabia.svg/22px-Flag_of_Saudi_Arabia.svg.png" width="22" height="15" class="thumbborder" />&#160;</span><a href="/wiki/Saudi_Arabia">Saudi Arabia</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/a/ae/Flag_of_the_United_Kingdom.svg/22px-Flag_of_the_United_Kingdom.svg.png" width="22" height="11" class="thumbborder" />&#160;</span><a href="/wiki/United_Kingdom">United Kingdom</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/f/fe/Flag_of_Egypt.svg/22px-Flag_of_Egypt.svg.png" width="22" height="15" class="thumbborder" />&#160;</span><a href="/wiki/Egypt">Egypt</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Flag_of_France.svg/22px-Flag_of_France.svg.png" width="22" height="15" class="thumbborder" />&#160;</span><a href="/wiki/France">France</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/5/53/Flag_of_Syria.svg/22px-Flag_of_Syria.svg.png" width="22" height="15" class="thumbborder" />&#160;</span><a href="/wiki/Syria">Syria</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/2/2c/Flag_of_Morocco.svg/22px-Flag_of_Morocco.svg.png" width="22" height="15" class="thumbborder" />&#160;</span><a href="/wiki/Morocco">Morocco</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Flag_of_Oman.svg/22px-Flag_of_Oman.svg.png" width="22" height="11" class="thumbborder" />&#160;</span><a href="/wiki/Oman">Oman</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/3/32/Flag_of_Pakistan.svg/22px-Flag_of_Pakistan.svg.png" width="22" height="15" class="thumbborder" />&#160;</span><a href="/wiki/Pakistan">Pakistan</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/c/cf/Flag_of_Canada.svg/22px-Flag_of_Canada.svg.png" width="22" height="11" class="thumbborder" />&#160;</span><a href="/wiki/Canada">Canada</a><br />
<a href="/wiki/Coalition_of_Gulf_War" title="Coalition of Gulf War" class="mw-redirect">Other Coalition Forces</a></td>
<td><span class="flagicon"><a href="/wiki/Iraq" title="Iraq"><img alt="Iraq" src="http://upload.wikimedia.org/wikipedia/commons/thumb/0/04/Flag_of_Iraq_%281963-1991%29.svg/22px-Flag_of_Iraq_%281963-1991%29.svg.png" width="22" height="15" class="thumbborder" /></a></span> <a href="/wiki/Baathist_Iraq" title="Baathist Iraq">Iraq</a></td>
</tr>

However, when I call get_cells on this, it doesn't return an array with five elements. It returns an array with four elements:

<td>1990</td>
<td>1991</td>
<td><a href="/wiki/Gulf_War">Gulf War</a></td>
<td><span class="flagicon"><a href="/wiki/Iraq" title="Iraq"><img alt="Iraq" src="http://upload.wikimedia.org/wikipedia/commons/thumb/0/04/Flag_of_Iraq_%281963-1991%29.svg/22px-Flag_of_Iraq_%281963-1991%29.svg.png" width="22" height="15" class="thumbborder" /></a></span> <a href="/wiki/Baathist_Iraq" title="Baathist Iraq">Iraq</a></td>

It seems to be skipping what should be the fourth cell. That cell contains numerous elements, all separated by line breaks. Could that be what's messing this up? Any suggestions on how to approach this?


HTML is beyond the capability of regular expressions to parse reliably, and it's pretty much never worth your time, even in simple cases. If you need to parse HTML, just use an HTML parser like Hpricot or Nokogiri. For example, Nokogiri(text).css('td').count gives 5, and Nokogiri(text).css('td').map(&:text) gives ["1990", "1991", "Gulf War", " Kuwait  United States  Saudi Arabia  United Kingdom  Egypt  France  Syria  Morocco  Oman  Pakistan  Canada Other Coalition Forces", " Iraq"].
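To give a feel for how fragile the regex approach is, here is a minimal sketch using two made-up rows (they are hypothetical and not taken from the table in the question). Both rows have two cells, yet the pattern finds only one match in each:

```ruby
# The same kind of pattern as in the question, with /m added.
cell_pattern = /<td>.+?<\/td>/m

# Hypothetical rows, invented for illustration only.
row_with_attribute  = '<td class="year">1990</td><td>1991</td>'
row_with_empty_cell = '<td></td><td>1991</td>'

# The first cell carries an attribute, so the literal "<td>" never matches it.
cells_a = row_with_attribute.scan(cell_pattern)
puts cells_a.inspect

# ".+?" needs at least one character, so the empty cell runs on into its neighbour.
cells_b = row_with_empty_cell.scan(cell_pattern)
puts cells_b.inspect
```

Each of these could be patched with a more elaborate pattern, but every patch tends to uncover another case, which is why a parser is usually the less painful route.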


Yes, it's the line breaks. The . (dot) metacharacter doesn't match them by default, but you can change that by adding the /m ("multiline") modifier:

/<td>.+?<\/td>/m
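As a small sketch of the difference, assuming a shortened stand-in for one_row (the string below is made up, with the second cell spanning two lines like the fourth cell in the question):

```ruby
# A shortened, hypothetical row; the second cell contains a line break.
one_row = "<tr>\n<td>1990</td>\n<td>Kuwait<br />\nUnited States</td>\n</tr>"

# Without /m the dot stops at the newline, so the two-line cell is skipped.
without_m = one_row.scan(/<td>.+?<\/td>/)
puts without_m.length

# With /m the dot also matches newlines, so both cells are found.
with_m = one_row.scan(/<td>.+?<\/td>/m)
puts with_m.length
```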

FYI, most other regex flavors (Perl, Python, .NET, etc.) call this "single-line" or "dot-matches-all" mode, and use /s for it. They use the /m modifier to change the meaning of the ^ and $ anchors, allowing them to match at line boundaries and not just at the beginning and end of the text. In Ruby, ^ and $ always work that way, so no separate mode is needed.
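A quick sketch of the anchor behaviour in Ruby, using a made-up two-line string. \A and \z are the anchors to reach for when you really mean the start and end of the whole string:

```ruby
# Hypothetical sample text with one embedded newline.
text = "first line\nsecond line"

# ^ matches at the start of every line, with no modifier needed.
line_starts = text.scan(/^\w+/)
puts line_starts.inspect

# \A matches only at the very start of the string.
string_start = text.scan(/\A\w+/)
puts string_start.inspect

# $ matches before the first newline; \z matches only at the very end.
first_dollar = text =~ /line$/
first_z      = text =~ /line\z/
puts first_dollar
puts first_z
```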


When dealing with XML or HTML, a parser is a much better way to go for anything but the most trivial jobs.

Nokogiri is my parser of choice. It supports both XPath expressions and CSS selectors. CSS usually results in a simpler search, and is more familiar to people who write CSS. XPath is more expressive and can do some pretty amazing searches inside the parser (libxml2 in Nokogiri's case) that can replace a lot of Ruby code.

Here's how I'd go after your data:

html = <<EOT
<tr>
<td>1990</td>
<td>1991</td>
<td><a href="/wiki/Gulf_War">Gulf War</a></td>
<td><span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/a/aa/Flag_of_Kuwait.svg/22px-Flag_of_Kuwait.svg.png" width="22" height="11" class="thumbborder" />&#160;</span><a href="/wiki/Kuwait">Kuwait</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/a/a4/Flag_of_the_United_States.svg/22px-Flag_of_the_United_States.svg.png" width="22" height="12" class="thumbborder" />&#160;</span><a href="/wiki/United_States">United States</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Flag_of_Saudi_Arabia.svg/22px-Flag_of_Saudi_Arabia.svg.png" width="22" height="15" class="thumbborder" />&#160;</span><a href="/wiki/Saudi_Arabia">Saudi Arabia</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/a/ae/Flag_of_the_United_Kingdom.svg/22px-Flag_of_the_United_Kingdom.svg.png" width="22" height="11" class="thumbborder" />&#160;</span><a href="/wiki/United_Kingdom">United Kingdom</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/f/fe/Flag_of_Egypt.svg/22px-Flag_of_Egypt.svg.png" width="22" height="15" class="thumbborder" />&#160;</span><a href="/wiki/Egypt">Egypt</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Flag_of_France.svg/22px-Flag_of_France.svg.png" width="22" height="15" class="thumbborder" />&#160;</span><a href="/wiki/France">France</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/5/53/Flag_of_Syria.svg/22px-Flag_of_Syria.svg.png" width="22" height="15" class="thumbborder" />&#160;</span><a href="/wiki/Syria">Syria</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/2/2c/Flag_of_Morocco.svg/22px-Flag_of_Morocco.svg.png" width="22" height="15" class="thumbborder" />&#160;</span><a href="/wiki/Morocco">Morocco</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Flag_of_Oman.svg/22px-Flag_of_Oman.svg.png" width="22" height="11" class="thumbborder" />&#160;</span><a href="/wiki/Oman">Oman</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/3/32/Flag_of_Pakistan.svg/22px-Flag_of_Pakistan.svg.png" width="22" height="15" class="thumbborder" />&#160;</span><a href="/wiki/Pakistan">Pakistan</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/c/cf/Flag_of_Canada.svg/22px-Flag_of_Canada.svg.png" width="22" height="11" class="thumbborder" />&#160;</span><a href="/wiki/Canada">Canada</a><br />
<a href="/wiki/Coalition_of_Gulf_War" title="Coalition of Gulf War" class="mw-redirect">Other Coalition Forces</a></td>
<td><span class="flagicon"><a href="/wiki/Iraq" title="Iraq"><img alt="Iraq" src="http://upload.wikimedia.org/wikipedia/commons/thumb/0/04/Flag_of_Iraq_%281963-1991%29.svg/22px-Flag_of_Iraq_%281963-1991%29.svg.png" width="22" height="15" class="thumbborder" /></a></span> <a href="/wiki/Baathist_Iraq" title="Baathist Iraq">Iraq</a></td>
</tr>
EOT

require 'nokogiri'
require 'pp'

doc = Nokogiri::HTML(html)

# block form, works on any Ruby version
data = doc.css('tr').map { |tr| tr.css('td').map { |td| td.text } }

# shorter form using Symbol#to_proc (Ruby 1.8.7+)
data = doc.css('tr').map { |tr| tr.css('td').map(&:text) }

# or using XPath
data = doc.search('//tr').map { |tr| tr.search('td').map { |td| td.text } }

pp data
# >> [["1990",
# >>   "1991",
# >>   "Gulf War",
# >>   " Kuwait United States Saudi Arabia United Kingdom Egypt France Syria Morocco Oman Pakistan CanadaOther Coalition Forces",
# >>   " Iraq"]]


I'd try Nokogiri and SelectorGadget. There is a good video showing how to do it at http://railscasts.com
