I'm trying to parse an HTML file using Hpricot and Ruby, but I'm having issues extracting "free floating" text which is not enclosed in tags like <p></p>.
require 'hpricot'
text = <<SOME_TEXT
<a href="http://www.somelink.com/foo/bar.html">Testing:</a><br />
line 1<br />
line 2<br />
line 3<br />
line 4<br />
line 5<br />
<b>Here's some more text</b>
SOME_TEXT
parsed = Hpricot(text)
parsed = parsed.search('//a[@href="http://www.somelink.com/foo/bar.html"]').first.following_siblings
puts parsed
I would expect the result to be
<br />
line 1<br />
line 2<br />
line 3<br />
line 4<br />
line 5<br />
<b>Here's some more text</b>
But I am getting
<br />
<br />
<br />
<br />
<br />
<br />
<b>Here's some more text</b>
How can I make Hpricot return line 1, line 2, etc?
Your first step is to read the following_siblings documentation:
Find sibling elements which follow the current one. Like the other “sibling” methods, this weeds out text and comment nodes.
Then look at the Hpricot source to see how following_siblings works, and adapt it into something that behaves the same way but doesn't filter out non-container nodes:
parsed = Hpricot(text)
link = parsed.search('//a[@href="http://www.somelink.com/foo/bar.html"]').first
link_sibs = link.parent.children
what_you_want = link_sibs[link_sibs.index(link) + 1 ... link_sibs.length]
puts what_you_want
That's pretty much following_siblings with parent.children instead of parent.containers. Having access to the source code of the libraries you use is pretty handy, and studying it is to be encouraged.
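If you need this in more than one place, the slicing can be pulled into a small helper. This is only a sketch: nodes_after is a made-up name, not part of Hpricot, and the demo uses a plain Array of strings as a stand-in for link.parent.children so that it runs without the gem.

```ruby
# Hypothetical helper (not an Hpricot method): return every entry of
# `children` that comes after `node`, text nodes included.
# `children` stands in for link.parent.children; any Array works.
def nodes_after(children, node)
  idx = children.index(node)
  return [] if idx.nil?   # node is not among the children
  children.drop(idx + 1)  # everything after the node itself
end

# Stand-in data: plain strings instead of real Hpricot nodes.
children = ["<a>Testing:</a>", "<br />", "\nline 1", "<br />", "\nline 2"]
p nodes_after(children, "<a>Testing:</a>")
```

With real Hpricot nodes the call would presumably be nodes_after(link.parent.children, link), which is the same thing the snippet above does inline.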
It's been a while since I've used Hpricot, but here are some things I remember that might help:
The quick way to get all the text:
irb(main):023:0> print parsed.inner_text
Testing:
line 1
line 2
line 3
line 4
line 5
Here's some more text
The downside is that you also get the text inside tags, such as the link text and the bold text.
Similarly, we can search for all 'text()' nodes:
irb(main):033:0> puts (parsed / 'text()')
Testing:
line 1
[...]
line 5
So, we can do this:
irb(main):036:0> puts (parsed / 'text()')[2 .. -3]
line 1
line 2
line 3
line 4
line 5
or:
irb(main):037:0> (parsed / 'text()')[2 .. -3]
=> #<Hpricot::Elements["\n line 1", " \n line 2", "\n line 3", "\n line 4", "\n line 5", "\n "]>
or:
irb(main):039:0> (parsed / 'text()')[2 .. -3].map{ |t| t.inner_text.strip }
=> ["line 1", "line 2", "line 3", "line 4", "line 5", ""]
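Note the empty string at the end of that last result; it comes from the whitespace-only text node. A small follow-up step can drop it. The sketch below works on plain strings copied from the inspect output above, so it runs without Hpricot, and clean_lines is a name I made up for illustration.

```ruby
# Hypothetical helper: strip each string and drop the ones that end up empty.
def clean_lines(strings)
  strings.map(&:strip).reject(&:empty?)
end

# Raw strings as shown in the Hpricot::Elements inspect output above.
raw = ["\n line 1", " \n line 2", "\n line 3", "\n line 4", "\n line 5", "\n "]
p clean_lines(raw)
# => ["line 1", "line 2", "line 3", "line 4", "line 5"]
```

In the irb session the same idea would be tacking .reject(&:empty?) onto the end of the map call.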
The main idea for grabbing data/text from a web page is to look for landmarks you can use to navigate through the page. Often we can grab text from inside a <div> or <p> tag. If a page doesn't give you landmarks, you have to use other tricks: looking for a series of text nodes followed by <br> nodes, maybe, or the five lines following an <a> tag with a certain href attribute. That's the fun and challenge of dealing with HTML.
In the back of my mind there's a nagging thought that there is a more elegant way to do this, but this seems to be working. Dig around on the Hpricot Challenge page for variations on the theme of digging out content.
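To make the "text nodes followed by <br> nodes" trick a bit more concrete, here is a rough sketch of the logic. It does not use Hpricot at all: Node is a made-up Struct standing in for parsed nodes, and text_before_breaks is a hypothetical name, so treat it as an outline of the idea rather than working scraper code.

```ruby
# Stand-in for a parsed node: a kind (:text, :br, :a, ...) and optional text.
# This is NOT an Hpricot class, just an assumption for the sketch.
Node = Struct.new(:kind, :text)

# Collect the text of every text node that is immediately followed by a <br>.
def text_before_breaks(nodes)
  nodes.each_cons(2)
       .select { |cur, nxt| cur.kind == :text && nxt.kind == :br }
       .map { |cur, _| cur.text.strip }
       .reject(&:empty?)
end

# Roughly the shape of the document from the question.
nodes = [
  Node.new(:a, "Testing:"), Node.new(:br),
  Node.new(:text, "\nline 1"), Node.new(:br),
  Node.new(:text, "\nline 2"), Node.new(:br),
  Node.new(:b, "Here's some more text")
]
p text_before_breaks(nodes)
# => ["line 1", "line 2"]
```

The link text and the bold text are skipped because they are not text nodes sitting directly in front of a <br>.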