How do I extract text from a web page with tags using Hpricot?

Developer https://www.devze.com 2023-01-29 18:55 Source: web

I'm trying to parse an HTML file using Hpricot and Ruby, but I'm having trouble extracting "free-floating" text that is not enclosed in any tag.

require 'hpricot'

text = <<SOME_TEXT
  <a href="http://www.somelink.com/foo/bar.html">Testing:</a><br />
  line 1<br />  
  line 2<br />
  line 3<br />
  line 4<br />
  line 5<br />
  <b>Here's some more text</b>
SOME_TEXT

parsed = Hpricot(text)

parsed = parsed.search('//a[@href="http://www.somelink.com/foo/bar.html"]').first.following_siblings
puts parsed

I would expect the result to be

<br />
line 1<br />  
line 2<br />
line 3<br />
line 4<br />
line 5<br />
<b>Here's some more text</b>

But I am getting

<br />
<br />
<br />
<br />
<br />
<br />
<b>Here's some more text</b>

How can I make Hpricot return line 1, line 2, etc?

Your first step is to read the following_siblings documentation:

Find sibling elements which follow the current one. Like the other “sibling” methods, this weeds out text and comment nodes.

Then you should use the Hpricot source to generalize how following_siblings works to get something that works like following_siblings but doesn't filter out non-container nodes:

parsed        = Hpricot(text)
link          = parsed.search('//a[@href="http://www.somelink.com/foo/bar.html"]').first
link_sibs     = link.parent.children
what_you_want = link_sibs[link_sibs.index(link) + 1 ... link_sibs.length]

puts what_you_want

That's pretty much following_siblings with parent.children instead of parent.containers. Having access to the source code of the libraries you use is pretty handy and studying it is to be encouraged.
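If you need this in more than one place, the slicing can be wrapped in a small helper. The following is only a sketch: siblings_after is a hypothetical name, not part of Hpricot, and it assumes only that a node responds to parent and that parent.children is an ordered Array, as in the snippet above. FakeNode is a stand-in class so the example runs without Hpricot installed.

```ruby
# Hypothetical helper (not an Hpricot method): return every node, text nodes
# included, that follows `node` under the same parent.
def siblings_after(node)
  sibs = node.parent.children
  idx  = sibs.index { |n| n.equal?(node) } # locate by identity, not by ==
  idx ? sibs[(idx + 1)..-1] : []
end

# Stand-in node class so the sketch runs without a parser.
class FakeNode
  attr_reader :name, :children
  attr_accessor :parent

  def initialize(name, children = [])
    @name = name
    @children = children
    children.each { |c| c.parent = self }
  end
end

link = FakeNode.new('a')
root = FakeNode.new('root', [
  link,
  FakeNode.new('br'),
  FakeNode.new('text: line 1'),
  FakeNode.new('br')
])

following_names = siblings_after(link).map(&:name)
puts following_names.inspect
```

With a real document you would pass the link element found by search instead of the stand-in. Matching by identity avoids picking the wrong node when two siblings happen to compare as equal.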

It's been a while since I've used Hpricot, but here are some things I remember that might help:

The quick way to get all the text:

irb(main):023:0> print parsed.inner_text
  Testing:
  line 1  
  line 2
  line 3
  line 4
  line 5
  Here's some more text

The downside to that is you get the text embedded in tags too.

Similarly, we can search for all 'text()' nodes:

irb(main):033:0> puts (parsed / 'text()')

Testing:

  line 1

  [...]

  line 5

So, we can do this:

irb(main):036:0> puts (parsed / 'text()')[2 .. -3]

  line 1

  line 2

  line 3

  line 4

  line 5

or:

irb(main):037:0> (parsed / 'text()')[2 .. -3]
=> #<Hpricot::Elements["\n  line 1", "  \n  line 2", "\n  line 3", "\n  line 4", "\n  line 5", "\n  "]>

or:

irb(main):039:0> (parsed / 'text()')[2 .. -3].map{ |t| t.inner_text.strip }
=> ["line 1", "line 2", "line 3", "line 4", "line 5", ""]
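Note the empty string at the end of that last result: it comes from a whitespace-only text node. A minimal sketch of cleaning that up in plain Ruby, using the raw strings copied from the irb output above rather than a live Hpricot document:

```ruby
# Raw text-node contents, copied from the irb output above.
raw = ["\n  line 1", "  \n  line 2", "\n  line 3", "\n  line 4", "\n  line 5", "\n  "]

# Strip surrounding whitespace, then drop entries that were whitespace only.
lines = raw.map(&:strip).reject(&:empty?)

puts lines.inspect
```

The same reject step could be chained onto the map call in the irb session, assuming inner_text returns ordinary strings as the output above suggests.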

The main idea for grabbing data/text from a web page is to look for landmarks you can use to navigate through the page. Often we can grab text from inside a <div> or another container tag. If a page doesn't give you landmarks, you have to use other tricks: looking for a series of text nodes followed by <br /> nodes, maybe, or the five lines following an <a> tag with a certain href attribute. That's the fun and challenge of dealing with HTML.
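To make the landmark idea concrete, here is a rough sketch in plain Ruby. It does not use Hpricot at all: the nodes array is a made-up stand-in for the children of one parent in document order, and the [kind, value] pairs are an assumption for illustration only. The logic is to find the landmark, walk forward while the run of text and <br /> nodes continues, and keep the text.

```ruby
# Stand-in for the children of one parent node, in document order.
# With a real parser these would be node objects, not [kind, value] pairs.
nodes = [
  [:elem, 'a'], [:elem, 'br'],
  [:text, 'line 1'], [:elem, 'br'],
  [:text, 'line 2'], [:elem, 'br'],
  [:elem, 'b']
]

# 1. Find the landmark (here simply the first <a> element).
start = nodes.index { |kind, value| kind == :elem && value == 'a' }

# 2. Walk forward while the run of text / <br> nodes continues.
run = nodes[(start + 1)..-1].take_while do |kind, value|
  kind == :text || value == 'br'
end

# 3. Keep only the text.
landmark_lines = run.select { |kind, _| kind == :text }.map { |_, value| value }

puts landmark_lines.inspect
```

The walk stops at the <b> element, so anything after the run is left out. A real version would also need to handle the case where the landmark is not found and start is nil.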

In the back of my mind there's a nagging thought that there is a more elegant way to do this, but this seems to be working. Dig around on the Hpricot Challenge page for variations on themes on digging out content.