I'm writing a simple web crawler in Ruby and I need to fetch all the href values on a page. What is the best way to do this, or any other kind of web page source parsing? Some pages might not be valid markup, but I still want to be able to parse them.
Are there any good Ruby HTML parsers that allow validity-agnostic parsing, or is the best way just to do it by hand with regexps?
Is it possible to use XPath on a non-XHTML page?
Have a look at Nokogiri. Short example:
require 'open-uri'
require 'nokogiri'

# Nokogiri parses invalid markup without complaint, so XPath works on non-XHTML pages too.
doc = Nokogiri::HTML(URI.open('http://www.google.com/search?q=tenderlove'))
doc.search('//*[@href]').each { |node| p node[:href] }
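Nokogiri also accepts CSS selectors, so the same query can be written as:

doc.css('*[href]').each { |node| p node[:href] }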
Take a look at Mechanize. Its Page objects have a links method that returns every link on the page, so you don't have to write the XPath yourself.
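A minimal sketch, reusing the example URL from the Nokogiri answer above:

require 'mechanize'

agent = Mechanize.new
page = agent.get('http://www.google.com/search?q=tenderlove')

# page.links returns Mechanize::Page::Link objects; #href gives the raw URL.
page.links.each { |link| p link.href }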