I'm trying to scrape a web URL entered by the user and then output an array of valid, non-broken image elements with absolute paths in HTML. I'm using Nokogiri for scraping, and I want to know if there is anything I can use to easily process the unpredictable user-provided URLs and the scraped image paths, short of writing something from scratch.
Examples:
http://domain.com/ and /system/images/image.png
=> http://domain.com/system/images/image.png
http://sub.domain.com and images/common/image.png
=> http://sub.domain.com/images/common/image.png
http://domain.com/dir/ and images/image.png
=> http://domain.com/dir/images/image.png
http://domain.com/dir and /images/small/image.png
=> http://domain.com/images/small/image.png
http://domain.com and http://s3.amazon-aws.com/bucket/image.png
=> http://s3.amazon-aws.com/bucket/image.png
Instead of downloading the pages and using Nokogiri, I would recommend using Mechanize. It is built on top of Nokogiri, so everything you can do with Nokogiri you can do with Mechanize, but it adds a lot of useful functionality for scraping/navigating. It will take care of the relative URL problem you describe above.
require 'rubygems' # only needed on Ruby 1.8
require 'mechanize'
url = 'http://stackoverflow.com/questions/5903218/construct-urls-after-scraping-for-image-paths/5903417'
# Fetch the page and print one image URL per line
Mechanize.new.get(url) { |page| puts page.image_urls.join("\n") }
If you really want to do it yourself (instead of using Mechanize, say), use URI::join:
require 'uri'
URI::join("http://domain.com/dir", "/images/small/image.png")
# => http://domain.com/images/small/image.png
Note that you have to respect the HTML page's BASE tag if there is one...
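To make the URI::join approach and the BASE caveat concrete, here is a minimal sketch using only the standard library. The helper name absolute_image_url and its base_href parameter are made up for illustration; it assumes you have already pulled the image's src and the optional BASE href out of the page (with Nokogiri, for example). Returning nil for unparseable input is just one possible way to deal with unpredictable URLs, not the only sensible one.

```ruby
require 'uri'

# Hypothetical helper (not part of any library): resolve a scraped image
# src against the page URL, honoring an optional <base href="..."> value.
# Returns the absolute URL as a String, or nil if the input cannot be parsed.
def absolute_image_url(page_url, src, base_href = nil)
  # A BASE href may itself be relative, so resolve it against the page URL first.
  base = base_href ? URI.join(page_url, base_href).to_s : page_url
  URI.join(base, src).to_s
rescue URI::Error
  # Covers URI::InvalidURIError (e.g. unescaped spaces) and URI::BadURIError.
  nil
end

puts absolute_image_url('http://domain.com/dir', '/images/small/image.png')
puts absolute_image_url('http://domain.com/dir/', 'images/image.png')
puts absolute_image_url('http://domain.com/page.html', 'image.png', '/assets/')
```

The three calls print http://domain.com/images/small/image.png, http://domain.com/dir/images/image.png and http://domain.com/assets/image.png. A src that is already absolute, like the S3 example in the question, comes back unchanged.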