Construct URLs after scraping for image paths

Developer https://www.devze.com 2023-03-03 10:36 Source: Internet
I'm trying to scrape a web URL entered by the user and then output an array of valid, non-broken image elements with absolute paths in HTML. I'm using Nokogiri for scraping, and I want to know whether there is anything I can use to easily combine the unpredictable URLs provided by the user with the scraped image paths, short of writing something from scratch.

Examples:

http://domain.com/ and /system/images/image.png
=> http://domain.com/system/images/image.png

http://sub.domain.com and images/common/image.png
=> http://sub.domain.com/images/common/image.png

http://domain.com/dir/ and images/image.png
=> http://domain.com/dir/images/image.png

http://domain.com/dir and /images/small/image.png
=> http://domain.com/images/small/image.png

http://domain.com and http://s3.amazon-aws.com/bucket/image.png
=> http://s3.amazon-aws.com/bucket/image.png


Instead of downloading the pages and using Nokogiri, I would recommend using Mechanize. It is built on top of Nokogiri, so everything you can do with Nokogiri you can do with Mechanize, but it adds a lot of useful functionality for scraping/navigating. It will take care of the relative URL problem you describe above.

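require 'mechanize'  # `require 'rubygems'` is only needed on Ruby 1.8
url = 'http://stackoverflow.com/questions/5903218/construct-urls-after-scraping-for-image-paths/5903417'
# image_urls returns the URLs of the page's images as absolute URIs
Mechanize.new.get(url) { |page| puts page.image_urls.join("\n") }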


If you really want to do it yourself (instead of using Mechanize, say), use URI::join:

require 'uri'
# URI::join returns a URI object; call to_s to get the string form
URI::join("http://domain.com/dir", "/images/small/image.png").to_s
  # => "http://domain.com/images/small/image.png"
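
As a quick check, the five pairs from the question can be run through URI.join. This is only a sketch: the names `cases`, `resolved` and `no_slash` are illustrative, and the last lines show a case the question does not list, where a base URL without a trailing slash loses its final path segment.

```ruby
require 'uri'

# [user-supplied URL, scraped image path] pairs taken from the question
cases = [
  ['http://domain.com/',    '/system/images/image.png'],
  ['http://sub.domain.com', 'images/common/image.png'],
  ['http://domain.com/dir/', 'images/image.png'],
  ['http://domain.com/dir',  '/images/small/image.png'],
  ['http://domain.com',      'http://s3.amazon-aws.com/bucket/image.png']
]

# Resolve each path against its base URL and keep the string form
resolved = cases.map { |base, path| URI.join(base, path).to_s }
resolved.each { |url| puts url }

# Caveat: without a trailing slash, "dir" is treated as a file name,
# so a relative path replaces it instead of being appended to it.
no_slash = URI.join('http://domain.com/dir', 'images/image.png').to_s
puts no_slash
```

If the user may type a directory URL without the trailing slash, that last behaviour is probably the main surprise to watch for.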

Note that if the HTML page has a BASE tag, you have to respect it: relative paths must then be resolved against its href rather than against the page's own URL.
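
One way to handle the BASE tag is sketched below, using only the standard library. The helper name `absolute_image_urls` and its parameters are hypothetical, not part of Nokogiri or URI; the comments show how the inputs might be pulled out of a Nokogiri document. Paths that URI cannot parse are simply skipped here, which is an assumption about what "valid" should mean.

```ruby
require 'uri'

# Hypothetical helper (not a library API).
#   page_url  - the URL the user entered
#   srcs      - raw src values, e.g. doc.css('img').map { |img| img['src'] }
#   base_href - href of the <base> tag if present, e.g.
#               (node = doc.at_css('base[href]')) && node['href']
def absolute_image_urls(page_url, srcs, base_href: nil)
  # The <base> href may itself be relative, so resolve it against the page URL
  base = base_href ? URI.join(page_url, base_href).to_s : page_url

  srcs.compact.map do |src|
    begin
      URI.join(base, src).to_s
    rescue URI::Error
      nil # unparsable path (e.g. contains a raw space): drop it
    end
  end.compact
end

srcs = ['a.png', '/b.png', 'http://cdn.example.com/c.png', 'bad path.png', nil]

with_base    = absolute_image_urls('http://domain.com/dir/page.html', srcs, base_href: '/assets/')
without_base = absolute_image_urls('http://domain.com/dir/page.html', srcs)

puts with_base
puts without_base
```

This only builds the absolute URLs; checking that an image is not broken would still need a request per URL.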
