Construct URLs after scraping for image paths_问答_开发者

Construct URLs after scraping for image paths

开发者 https://www.devze.com 2023-03-03 10:36 出处：网络

I\'m trying to scrape a web URL inputed by the user and then output an array of valid non-broken image elements with absolute paths in HTML. I\'m using Nokogiri for scraping and I want to know if ther

I'm trying to scrape a web URL inputed by the user and then output an array of valid non-broken image elements with absolute paths in HTML. I'm using Nokogiri for scraping and I want to know if there is anything I can use to easily process the unpredicatble URLs provided by user and image paths scraped short of figuring out how to write something from scratch.

Examples:

http://domain.com/ and /system/images/image.png
=> http://domain.com/system/images/image.png

http://sub.domain.com and images/common/image.png
=> http://sub.domain.com/images/common/image.png

http://domain.com/dir/ and images/image.png
=> http://domain.com/dir/images/image.png

http://domain.com/dir and /images/small/image.png
=> http://domain.com/images/small/image.png

http://domain.com and http://s3.amazon-aws.com/bucket/image.png
=>开发者_JS百科; http://s3.amazon-aws.com/bucket/image.png

Instead of downloading the pages and using Nokogiri, I would recommend using Mechanize. It is built on top of Nokogiri, so everything you can do with Nokogiri you can do with Mechanize, but it adds a lot of useful functionality for scraping/navigating. It will take care of the relative URL problem you describe above.

require 'rubygems'
require 'mechanize'
url='http://stackoverflow.com/questions/5903218/construct-urls-after-scraping-for-image-paths/5903417'
Mechanize.new.get(url) {|page| puts page.image_urls.join "\n"}

If you really want to do it yourself (instead of using Mechanize, say), use URI::join:

require 'uri'
URI::join("http://domain.com/dir", "/images/small/image.png")
  # => http://domain.com/images/small/image.png

Note that you have to respect the HTML page's BASE tag if there is one...

Construct URLs after scraping for image paths

精彩评论

关注公众号

热门标签

图文推荐

Construct URLs after scraping for image paths

更多 问答 相关资讯：

精彩评论

关注公众号

热门标签

图文推荐

更多问答相关资讯：