Getting all the domains a page depends on using Nokogiri

Developer https://www.devze.com 2023-03-24 00:25 Source: web
I'm trying to get all of the domains / IP addresses that a particular page depends on using Nokogiri. It can't be perfect because JavaScript can load dependencies dynamically, but I'm happy with a best effort at getting:

  • Image URLs <img src="..."
  • JavaScript URLs <script src="..."
  • CSS files and any url(...) references inside them
  • Frames and IFrames

I'd also want to follow any CSS imports.

Any suggestions / help would be appreciated. The project is already using Anemone.

Here's what I have at the moment.

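require 'anemone'

# `site` and `process_dependency` are defined elsewhere in the project.
Anemone.crawl(site, :depth_limit => 1) do |anemone|
  anemone.on_every_page do |page|
    # Match only elements that actually carry the attribute, so inline
    # <script> blocks and similar do not pass nil to process_dependency.
    page.doc.xpath('//img[@src]').each do |link|
      process_dependency(page, link[:src])
    end
    page.doc.xpath('//script[@src]').each do |link|
      process_dependency(page, link[:src])
    end
    page.doc.xpath('//link[@href]').each do |link|
      process_dependency(page, link[:href])
    end
    puts page.url
  end
end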

Code would be great, but I'm really just after pointers. For example, I have now discovered that I should use a CSS parser like css_parser to parse out any CSS to find imports and URLs to images.
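
As a rough illustration of the CSS step, here is a minimal sketch that uses only the standard library and plain regular expressions instead of css_parser. The helper name css_dependencies, both regex constants, and the example.com URLs are made up for this example. A regex will miss edge cases (comments, escaped characters) that a real CSS parser would handle.

```ruby
require 'uri'

# Hypothetical patterns, not part of css_parser:
# url(...) with optional quotes, and @import "file.css" without url().
CSS_URL_RE    = /url\(\s*['"]?([^'")]+?)['"]?\s*\)/i
CSS_IMPORT_RE = /@import\s+['"]([^'"]+)['"]/i

# Returns absolute URLs referenced by a stylesheet, resolved against the
# URL the stylesheet itself was loaded from.
# Note: URI.join raises on references it cannot parse (e.g. raw spaces).
def css_dependencies(css, base_url)
  refs = css.scan(CSS_URL_RE).flatten + css.scan(CSS_IMPORT_RE).flatten
  refs.reject { |ref| ref.start_with?('data:') } # inline data, no host involved
      .map { |ref| URI.join(base_url, ref).to_s }
      .uniq
end

css = <<~STYLES
  @import "reset.css";
  @import url(//cdn.example.net/fonts.css);
  body { background: url('/img/bg.png'); }
STYLES

puts css_dependencies(css, 'http://example.com/styles/main.css')
```

Any URL that came from an @import would then need to be fetched and run through the same helper to follow nested imports.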


Get the content of the page, then you can extract an array of URIs from the page with

require 'uri'    
URI.extract(page)
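
A short sketch of how that call might be used, with a few caveats. URI.extract expects a String, so with Anemone you would presumably pass page.body rather than the page object. It only finds absolute URLs, so relative references like /local/relative.png are missed. The Ruby documentation also marks URI.extract as obsolete. The sample HTML and hostnames below are invented for illustration.

```ruby
require 'uri'

# Made-up page content standing in for page.body.
html = <<~HTML
  <img src="http://img.example.org/logo.png">
  <script src="https://cdn.example.net/app.js"></script>
  <a href="mailto:someone@example.com">mail</a>
  <img src="/local/relative.png">
HTML

# Restrict the schemes so mailto: and similar are not returned.
urls = URI.extract(html, %w[http https])
puts urls
```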

After that it's just a matter of using a regular expression to parse each link and extract the domain name.
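
As an alternative to a hand-written regular expression, the standard library can parse each URL and return its host directly. The sketch below swaps in URI.parse for the regex approach. The helper name hosts_for and the sample URLs are hypothetical, and filter_map assumes Ruby 2.7 or newer.

```ruby
require 'uri'

# Hypothetical helper: returns the unique hosts (domain names or IP
# addresses) for a list of absolute URLs, skipping anything URI cannot parse.
def hosts_for(urls)
  urls.filter_map do |url|
    begin
      URI.parse(url).host
    rescue URI::InvalidURIError
      nil # unparseable entries are dropped by filter_map
    end
  end.uniq
end

urls = [
  'http://img.example.org/logo.png',
  'https://cdn.example.net/app.js',
  'http://192.0.2.10:8080/frame.html',
  'http://img.example.org/banner.png'
]
puts hosts_for(urls)
```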
