I'm using RoR. I will specify a link to a web page in my application, and here is what I want to do:
(1) Extract all the links in the web page.
(2) Find whether they are links to PDF files (basically a pattern match).
(3) Download the file in the link (a PDF, for example) and store it on my system.
I tried using Anemone, but it crawls the entire website, which overshoots my needs. Also, how do I download the files at the corresponding links?
Cheers
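If you want to stick with Anemone, you can stop it from crawling the whole site. Here's a minimal sketch, assuming Anemone's :depth_limit option and a made-up starting URL; with a limit of 0 the crawler should stay on the starting page and just report its PDF links:

require 'anemone'

Anemone.crawl('http://www.example.com/downloads', :depth_limit => 0) do |anemone|
  anemone.on_every_page do |page|
    # page.links is an array of URI objects for every link on the page;
    # keep only the ones whose path ends in .pdf
    pdf_links = page.links.select { |uri| uri.to_s =~ /\.pdf\z/i }
    pdf_links.each { |uri| puts uri }
  end
end

You'd still download each matching link yourself (see the open-uri snippets below).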
Have a look at Nokogiri as well.
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open('http://www.thatwebsite.com/downloads'))

doc.css('a').each do |link|
  href = link['href']
  # Only follow links whose URL ends in .pdf (case-insensitive)
  next unless href =~ /\.pdf\z/i

  begin
    # Name the local file after the remote one so multiple
    # downloads don't overwrite each other
    File.open(File.basename(href), 'wb') do |file|
      downloaded_file = open(href)
      file.write(downloaded_file.read)
    end
  rescue => ex
    puts "Something went wrong downloading #{href}: #{ex.message}"
  end
end
You might want to do some better exception handling, but I think you get the idea :)
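One thing to watch: href attributes are often relative ("files/report.pdf"), and open will choke on those. A minimal sketch, assuming the same hypothetical downloads page, that resolves each href against the page URL before fetching:

require 'nokogiri'
require 'open-uri'
require 'uri'

base = 'http://www.thatwebsite.com/downloads'
doc = Nokogiri::HTML(open(base))

# a[href$=".pdf"] lets Nokogiri do the pattern match for us
# (note: it only matches lowercase ".pdf" endings)
doc.css('a[href$=".pdf"]').each do |link|
  # URI.join turns a relative href into an absolute URL;
  # absolute hrefs pass through unchanged
  pdf_url = URI.join(base, link['href']).to_s
  puts pdf_url
end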
Have you tried scrapi? You can scrape the page with CSS selectors.
Ryan Bates also made a screencast about it.
To download the files, you can use open-uri:
require 'open-uri'

url = "http://example.com/document.pdf"
# open returns an IO-like object; read pulls the whole body into memory
contents = open(url).read
# Write the bytes out to a local file
File.open("document.pdf", "wb") { |file| file.write(contents) }
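If the PDFs are large, reading the whole body into a string isn't great; you can stream straight to disk instead. A sketch with the same example URL, using IO.copy_stream (the IO-like object open-uri yields works as a source):

require 'open-uri'

url = "http://example.com/document.pdf"
open(url) do |remote|
  File.open(File.basename(url), 'wb') do |local|
    # Copy in chunks instead of slurping the whole file into memory
    IO.copy_stream(remote, local)
  end
end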