I'm using RoR. I will specify a link to a web page in my application, and here is what I want to do:
(1) Extract all the links in the web page.
(2) Find whether they are links to PDF files (basically a pattern match).
(3) Download the file in the link (a PDF, for example) and store it on my system.
I tried using Anemone, but it crawls the entire website, which overshoots my needs. Also, how do I download the files at the corresponding links?
Cheers
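If you want to stick with Anemone, you can stop it from crawling the whole site. Here's a minimal sketch, assuming Anemone's :depth_limit option and a made-up starting URL; with a limit of 0 the crawler should stay on the starting page and just report its PDF links:

require 'anemone'

Anemone.crawl('http://www.example.com/downloads', :depth_limit => 0) do |anemone|
  anemone.on_every_page do |page|
    # page.links is an array of URI objects for every link on the page;
    # keep only the ones whose path ends in .pdf
    pdf_links = page.links.select { |uri| uri.to_s =~ /\.pdf\z/i }
    pdf_links.each { |uri| puts uri }
  end
end

You'd still download each matching link yourself (see the open-uri snippets below).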
Have a look at Nokogiri as well.
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open('http://www.thatwebsite.com/downloads'))

doc.css('a').each do |link|
  href = link['href']
  # Only follow links whose URL ends in .pdf (case-insensitive)
  next unless href =~ /\.pdf\z/i

  begin
    # Name the local file after the remote one so multiple
    # downloads don't overwrite each other
    File.open(File.basename(href), 'wb') do |file|
      downloaded_file = open(href)
      file.write(downloaded_file.read)
    end
  rescue => ex
    puts "Something went wrong downloading #{href}: #{ex.message}"
  end
end
You might want to do some better exception handling, but I think you get the idea :)
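One thing to watch: href attributes are often relative ("files/report.pdf"), and open will choke on those. A minimal sketch, assuming the same hypothetical downloads page, that resolves each href against the page URL before fetching:

require 'nokogiri'
require 'open-uri'
require 'uri'

base = 'http://www.thatwebsite.com/downloads'
doc = Nokogiri::HTML(open(base))

# a[href$=".pdf"] lets Nokogiri do the pattern match for us
# (note: it only matches lowercase ".pdf" endings)
doc.css('a[href$=".pdf"]').each do |link|
  # URI.join turns a relative href into an absolute URL;
  # absolute hrefs pass through unchanged
  pdf_url = URI.join(base, link['href']).to_s
  puts pdf_url
end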
Have you tried scrapi? You can scrape the page with CSS selectors.
Ryan Bates also made a screencast about it.
To download the files, you can use open-uri:
require 'open-uri'

url = "http://example.com/document.pdf"
# open returns an IO-like object; read pulls the whole body into memory
contents = open(url).read
# Write the bytes out to a local file
File.open("document.pdf", "wb") { |file| file.write(contents) }
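If the PDFs are large, reading the whole body into a string isn't great; you can stream straight to disk instead. A sketch with the same example URL, using IO.copy_stream (the IO-like object open-uri yields works as a source):

require 'open-uri'

url = "http://example.com/document.pdf"
open(url) do |remote|
  File.open(File.basename(url), 'wb') do |local|
    # Copy in chunks instead of slurping the whole file into memory
    IO.copy_stream(remote, local)
  end
end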