How should I schedule many Google Search scrapes over the course of a day?

2023-02-05 06:07 · Source: web

Currently, my Nokogiri script iterates through Google's SERPs until it finds the position of the target website. It does this for each keyword for each website that each user specifies (users are capped on amount of websites & keywords they can track).

Right now, it's run in a rake that's hard-scheduled every day and batches all scrapes at once by looping through all the websites in the database. But I'm concerned about scalability and swarming Google with a batch of requests.

I'd like a solution that scales and can run these scrapes over the course of the day. I'm not sure what kind of solution is available or what I'm really looking for.

Note: The number of websites/keywords changes from day to day as users add and delete their websites and keywords. I don't mean to make this question too superfluous, but is this the kind of thing Beanstalkd/Stalker (job queuing) can be used for?


You will have to balance two issues: scalability for lots of users versus Google shutting you down for scraping in violation of their terms of use.

So your system will need to distribute tasks across different IPs to conceal your bulk scraping, which suggests at least two levels of queuing: one queue to manage all the jobs and dispatch them to each separate IP for searching and collecting results, and a queue on each separate machine to hold its assigned searches until they are executed and the results returned.
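The first level of that design (the dispatcher) can be sketched in a few lines. This is a minimal illustration, not a full implementation: the worker addresses and job list are hypothetical, and in practice each bucket would be pushed to that worker's own queue (Beanstalkd tube, Resque queue, etc.).

```ruby
# Round-robin assignment of scrape jobs to separate worker machines/IPs.
# Worker addresses here are placeholders; jobs could be keyword/site pairs.
def assign_jobs(jobs, workers)
  buckets = Hash.new { |h, k| h[k] = [] }
  jobs.each_with_index do |job, i|
    # Cycle through the workers so load (and Google's view of request
    # volume per IP) stays roughly even.
    buckets[workers[i % workers.size]] << job
  end
  buckets
end

workers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"] # hypothetical scraper hosts
jobs    = [[:site_a, "keyword 1"], [:site_a, "keyword 2"],
           [:site_b, "keyword 1"], [:site_b, "keyword 3"]]
assign_jobs(jobs, workers)
```

Each worker then drains its own local queue at whatever pace keeps it under Google's (unpublished) thresholds.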

I have no idea what Google's thresholds are (I am sure they don't advertise them), but exceeding them and getting cut off would obviously be devastating for what you are trying to do. So your simple looping rake task is exactly what you shouldn't do beyond a certain number of users.

So yes, use a queue of some sort, but realize that your goal probably differs from the typical use of a queue: you want to deliberately delay jobs, rather than offload work to avoid UI delays. So you will be looking for ways to slow the queue down, rather than have it execute job after job as they arrive.

Based on a cursory inspection of DelayedJob and BackgroundJobs, it looks like DelayedJob has what you need with its run_at attribute. But I am only speculating here, and I am sure an expert would have more to say.
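The core of the run_at idea is just computing evenly spaced timestamps across the day and scheduling one job per timestamp. Here is a minimal sketch of that calculation; the job class name is hypothetical, and only the timestamp helper is shown in runnable form:

```ruby
# Spread `count` scrape jobs evenly across a time window (default: 24 hours),
# returning one run_at timestamp per job.
def staggered_run_ats(count, start = Time.now, window_seconds = 24 * 60 * 60)
  interval = window_seconds.to_f / count
  (0...count).map { |i| start + (i * interval).round }
end

# Hypothetical usage with delayed_job: enqueue each scrape at its slot.
#   keywords.zip(staggered_run_ats(keywords.size)).each do |kw, t|
#     Delayed::Job.enqueue(ScrapeJob.new(kw), run_at: t)
#   end
```

Because the number of keywords changes daily, recomputing the timestamps at enqueue time each morning keeps the spacing correct as users add and remove entries.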


If I'm understanding correctly, it sounds like one of these tools might fit the bill:

Delayed_job: https://github.com/tobi/delayed_job

or

BackgroundJobs: http://codeforpeople.rubyforge.org/svn/bj/trunk/README

I've used both of them, and found them easy to work with.


There are definitely some background job libraries that might work.

  • delayed_job: https://github.com/collectiveidea/delayed_job (beware of the unmaintained branch from tobi!)
  • resque: https://github.com/defunkt/resque

However, you might think about just scheduling a cron job that runs more times during the day and processes fewer items per run.
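One simple way to do that is to have each cron run process a rotating slice of the keyword list, so that N runs per day cover everything exactly once. A minimal sketch (the IDs and run index are placeholders; in a real rake task the run index could come from the current hour):

```ruby
# Return the slice of `ids` that run number `run_index` (0-based) should
# process, so that `runs_per_day` runs together cover every id once.
def slice_for_run(ids, run_index, runs_per_day = 24)
  per_run = (ids.size / runs_per_day.to_f).ceil
  ids.slice(run_index * per_run, per_run) || []
end

# Hypothetical hourly rake task:
#   slice_for_run(Keyword.pluck(:id), Time.now.hour).each { |id| scrape(id) }
```

Since the keyword list changes daily, the slices recompute themselves each run; a run index past the end simply returns an empty slice.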


SaaS solution: http://momentapp.com/ ("Launch delayed jobs with scheduled HTTP requests"). Disclaimers: a) it is in beta, and b) I am not affiliated with this service.
