I'm after a little advice around using Cron jobs with PHP. My scenario is this:
I have a website with a large membership. Users have one or several URLs associated with their account. At midnight (or some other set time) I'd like to call a script which will query the websites for each user and update the database with the information it finds. Think of it as a sort of screen scraper service.
My question is around the stress of the server. I'll be testing this new feature on the shared server, but ultimately I will be moving to a dedicated server.
So if the roughly 5,000 members have 2 URLs each, that's 10,000 websites it would query. What do people think is the best way to do this? Have a cron job that runs the first 500 members, then 10 minutes later runs the next 500, and so on...
or is there some magic which I've not heard of which might help!?
Thanks for any tips!
Cron is a great tool for basic tasks like this. However, it scales poorly, as you've surmised! Look into job processing tools, like the open-source (and multi-language) Gearman:
http://gearman.org/
This should be a more robust system for the task at hand.
I would schedule a script to run daily and let it query the 10,000 websites one after another: a single script that loops over all the websites, sends a request to each and processes the results one by one. For numbers like these there's no need to make it any more difficult, imho.
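A minimal sketch of that sequential loop, under some assumptions: the row shape (`user_id`, `url`), the function names `scrapeSequentially` and `fetchWithCurl`, and the timeout values are all hypothetical and not from the original post. The fetch and store steps are passed in as callables so one slow or broken site does not stop the whole run.

```php
<?php
// Hypothetical sketch: visit every URL one after another.
// $rows  : iterable of ['user_id' => int, 'url' => string] (assumed shape)
// $fetch : callable(string $url): string, throws on failure
// $store : callable(int $userId, string $url, string $body): void
function scrapeSequentially(iterable $rows, callable $fetch, callable $store): array
{
    $stats = ['ok' => 0, 'failed' => []];
    foreach ($rows as $row) {
        try {
            $body = $fetch($row['url']);
            $store($row['user_id'], $row['url'], $body);
            $stats['ok']++;
        } catch (Throwable $e) {
            // Record the failure and carry on with the next URL.
            $stats['failed'][$row['url']] = $e->getMessage();
        }
    }
    return $stats;
}

// One possible $fetch implementation using the cURL extension.
// The timeouts are illustrative; without them a hung site stalls the run.
function fetchWithCurl(string $url, int $timeoutSeconds = 10): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_CONNECTTIMEOUT => 5,
        CURLOPT_TIMEOUT        => $timeoutSeconds,
    ]);
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException("Fetch failed for $url: " . curl_error($ch));
    }
    return $body;
}
```

With a 10 second timeout the worst case for 10,000 URLs is a little under 28 hours, so it is worth measuring a real run before deciding whether a single sequential pass fits in the nightly window.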
As suggested already you could run the URL script all in one go sequentially. That's the simplest approach.
If that's not fast enough you could easily modify your cron script so that it can be invoked for odd or even IDs only. Run the script twice starting at midnight, once for odds and once for evens, and as long as you don't exhaust any resources on the machine it should finish roughly twice as fast.
In terms of implementation, I would have the script accept two integer values that define the modulus and the remainder. E.g. for odd/even you pass "2 0" and "2 1", which would result in something like SELECT * FROM myTable WHERE id % 2 = 0
and SELECT * FROM myTable WHERE id % 2 = 1
being executed against the SQL database. Using this approach it'd be very easy to configure any number of jobs to run in parallel.
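The modulus/remainder idea could be sketched roughly as follows. The script name `scrape.php`, the helper names `parseShardArgs` and `fetchShard`, and the use of PDO are assumptions for illustration; only the table name `myTable` and the `id % N = R` filter come from the answer above.

```php
<?php
// Hypothetical crontab entries, both starting at midnight:
//   0 0 * * * php /path/to/scrape.php 2 0
//   0 0 * * * php /path/to/scrape.php 2 1

// Validate the two command-line integers: modulus >= 1, 0 <= remainder < modulus.
function parseShardArgs(array $argv): array
{
    if (count($argv) !== 3) {
        throw new InvalidArgumentException('Usage: php scrape.php <modulus> <remainder>');
    }
    $modulus   = filter_var($argv[1], FILTER_VALIDATE_INT, ['options' => ['min_range' => 1]]);
    $remainder = filter_var($argv[2], FILTER_VALIDATE_INT, ['options' => ['min_range' => 0]]);
    if ($modulus === false || $remainder === false || $remainder >= $modulus) {
        throw new InvalidArgumentException('Need integers with 0 <= remainder < modulus');
    }
    return [$modulus, $remainder];
}

// Select only the rows belonging to this job's shard.
// Assumes an existing PDO connection and an integer id column.
function fetchShard(PDO $pdo, int $modulus, int $remainder): array
{
    $stmt = $pdo->prepare('SELECT * FROM myTable WHERE id % :modulus = :remainder');
    $stmt->bindValue(':modulus', $modulus, PDO::PARAM_INT);
    $stmt->bindValue(':remainder', $remainder, PDO::PARAM_INT);
    $stmt->execute();
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```

Because every id falls into exactly one remainder class, the jobs never process the same row twice, and moving from 2 to 4 parallel jobs only means changing the crontab arguments to "4 0" through "4 3".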
gearmand is very powerful and I have used it on a number of projects but there's a bigger learning curve with it. I think the simple solution I suggested should get you by.