I've put together a complex screen-scraping script using Selenium2, the Selenium WebDriver, and a PHP binding script. The end result is a PHP script that drives Selenium, which in turn fetches a URL, parses some JavaScript, fills out a form, blah blah blah, and then returns the HTML that I'm ultimately after. It all works great on my local computer (my development and proof-of-concept environment).
So.
For production, I need this script to run automatically three times every day. I am trying to figure out if it would be better for me to set up everything on my server (meaning: figure out how to get Firefox for Linux going, then Java, then Selenium2, etc, etc... not trivial for me; Damn it Jim, I'm a coder, not a sysadmin!), or if I can use a 3rd-party Selenium testing service like Sauce Labs' OnDemand, or any of these other cloud-based Selenium services.
Those 3rd party solutions seem like they're all set up for "unit testing," which is totally not what I'm doing. I don't know about that stuff, or using PHPUnit, or doing tests with builds, or whatever. I just want to run my straightforward PHP script 3x/day and have it talk to Selenium to drive a browser and do my screen scraping.
Is one of those 3rd party solutions a good idea for what I'm trying to accomplish, or are they overkill/too far away from my (relatively simple) goal?
First, I want to let you know that I use Selenium with Ruby, so I am assuming that running your PHP script will start up the Selenium WebDriver and run your tests... I will just explain how to easily run your script 3 times a day without needing to be a sysadmin master.
Linux has an extremely stable and robust scheduler called cron, which is what you will need to use. It allows you to schedule actions to happen daily/hourly/whatever.
The first thing you want to do is go to the directory containing your script, which I will refer to as script.php. Make sure that the top line of your script is the following (if PHP lives somewhere else on your system, running "which php" will tell you the path to use):
#!/usr/bin/php
In that directory, run the following command to make your file executable:
chmod +x script.php
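If you want to double-check those two steps before handing the script to cron, here is a minimal sketch of a check. The function name check_script is made up for illustration; it only looks for a shebang on the first line and the execute bit, nothing more.

```shell
# Hypothetical helper: report whether a script is ready to be run directly.
check_script() {
    script="$1"
    # The first line must be a shebang so the system knows which interpreter to use.
    first_line=$(head -n 1 "$script")
    case "$first_line" in
        '#!'*) ;;
        *) echo "missing shebang"; return 1 ;;
    esac
    # The file must carry the execute bit that chmod +x sets.
    if [ ! -x "$script" ]; then
        echo "not executable"
        return 1
    fi
    echo "ok"
}
```

For example, "check_script script.php" should print "ok" once both steps above are done.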
Now set up your cron job with the following command:
crontab -e
Then put in your job:
00 4,12,20 * * * /home/sean/script.php
00 - Means at 00 minutes past the hour.
4,12,20 - Are the hours (it is a 24 hour clock.)
The first: * - Every day of the month
The second: * - Every month
The third: * - Every day of the week
So this script would run every day, every week, every month at 4am, noon and 8pm.
Obviously change the directory to the script on your system and set the times to whenever you want the scraping to occur.
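One thing that tends to help with cron jobs is keeping the output somewhere you can read it later, since cron does not show it on a terminal. Below is a sketch of the same entry with a header comment naming the five fields and with output appended to a log file. The log path /home/sean/scrape.log is just an assumption for illustration; use whatever location suits you.

```
# min  hour     day-of-month  month  day-of-week  command
00     4,12,20  *             *      *            /home/sean/script.php >> /home/sean/scrape.log 2>&1
```

The ">>" appends the script's normal output to the log, and "2>&1" sends error messages to the same file, so a failed run at 4am leaves something behind to look at.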
I hope this helps!
-Appended stuff for the java/firefox-
First off, take this all with a grain of salt since I am using Ruby :)
Okay, to get Java/Firefox running you will probably want to grab the Selenium standalone server. You can grab it here.
Then, to run the Selenium server, you just run:
java -jar selenium-server-standalone-2.5.0.jar
You can put the command that starts the standalone server in the cron job and then shut the server down from your script file.
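As a slight variation on that idea, a small wrapper script can both start and stop the server around the scraper, so the PHP script does not have to know about it. This is only a rough sketch under some assumptions: the jar location, the 10 second startup wait, and the names run_scrape, SERVER_CMD, SCRAPER and STARTUP_WAIT are all made up for illustration, and a fixed sleep is a crude stand-in for actually checking that the server is up.

```shell
#!/bin/sh
# Hypothetical wrapper: start the Selenium server, run the scraper, stop the server.
# Every path and value below is an assumption; adjust them for your system.
SELENIUM_JAR="${SELENIUM_JAR:-/home/sean/selenium-server-standalone-2.5.0.jar}"
SERVER_CMD="${SERVER_CMD:-java -jar $SELENIUM_JAR}"
SCRAPER="${SCRAPER:-/home/sean/script.php}"
STARTUP_WAIT="${STARTUP_WAIT:-10}"

run_scrape() {
    # Start the server in the background and remember its process ID.
    # SERVER_CMD is left unquoted on purpose so it splits into command and arguments.
    $SERVER_CMD >/dev/null 2>&1 &
    server_pid=$!

    # Give the server a moment to start (crude; a port check would be more robust).
    sleep "$STARTUP_WAIT"

    # Run the scraper and remember its exit status.
    status=0
    "$SCRAPER" || status=$?

    # Stop the server whether or not the scraper succeeded.
    kill "$server_pid" 2>/dev/null || true
    wait "$server_pid" 2>/dev/null || true

    return "$status"
}
```

To use it, you would save this as its own file, add a final line that just says run_scrape, make it executable with chmod +x, and point the cron entry at the wrapper instead of at script.php.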