I need to build a small "monitoring" scraper for a third-party website (an external site that has stats about our visitors).
Unfortunately, this website is very hard to scrape through the normal "wget" mechanism, because it uses a ton of sophisticated JS, part of it generated by GWT. So my workaround was to create a GreaseMonkey script and then have this script call a PHP page that would log the scraped data. Then as soon as Firefox starts with this webpage-to-scrape, the script goes to work.
This works well, but now I am trying to make it more robust as a monitoring tool. I want it to run on the server from a cron job. As far as I understand, Firefox needs the DISPLAY variable to be set and an X session to exist; without them it refuses to start for me. Is there any nice way to run it from the batchuser account as a cron job?
I've done something similar to get Selenium running headless on a server. I used Xvfb.
http://en.wikipedia.org/wiki/Xvfb
This article has some tips for using Xvfb with Firefox:
http://semicomplete.com/blog/geekery/xvfb-firefox.html
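To make that concrete, here is a rough sketch of a wrapper a cron job could call. It starts Xvfb on a spare display, points DISPLAY at it, and launches Firefox so the GreaseMonkey script can run. The display number `:99`, the screen size, the two-minute run time, and the URL are all my own assumptions, not anything from the article above; it also assumes `Xvfb` and `firefox` are on the PATH of the cron user.

```python
import os
import shutil
import subprocess
import time


def xvfb_command(display=":99", screen="1024x768x24"):
    """Build the Xvfb command line for a virtual display (values are assumptions)."""
    return ["Xvfb", display, "-screen", "0", screen]


def firefox_env(display=":99", base_env=None):
    """Return a copy of the environment with DISPLAY pointing at the virtual display."""
    env = dict(os.environ if base_env is None else base_env)
    env["DISPLAY"] = display
    return env


def run_scrape(url, display=":99", run_seconds=120):
    """Start Xvfb, run Firefox against it for a fixed time, then clean up both."""
    if shutil.which("Xvfb") is None or shutil.which("firefox") is None:
        raise RuntimeError("Xvfb and firefox must both be on PATH")

    xvfb = subprocess.Popen(xvfb_command(display))
    try:
        # Crude wait for the virtual X server to come up.
        time.sleep(2)
        firefox = subprocess.Popen(["firefox", url], env=firefox_env(display))
        try:
            # Give the userscript time to scrape and call the logging page.
            firefox.wait(timeout=run_seconds)
        except subprocess.TimeoutExpired:
            firefox.terminate()
            firefox.wait()
    finally:
        xvfb.terminate()
        xvfb.wait()


if __name__ == "__main__":
    # Hypothetical placeholder URL; replace with the page to scrape.
    run_scrape("http://example.com/stats")
```

The cron entry then only has to invoke this script as batchuser; nothing in the crontab itself needs to know about X. If a fixed display number clashes with something else on the box, pick another one.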
The best way to do that is to build Firefox in headless mode: http://hg.mozilla.org/incubator/offscreen