How does ScraperWiki decide to stop a scheduled run? Is it based on the actual execution time or on CPU time? Or maybe something else?
I scrape a site for which Mechanize takes 30 seconds to load every page, but I use very little CPU to process the pages, so I wonder whether the slowness of the server is a major issue.
CPU time, not wall-clock time. It's based on the Linux setrlimit call.
Each scraper run has a limit of roughly 80 seconds of processing time. After that, in Python and Ruby you will get an exception, "ScraperWiki CPU time exceeded". In PHP the run ends with "terminated by SIGXCPU".
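Here is a minimal sketch of the mechanism described above, using only the Python standard library: setrlimit caps CPU time (not wall-clock time), and the process receives SIGXCPU when the soft limit is hit. This is not ScraperWiki's actual code, and the 2-second limit is just for illustration (ScraperWiki's is ~80s), but it shows why time spent waiting on a slow server does not count against the limit.

    # Illustration only: a CPU-time limit via setrlimit, assuming Linux.
    import resource
    import signal
    import time

    def on_cpu_limit(signum, frame):
        # ScraperWiki raises its own exception at this point; here we just
        # raise a generic one so the main loop can catch it.
        raise RuntimeError("CPU time exceeded")

    signal.signal(signal.SIGXCPU, on_cpu_limit)

    # Soft limit of 2 CPU seconds, hard limit of 4 (illustrative values).
    resource.setrlimit(resource.RLIMIT_CPU, (2, 4))

    start = time.time()
    try:
        while True:
            pass  # burns CPU; a time.sleep() here would NOT count against the limit
    except RuntimeError as e:
        print(e, "after", round(time.time() - start, 1), "wall-clock seconds")

So in your case, the 30 seconds Mechanize spends waiting for each page is mostly idle time and costs very little CPU; only the parsing and processing you do afterwards eats into the limit.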
In many cases this happens when you are first scraping a site and catching up with the backlog of existing data. The best way to handle it is to make your scraper do a chunk at a time, using the save_var and get_var functions (see http://scraperwiki.com/docs/python/python_help_documentation/) to remember your place.
That also lets you recover more easily from other parsing errors.
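A rough sketch of that chunking approach, assuming the scraperwiki library's sqlite.save_var / sqlite.get_var helpers and a hypothetical scrape_page() function that fetches and parses one listing page:

    import scraperwiki

    PAGES_PER_RUN = 20  # small enough to finish well within the CPU limit

    # Pick up where the previous run left off (defaults to page 1 on the first run).
    start_page = scraperwiki.sqlite.get_var('last_page', 1)

    for page in range(start_page, start_page + PAGES_PER_RUN):
        rows = scrape_page(page)      # hypothetical: fetch and parse one page
        if not rows:
            break                     # nothing left to scrape
        scraperwiki.sqlite.save(unique_keys=['id'], data=rows)
        scraperwiki.sqlite.save_var('last_page', page + 1)  # remember our place

Because the position is saved after every page, a run that hits the CPU limit (or dies on a parsing error) simply resumes from the last saved page on the next scheduled run.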