I have a Python script that makes URL requests using urllib2. I have a pool of 5 processes that run asynchronously and execute a function. This function makes the URL calls, gets the data, parses it into the required format, performs calculations, and inserts the data. The amount of data varies for each URL request.
I run this script every 5 minutes using a cron job. Sometimes when I run ps -ef | grep python, I see stuck processes. Is there a way, within the multiprocessing module, to keep track of the processes and their state (completed, stuck, dead, and so on)? Here is a code snippet:
This is how I call the async processes:
pool = Pool(processes=5)
pool.apply_async(getData, )
And the following is a part of getData which performs urllib2 requests:
try:
    Url = "http://gotodatasite.com"
    data = urllib2.urlopen(Url).read().split('\n')
except URLError, e:
    if hasattr(e, "code"):  # only HTTPError carries a status code
        print "Error:", e.code
    print e.reason
    sys.exit(0)
Is there a way to track stuck processes and rerun them?
If you are set on using multiprocessing, implement a ping mechanism. I assume you're looking for processes that have become stuck because of slow I/O?
Personally I would go with a queue (not necessarily a queue server): say, for example, ~/jobs is a list of URLs to work on; then have a program that takes the first job and performs it. Then it's just a matter of bookkeeping: say, have the program note when it was started and what its PID is. If you need to kill slow jobs, just kill the PID and mark the job as failed.
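The bookkeeping described above could look roughly like the sketch below. It is not tied to any particular queue format: the names run_jobs, build_command, max_seconds and poll_interval are made up for illustration, the jobs are assumed to be unique strings (for example URLs read from a jobs file), and each job is assumed to run as its own child process started from a command line. Unlike the pool of 5 in the question, it starts every job at once.

```python
import subprocess
import time


def run_jobs(jobs, build_command, max_seconds=60.0, poll_interval=0.1):
    """Run one child process per job and return {job: status}.

    status is "done" (exit code 0), "failed" (non-zero exit code)
    or "killed" (still running after max_seconds).
    """
    # Bookkeeping table: job -> (process handle, start time).
    # The PID of each child is available as proc.pid.
    running = {}
    for job in jobs:
        proc = subprocess.Popen(build_command(job))
        running[job] = (proc, time.time())

    status = {}
    while running:
        for job, (proc, started) in list(running.items()):
            exit_code = proc.poll()  # None while the child is still running
            if exit_code is not None:
                status[job] = "done" if exit_code == 0 else "failed"
                del running[job]
            elif time.time() - started > max_seconds:
                proc.kill()  # the job is considered stuck: kill it
                proc.wait()  # reap it so no zombie process is left behind
                status[job] = "killed"
                del running[job]
        time.sleep(poll_interval)
    return status
```

Jobs that come back as "killed" or "failed" can then be put back on the list and retried on the next run.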
Look up urllib2 and timeout: urllib2.urlopen accepts a timeout argument, in seconds (available since Python 2.6). If the timeout is reached you get an exception, and the process is no longer stuck.
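As a minimal sketch of the timeout approach: the helper name fetch_lines and the 10-second default are assumptions made for illustration, and the import fallback is only there so that the same code also runs on Python 3, where urllib2 was split into urllib.request and urllib.error.

```python
import socket

try:  # Python 2, as in the question
    from urllib2 import urlopen, URLError
except ImportError:  # Python 3 equivalent
    from urllib.request import urlopen
    from urllib.error import URLError


def fetch_lines(url, timeout=10.0):
    """Return the response body split into lines, or None on error or timeout."""
    response = None
    try:
        # timeout (in seconds) applies to connecting and to each blocking read
        response = urlopen(url, timeout=timeout)
        return response.read().decode("utf-8", "replace").split("\n")
    except (URLError, socket.timeout) as e:
        # A timeout can surface as URLError (for example while connecting)
        # or as socket.timeout (while waiting for or reading the response).
        print("Request to %s failed: %s" % (url, e))
        return None
    finally:
        if response is not None:
            response.close()
```

Returning None instead of calling sys.exit lets the caller decide whether to retry. Note that the timeout bounds each blocking socket operation, not the whole download, so a server that keeps trickling data can still take longer than the timeout in total.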