
How do I monitor a "stuck" Python script?

I have a data-intensive Python script that uses HTTP connections to download data. I usually run it overnight. Sometimes the connection will fail, or a website will be unavailable momentarily. I have basic error-handling that catches these exceptions and tries again periodically, exiting gracefully (and logging errors) after 5 minutes of retrying.

However, I've noticed that sometimes the job just freezes. No error is thrown, and the job is still running, sometimes hours after the last print message.

What is the best way to:

  • monitor a Python script,
  • detect if it is unresponsive after a given interval,
  • exit it if it is unresponsive,
  • and start another one?
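The four steps above can be sketched as a supervisor process that treats prolonged silence on the worker's stdout as "unresponsive". This is a minimal Unix-only sketch (select() on pipes doesn't work on Windows); the name run_with_watchdog and the 300-second default are my own choices, not from any particular library:

```python
import select
import subprocess
import sys

def run_with_watchdog(cmd, silence_limit=300):
    """Run `cmd`, restarting it whenever it prints nothing for
    `silence_limit` seconds. Returns the exit code once the child
    exits on its own. Unix-only (select() on pipes)."""
    while True:
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT)
        while True:
            # Wait until the child writes something, or the limit expires.
            ready, _, _ = select.select([proc.stdout], [], [], silence_limit)
            if not ready:
                proc.kill()          # unresponsive: kill it...
                proc.wait()
                break                # ...and restart it
            line = proc.stdout.readline()
            if not line:             # EOF: the child exited on its own
                return proc.wait()
            sys.stdout.write(line.decode("utf-8", "replace"))
```

This assumes the worker prints a line per download; something like run_with_watchdog([sys.executable, "downloader.py"], silence_limit=600) would then restart it after ten silent minutes.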

UPDATE

Thank you all for your help. As a few of you have pointed out, the urllib and socket modules don't have timeouts set by default. I am using Python 2.5 with the Freebase and urllib2 modules, and catching and handling MetawebErrors and urllib2.URLErrors. Here is a sample of the error output after the last script hung for 12 hours:

  File "/home/matthew/dev/projects/myapp_module/project/app/myapp/contrib/freebase/api/session.py", line 369, in _httpreq_json
    resp, body = self._httpreq(*args, **kws)
  File "/home/matthew/dev/projects/myapp_module/project/app/myapp/contrib/freebase/api/session.py", line 355, in _httpreq
    return self._http_request(url, method, body, headers)
  File "/home/matthew/dev/projects/myapp_module/project/app/myapp/contrib/freebase/api/httpclients.py", line 33, in __call__
    resp = self.opener.open(req)
  File "/usr/lib/python2.5/urllib2.py", line 381, in open
    response = self._open(req, data)
  File "/usr/lib/python2.5/urllib2.py", line 399, in _open
    '_open', req)
  File "/usr/lib/python2.5/urllib2.py", line 360, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.5/urllib2.py", line 1107, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.5/urllib2.py", line 1080, in do_open
    r = h.getresponse()
  File "/usr/lib/python2.5/httplib.py", line 928, in getresponse
    response.begin()
  File "/usr/lib/python2.5/httplib.py", line 385, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.5/httplib.py", line 343, in _read_status
    line = self.fp.readline()
  File "/usr/lib/python2.5/socket.py", line 372, in readline
    data = recv(1)
KeyboardInterrupt

You'll notice the blocking socket read at the bottom (where Ctrl-C interrupted it). Since I'm using Python 2.5 and don't have access to the third urllib2.urlopen timeout argument, is there another way to watch for and catch this condition? For example, I'm catching URLErrors - is there another type of error in urllib2 or socket that I can catch which will help me?
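Once a default timeout is in place via socket.setdefaulttimeout() (see the voidspace snippet further down), that blocking recv() turns into a socket.timeout exception, which can be caught alongside urllib2.URLError (on 2.5 the timeout can also surface wrapped inside a URLError). A self-contained sketch in modern syntax; the local server that accepts the connection but never replies is a stand-in for the unresponsive remote host:

```python
import socket

socket.setdefaulttimeout(0.5)   # all new sockets now time out after 0.5 s

def read_or_none(sock, n=16):
    """Read up to n bytes; return None if the peer stays silent past
    the socket timeout, instead of hanging forever."""
    try:
        return sock.recv(n)
    except socket.timeout:
        return None

# A local "server" that accepts connections but never sends anything.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))   # OS picks a free port
server.listen(1)

client = socket.create_connection(server.getsockname())
result = read_or_none(client)   # None after ~0.5 s instead of hanging
```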


It sounds like there is a bug in your script. The answer is not to monitor the bug, but to hunt down the bug and fix it.

We can't help you find the bug without seeing some code. But as a general idea, you might want to use logging to pinpoint where the problem is occurring, and write unit tests to help you build confidence about which parts of your code do not have the bug.

Another idea is to break your "stuck" program with Ctrl-C and study the traceback. It shows the line your program was executing when it hung, which may give you a clue where the script is going wrong.
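On Unix, that manual Ctrl-C can be automated with signal.alarm(): schedule a SIGALRM that interrupts whatever blocking call the script is stuck in, and the resulting traceback shows exactly where it was hanging. A sketch (the Unresponsive exception and with_deadline helper are my own names; this only works in the main thread, on Unix, with whole-second resolution):

```python
import signal
import time

class Unresponsive(Exception):
    """Raised when a call exceeds its deadline."""

def _on_alarm(signum, frame):
    # The traceback of this exception shows where the script was stuck.
    raise Unresponsive("no progress within the allotted time")

signal.signal(signal.SIGALRM, _on_alarm)

def with_deadline(seconds, func, *args):
    """Run func(*args); raise Unresponsive if it blocks for longer
    than `seconds`."""
    signal.alarm(seconds)
    try:
        return func(*args)
    finally:
        signal.alarm(0)   # cancel the alarm if func returned in time

# A blocking call that would otherwise sleep 10 s is cut short:
try:
    with_deadline(1, time.sleep, 10)
    timed_out = False
except Unresponsive:
    timed_out = True
```

Wrapping each download attempt in with_deadline() would turn a silent hang into a catchable exception with a useful traceback.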


Since the program is doing web communication, I'd fire up a debugging proxy like Charles (http://www.charlesproxy.com/) and see if there's anything kooky happening in the back-and-forth between your script and the server.

Also consider that the socket module has no timeout set by default and can therefore hang. As of Python 2.6, you can pass a third argument to urllib2.urlopen (if you are using urllib2, that is), specifying a request timeout in seconds. That way the script will error out rather than go catatonic waiting for a response from a perhaps uncooperative server. If you haven't already, I'd check these things out before trying anything more elaborate.

Update for Python 2.5: To do this in Python < 2.6, you have to set the timeout value directly in the socket module, which urllib2 uses under the hood. I haven't tried this, but it presumably works. Found this info at http://www.voidspace.org.uk/python/articles/urllib2.shtml:

import socket
import urllib2

# timeout in seconds
timeout = 10
socket.setdefaulttimeout(timeout)

# this call to urllib2.urlopen now uses the default timeout
# we have set in the socket module
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
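With that default timeout in place, the hang becomes an exception you can fold into the existing retry logic. On 2.5 you'd catch socket.timeout and urllib2.URLError; the wrapper below is a generic sketch (fetch_with_retries is my own name), catching (socket.timeout, IOError) so it also covers the versions where socket errors descend from IOError/OSError:

```python
import socket
import time

def fetch_with_retries(fetch, retries=5, delay=2.0):
    """Call fetch() until it succeeds, sleeping `delay` seconds between
    attempts; re-raise the last error after `retries` failed attempts.
    `fetch` is any zero-argument callable, e.g.
    lambda: urllib2.urlopen(req).read() on Python 2."""
    last_error = None
    for _ in range(retries):
        try:
            return fetch()
        except (socket.timeout, IOError) as exc:
            last_error = exc
            time.sleep(delay)
    raise last_error
```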


A simple way to do what you ask is to have your script send UDP packets (heartbeats) to a separate harvesting program that monitors it. If the monitor doesn't receive a packet within a certain amount of time, it kills the worker Python process and starts another one.
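The heartbeat scheme takes only a few lines on each side. A sketch (function names and the localhost address are illustrative; the monitor's main loop would spawn the worker with subprocess.Popen and, whenever heartbeat_received() returns False, call proc.kill() and spawn a fresh one):

```python
import socket

def make_monitor_socket(max_silence=300.0):
    """Socket the monitoring process listens on for heartbeats."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", 0))       # OS picks a free port
    sock.settimeout(max_silence)      # the "unresponsive" threshold
    return sock

def send_heartbeat(monitor_addr):
    """Called from the worker script after each unit of work."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(b"alive", monitor_addr)
    sock.close()

def heartbeat_received(sock):
    """Block for one heartbeat; False means the worker went silent for
    longer than max_silence and should be killed and restarted."""
    try:
        sock.recvfrom(16)
        return True
    except socket.timeout:
        return False
```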


You could run your script under pdb (e.g. python -m pdb myscript.py) and break in when you suspect it's frozen. It won't fix anything on its own, but it might help you figure out why it's freezing.
