I am downloading URLs in Python and need to detect 404s, so after some searching I came up with:
import urllib

class MyUrlOpener(urllib.FancyURLopener):
    def retrieve(self, url, filename=None, reporthook=None, data=None):
        self.file_was_found = True
        val = urllib.FancyURLopener.retrieve(self, url, filename, reporthook, data)
        return val

    def http_error_404(url, fp, errcode, errmsg, headers, data):
        url.file_was_found = False

def download_file(url, saveas):
    urlaccess = MyUrlOpener()
    localFile, headers = urlaccess.retrieve(url, saveas)
    return urlaccess.file_was_found
My question: if you look at the Python 2.7 source code for the http_error method that FancyURLopener uses, you see:
def http_error(self, url, fp, errcode, errmsg, headers, data=None):
    """Handle http errors.
    Derived class can override this, or provide specific handlers
    named http_error_DDD where DDD is the 3-digit error code."""
    # First check if there's a specific handler for this error
    name = 'http_error_%d' % errcode
    if hasattr(self, name):
        method = getattr(self, name)
        if data is None:
            result = method(url, fp, errcode, errmsg, headers)
        else:
            result = method(url, fp, errcode, errmsg, headers, data)
        if result: return result
    return self.http_error_default(url, fp, errcode, errmsg, headers)
This passes url as the first parameter, not self. I thought the first parameter of a method was always a reference to the class instance (named self by convention), and my code confirms this. So what happens to the url value?
UPDATE: It turns out that data was None, so http_error was using the first call form (the one that does not pass data). This foiled my attempts to manually add the self parameter. As soon as I added the =None default to data in my http_error_404 signature, all was well (because it used the default).
The fixed / correct signature is:
def http_error_404(self, url, fp, errcode, errmsg, headers, data=None):
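The effect described in the update can be reproduced without any network access. The following is only a sketch: ToyOpener, RequiredData, OptionalData and try_dispatch are hypothetical names, and ToyOpener copies just the dispatch logic quoted above rather than the real urllib class.

```python
class ToyOpener(object):
    """Hypothetical stand-in that mimics only the http_error dispatch logic."""

    def http_error(self, url, fp, errcode, errmsg, headers, data=None):
        name = 'http_error_%d' % errcode
        if hasattr(self, name):
            method = getattr(self, name)
            if data is None:
                # Only five arguments are forwarded; `data` is left out.
                return method(url, fp, errcode, errmsg, headers)
            return method(url, fp, errcode, errmsg, headers, data)
        return 'default'


class RequiredData(ToyOpener):
    # `data` has no default, so the five-argument call cannot satisfy it.
    def http_error_404(self, url, fp, errcode, errmsg, headers, data):
        return 'handled'


class OptionalData(ToyOpener):
    # `data=None` lets the same handler accept both call forms.
    def http_error_404(self, url, fp, errcode, errmsg, headers, data=None):
        return 'handled'


def try_dispatch(opener):
    """Dispatch a fake 404 and report whether the handler could be called."""
    try:
        return opener.http_error('http://example.invalid/x', None, 404,
                                 'Not Found', {})
    except TypeError:
        return 'TypeError'


required_result = try_dispatch(RequiredData())
optional_result = try_dispatch(OptionalData())
```

With data left as None, the handler that requires data fails with a TypeError, while the one with the default is called normally.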
In Python, when you call a method on a class instance, the interpreter passes the instance in as the first argument (self), and all of the other arguments are shifted down one place automatically.
In other words, the Python interpreter treats:
urlaccess.retrieve(url, saveas)
as something that looks like this:
MyUrlOpener.retrieve(urlaccess, url, saveas)
So you don't have to do it yourself. However, since "explicit is better than implicit", any instance method you declare must explicitly list the instance as its first parameter, even though Python passes that argument without any action on the part of the programmer.
The first argument does not have to be called self; that is only a convention.
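A small sketch of both points, using a hypothetical Greeter class that is not part of the original code: the bound call and the explicit call through the class give the same result, and the first parameter works under a name other than self.

```python
class Greeter(object):
    def __init__(this, name):        # `this` plays the role of `self`
        this.name = name

    def greet(this, greeting):
        return '%s, %s' % (greeting, this.name)


g = Greeter('world')
bound_call = g.greet('hello')               # instance supplied automatically
explicit_call = Greeter.greet(g, 'hello')   # instance supplied by hand
```

Naming the parameter this is only for illustration; sticking with self is what other readers will expect.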
So, to actually answer your question (as mluebke did): you need to specify the self argument.
def http_error_404(url, fp, errcode, errmsg, headers, data):
    url.file_was_found = False
    # Python is treating `url` as `self`
    # Therefore the URL is being saved in `fp`, `fp` in `errcode`, etc.
To fix this problem, add a first argument to pick up the instance.
def http_error_404(self, url, fp, errcode, errmsg, headers, data=None):
    self.file_was_found = False
    # Now everything should work; data needs the None default because
    # http_error leaves it out of the call when it is None
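To see the shift in isolation, here is a minimal sketch with a hypothetical Handler class (the method names broken and fixed are made up for illustration). Each method simply reports what landed in each parameter.

```python
class Handler(object):
    # Missing `self`: the instance lands in the first parameter, `url`,
    # and every other argument moves one slot to the right.
    def broken(url, fp, errcode):
        return {'url': url, 'fp': fp, 'errcode': errcode}

    # With `self` declared, the remaining parameters line up as intended.
    def fixed(self, url, fp):
        return {'self': self, 'url': url, 'fp': fp}


h = Handler()
shifted = h.broken('http://example.invalid/x', 'file-object')
aligned = h.fixed('http://example.invalid/x', 'file-object')
```

In shifted, the url key holds the Handler instance and the fp key holds the URL string, which is the same misalignment as in the original http_error_404.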
self is explicitly listed in the method definition, but implicitly passed when the method is called. Change your function to look like this and all your variables will line up again.
def http_error_404(self, url, fp, errcode, errmsg, headers, data=None):
    self.file_was_found = False