I've written a crawler that uses urllib2 to fetch URLs.
Every few requests I see some strange behavior. I've tried analyzing it with Wireshark but couldn't pin down the problem.
getPAGE() is responsible for fetching a URL. It returns the content of the page (response.read()) if the fetch succeeds, otherwise it returns None.
    import time
    from urllib2 import Request, urlopen, HTTPError, URLError

    def getPAGE(FetchAddress):
        attempts = 0
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0'}
        while attempts < 2:
            req = Request(FetchAddress, None, headers)
            try:
                response = urlopen(req)  # fetching the url
            except HTTPError, e:
                print 'The server couldn\'t fulfill the request.'
                print 'Error code: ', str(e.code) + " address: " + FetchAddress
                time.sleep(4)
                attempts += 1
            except URLError, e:
                print 'Failed to reach the server.'
                print 'Reason: ', str(e.reason) + " address: " + FetchAddress
                time.sleep(4)
                attempts += 1
            except Exception, e:
                print 'Something bad happened in getPAGE.'
                print 'Reason: ' + str(e) + " address: " + FetchAddress
                time.sleep(4)
                attempts += 1
            else:
                return response.read()
        return None
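Two things I could try here: passing an explicit timeout to urlopen() so a dead connection can't hang for 20 minutes, and doing the read() inside the try so that errors raised while reading the body are also caught and retried. A minimal sketch, written against Python 3's urllib.request (urllib2's successor); fetch_with_retry and the injected opener parameter are my own names, not part of any library:

```python
import time

def fetch_with_retry(url, opener, attempts=2, delay=4, timeout=10):
    """Fetch url, retrying on any error; return the body or None.

    `opener` is injected (normally urllib.request.urlopen) so the retry
    logic can be exercised without a network connection.
    """
    for _ in range(attempts):
        try:
            response = opener(url, timeout=timeout)  # timeout avoids an indefinite hang
            return response.read()                   # read() inside try: a reset during read is caught too
        except Exception as e:
            print('fetch failed for %s: %s' % (url, e))
            time.sleep(delay)
    return None

# Example with a fake opener that fails once, then succeeds:
class _FakeResponse(object):
    def read(self):
        return b'<html>ok</html>'

calls = []
def flaky_opener(url, timeout=None):
    calls.append(url)
    if len(calls) == 1:
        raise IOError('connection reset')
    return _FakeResponse()

print(fetch_with_retry('http://example.com', flaky_opener, delay=0))  # b'<html>ok</html>' after one retry
```

The important difference from getPAGE() is that response.read() runs inside the try, so a connection reset during the body read is retried instead of escaping to the caller.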
This is the function that calls getPAGE() and checks whether the page I've fetched is valid (by checking companyID = soup.find('span', id='lblCompanyNumber').string — if companyID is None, the page is not valid). If the page is valid, it saves the soup object to a global variable named curRes.
    def isValid(ID):
        global curRes
        try:
            address = urlPath + str(ID)
            page = getPAGE(address)
            if page == None:
                saveToCsv(ID, badRequest=True)
                return False
        except Exception, e:
            print "An error occurred in the first Exception block of parseHTML : " + str(e) + ' address: ' + address
        else:
            try:
                soup = BeautifulSoup(page)
            except TypeError, e:
                print "An error occurred in the second Exception block of parseHTML : " + str(e) + ' address: ' + address
                return False
            try:
                companyID = soup.find('span', id='lblCompanyNumber').string
                if companyID == None:  # if lblCompanyNumber is None, we don't have the content we want; log to the bad file
                    saveToCsv(ID, isEmpty=True)
                    return False
                else:
                    curRes = soup  # we have the data we need; save the soup obj to a global variable
                    return True
            except Exception, e:
                print "Error while parsing this page, third exception block: " + str(e) + ' id: ' + address
                return False
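Side note on that third block: .find() returns None when the span isn't in the page at all, and chaining .string onto None raises an AttributeError. Guarding the lookup makes that case explicit. A self-contained sketch of the pattern, using a regex stand-in for the BeautifulSoup lookup (find_company_id is a hypothetical helper, not part of any library):

```python
import re

def find_company_id(html):
    # Hypothetical stand-in for soup.find('span', id='lblCompanyNumber'):
    # returns the span's text, or None when the span is missing.
    m = re.search(r'<span[^>]*id="lblCompanyNumber"[^>]*>([^<]*)</span>', html)
    if m is None:               # guard: the lookup may find nothing
        return None
    return m.group(1) or None   # empty text is treated like a missing value

print(find_company_id('<span id="lblCompanyNumber">514345</span>'))  # 514345
print(find_company_id('<html>no span here</html>'))                  # None
```

With BeautifulSoup the guard is the same idea: assign the result of soup.find(...) to a variable, check it against None, and only then access .string.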
The strange behaviors are:

- There are times when urllib2 executes a GET request and, without waiting for the reply, sends the next GET request (ignoring the last one).
- Sometimes I get "[Errno 10054] An existing connection was forcibly closed by the remote host" after the code has simply been stuck for about 20 minutes waiting for a response from the server. While it's stuck, I copy the URL and fetch it manually, and I get a response in less than a second (?).
- getPAGE() will return None to isValid() if it failed to fetch the URL, yet sometimes I get this error:

    Error while parsing this page, third exception block: 'NoneType' object has no attribute 'string' id:....

That's weird, because I create the soup object only if I got a valid result from getPAGE(). The traceback means that soup.find('span', id='lblCompanyNumber') returned None (the span wasn't in the page), so calling .string on it raises the AttributeError whenever I run

    companyID = soup.find('span',id='lblCompanyNumber').string

but if the code reaches that point, soup was built from the HTML that getPAGE() returned, so the span should be there.
I've checked, and the problem seems to be connected to the first one (sending a GET and not waiting for the reply). On Wireshark I saw that each time I got that exception, it was for a URL where urllib2 sent a GET request but didn't wait for the response and moved on. getPAGE() should have returned None for that URL, but if it had returned None, isValid(ID) wouldn't pass the "if page == None:" condition. I can't figure out why this happens, and it's impossible to replicate the issue.
I've read that time.sleep() can cause issues with threaded urllib2, so maybe I should avoid using it?

Why doesn't urllib2 always wait for the response (it happens rarely, but it does happen)?

What can I do about the "[Errno 10054] An existing connection was forcibly closed by the remote host" error? BTW, the exception isn't caught by getPAGE()'s try/except block; it's caught by the first isValid() try/except block, which is also weird, because getPAGE() is supposed to catch all the exceptions it throws.
    try:
        address = urlPath + str(ID)
        page = getPAGE(address)
        if page == None:
            saveToCsv(ID, badRequest=True)
            return False
    except Exception, e:
        print "An error occurred in the first Exception block of parseHTML : " + str(e) + ' address: ' + address
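Re the last question above: one plausible explanation for why the exception lands in isValid() is that in getPAGE() the response.read() call sits in the try statement's else clause, and exceptions raised in an else clause are not caught by that same statement's except blocks. A minimal, self-contained demonstration of the semantics (getPAGE_like is just an illustrative stand-in):

```python
def getPAGE_like():
    try:
        response = 'connected'   # urlopen() succeeds, so no except block fires
    except Exception:
        return None              # never reached in this scenario
    else:
        # response.read() lives here in getPAGE(); if it raises,
        # the except blocks above do NOT catch it
        raise OSError(10054, 'An existing connection was forcibly closed')

try:
    getPAGE_like()
except OSError as e:
    print('caught by the caller, not inside getPAGE_like:', e)
```

So a connection reset that happens while reading the body (rather than while opening the connection) would escape getPAGE() and be caught by isValid()'s outer handler, which matches what's being observed.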
Thanks!