Python urllib2, how to avoid errors - need help

Developer https://www.devze.com 2023-01-27 12:13 Source: Internet
I am using Python's urllib2 to download pages from the web. I am not setting a User-Agent or any other custom headers. I am getting errors like the samples below. Can someone tell me an easy way to avoid them?

http://www.rottentomatoes.com/m/foxy_brown/
The server couldn't fulfill the request.
Error code:  403


http://www.spiritus-temporis.com/marc-platt-dancer-/
The server couldn't fulfill the request.
Error code:  503

http://www.golf-equipment-guide.com/news/Mark-Nichols-(golfer).html!!
The server couldn't fulfill the request.
Error code:  500


http://www.ehx.com/blog/mike-matthews-in-fuzz-documentary!!
We failed to reach a server.
Reason:  timed out
IncompleteRead(5621 bytes read)
Traceback (most recent call last):
    File "download.py", line 43, in <module>
    localFile.write(response.read())
    File "/usr/lib/python2.6/socket.py", line 327, in read
    data = self._sock.recv(rbufsize)
    File "/usr/lib/python2.6/httplib.py", line 517, in read
    return self._read_chunked(amt)
    File "/usr/lib/python2.6/httplib.py", line 563, in _read_chunked
    raise IncompleteRead(value)
IncompleteRead: IncompleteRead(5621 bytes read)

Thank you

Bala


Many web resources require a cookie or some other form of authentication before they can be accessed; your 403 status codes are most likely the result of this.
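As a rough illustration of the cookie point, here is a minimal sketch of an opener that keeps cookies between requests and sends an explicit User-Agent header. It is written against Python 3's urllib.request; in Python 2 the same calls live in urllib2 (build_opener, HTTPCookieProcessor) and cookielib (CookieJar). The helper name make_opener and the User-Agent string are made up for this example, and whether this is enough to avoid a 403 depends entirely on the site.

```python
import http.cookiejar
import urllib.request


def make_opener(user_agent="Mozilla/5.0 (compatible; example-downloader)"):
    """Return (opener, cookie_jar); the opener stores and resends cookies."""
    # The jar keeps any cookies the server sets, so later requests made
    # through the same opener send them back automatically.
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar)
    )
    # Replace the default "Python-urllib/x.y" User-Agent. The value used
    # here is an arbitrary placeholder, not something a site requires.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, cookie_jar


# Possible usage (performs a real network request, so it is left commented out):
# opener, jar = make_opener()
# html = opener.open("http://www.rottentomatoes.com/m/foxy_brown/", timeout=30).read()
```

Pages that sit behind a real login will still need the login request itself to be made through the same opener first.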

503 errors tend to mean that you are requesting resources from a server too rapidly in a loop, and that you need to wait briefly before attempting another request.
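One way to "wait briefly" is to retry a failed request after a pause that grows with each attempt. The sketch below assumes Python 3 (urllib.error.HTTPError is urllib2.HTTPError in Python 2). The function name fetch_with_retry, the retry count, and the delays are illustrative choices, not values any particular server asks for.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, fetch=None, retries=3, delay=1.0, sleep=time.sleep):
    """Fetch url, pausing and retrying when the server answers 503."""
    if fetch is None:
        # Default fetcher: a plain urlopen with a timeout (assumed value).
        fetch = lambda u: urllib.request.urlopen(u, timeout=30).read()
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except urllib.error.HTTPError as err:
            # Only 503 is treated as temporary here. Any other status, or a
            # 503 on the final attempt, is re-raised to the caller.
            if err.code != 503 or attempt == retries:
                raise
            # Wait delay, then 2*delay, then 4*delay, and so on.
            sleep(delay * (2 ** attempt))
```

The fetch and sleep parameters exist only so the function can be exercised without touching the network; in a download loop you would normally call fetch_with_retry(url) and also pause between different URLs on the same host.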

The page in the 500 example does not appear to exist at all...

The URL that timed out probably should not end in "!!"; I can only load the resource without it.
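If the "!!" is a marker left over in the input list rather than part of the address, it can be stripped before the request is made. This is a small sketch under that assumption; clean_url is a hypothetical helper name, and it would be wrong for any URL that genuinely ends in an exclamation mark.

```python
def clean_url(raw):
    """Strip surrounding whitespace and any trailing '!' characters."""
    # Assumes trailing exclamation marks are never part of the real URL.
    return raw.strip().rstrip("!")


print(clean_url("http://www.ehx.com/blog/mike-matthews-in-fuzz-documentary!!"))
```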

I recommend you read up on HTTP status codes.
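As a quick starting point, the standard library already carries the standard reason phrase for each code. The sketch below uses Python 3's http.client.responses (httplib.responses in Python 2); describe_status is a made-up helper that also notes whether the code is in the 4xx (client error) or 5xx (server error) range.

```python
import http.client


def describe_status(code):
    """Return a short description such as '403 Forbidden (client error)'."""
    reason = http.client.responses.get(code, "Unknown")
    if 400 <= code < 500:
        kind = "client error"   # the request itself was refused or invalid
    elif 500 <= code < 600:
        kind = "server error"   # the server failed while handling the request
    else:
        kind = "other"
    return "%d %s (%s)" % (code, reason, kind)


for code in (403, 503, 500):
    print(describe_status(code))
```

Broadly, a 4xx code suggests changing the request (headers, cookies, URL), while a 5xx code suggests the problem is on the server side and may or may not clear up on a later attempt.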


For more complicated tasks, you might want to consider using mechanize, twill, or even Selenium or Windmill, which support more complicated scenarios, including cookies and JavaScript.

For arbitrary websites, it can be tricky to work around these problems with urllib2 alone (signed cookies, anyone?).
