I found that you can't read from some sites using Python's urllib2 (or urllib). An example...
urllib2.urlopen("http://www.dafont.com/").read()
# Returns ''
These sites work when you visit them with a browser. I can even scrape them using PHP (I didn't try other languages). I have seen other sites with the same issue, but I can't remember the URLs at the moment.
My questions are...
- What is the cause of this issue?
- Any workarounds?
I believe the request gets blocked based on the User-Agent header. You can change the User-Agent with the following sample code:
import urllib2

USERAGENT = 'something'  # e.g. a browser-like User-Agent string
HEADERS = {'User-Agent': USERAGENT}

# Build the request with the custom header, then open and read it
req = urllib2.Request(URL_HERE, headers=HEADERS)
f = urllib2.urlopen(req)
s = f.read()
f.close()
Try setting a different user agent. Check the answers in this link.
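For reference, here is a rough Python 3 equivalent of the answer above (urllib2 was merged into urllib.request in Python 3). The User-Agent string below is just an example browser-like value, not a specific recommendation:

```python
import urllib.request

URL = "http://www.dafont.com/"  # the site from the question
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0"}

req = urllib.request.Request(URL, headers=HEADERS)
# The header is stored on the request (with capitalized key) before sending:
print(req.get_header("User-agent"))  # the UA string set above
# To actually fetch the page: urllib.request.urlopen(req).read()
```

The `urlopen` call is commented out so the sketch runs without network access; uncomment it to perform the real request.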
I'm the guy who posted the question. I have some suspicions, but I'm not sure about them; that's why I posted the question here.
What is the cause of this issue?
I think it's due to the host blocking the urllib library, e.g. via robots.txt or .htaccess rules. But I'm not sure about that, or even whether it's possible.
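One way to see what a server could match against is to look at the default User-Agent that urllib sends; a server-side filter (e.g. an .htaccess rule) can recognize it and refuse the request. A small sketch using Python 3's urllib.request (Python 2's urllib2 default is similar, `Python-urllib/2.x`):

```python
import urllib.request

# An OpenerDirector attaches a default User-Agent header to every request
# unless you override it; this is what a server-side filter can match on.
opener = urllib.request.build_opener()
print(opener.addheaders)  # e.g. [('User-agent', 'Python-urllib/3.11')]
```

Overriding this header, as shown in the answer above, is usually enough to get past such filters.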
Any workaround for this issue?
If you are on Unix, this will work...

import commands  # Python 2 only; shells out to curl
contents = commands.getoutput("curl -s '" + url + "'")
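Note that the `commands` module is Python 2-only and was removed in Python 3. A rough modern equivalent using `subprocess`, assuming `curl` is on the PATH; the helper names here are mine, not from the original answer:

```python
import subprocess

def build_curl_cmd(url, user_agent="Mozilla/5.0"):
    # -s: silent mode, -A: send a browser-like User-Agent so the site
    # does not see curl's default one.
    return ["curl", "-s", "-A", user_agent, url]

def fetch_with_curl(url):
    # Passing an argument list (not a shell string) avoids the quoting
    # problems of building the command by string concatenation.
    result = subprocess.run(build_curl_cmd(url),
                            capture_output=True, text=True)
    return result.stdout
```

Usage: `contents = fetch_with_curl("http://www.dafont.com/")`.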