开发者

Script always gets a 302 response when pulling random pages from Wikipedia

开发者 https://www.devze.com 2023-02-27 09:35 出处:网络
I can pull a any page from wikipedia with import httplib conn = httplib.HTTPConnecti开发者_如何转开发on(\"en.wikipedia.org\")

I can pull a any page from wikipedia with

import httplib
conn = httplib.HTTPConnecti开发者_如何转开发on("en.wikipedia.org")
conn.debuglevel = 1
conn.request("GET","/wiki/Normal_Distribution",headers={'User-Agent':'Python httplib'})
r1 = conn.getresponse()
r1.read()

The normal response will be

reply: 'HTTP/1.0 200 OK\r\n'
header: Date: Sun, 03 Apr 2011 23:49:36 GMT
header: Server: Apache
header: Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
header: Content-Language: en
header: Vary: Accept-Encoding,Cookie
header: Last-Modified: Sun, 03 Apr 2011 17:23:50 GMT
header: Content-Length: 263638
header: Content-Type: text/html; charset=UTF-8
header: Age: 1280309
header: X-Cache: HIT from sq77.wikimedia.org
header: X-Cache-Lookup: HIT from sq77.wikimedia.org:3128
header: X-Cache: MISS from sq66.wikimedia.org
header: X-Cache-Lookup: MISS from sq66.wikimedia.org:80
header: Connection: close

But if I try to pull a random page with /wiki/Special:Random I get a 302 response and an empty page

reply: 'HTTP/1.0 302 Moved Temporarily\r\n'
header: Date: Mon, 18 Apr 2011 19:25:52 GMT
header: Server: Apache
header: Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
header: Vary: Accept-Encoding,Cookie
header: Expires: Thu, 01 Jan 1970 00:00:00 GMT
header: Location: http://en.wikipedia.org/wiki/Tuticorin_Port_Trust
header: Content-Length: 0
header: Content-Type: text/html; charset=utf-8
header: X-Cache: MISS from sq60.wikimedia.org
header: X-Cache-Lookup: MISS from sq60.wikimedia.org:3128
header: X-Cache: MISS from sq62.wikimedia.org
header: X-Cache-Lookup: MISS from sq62.wikimedia.org:80
header: Connection: close

How do I get a non-empty random page?


The 302 is a redirect. It's telling you where to go in the following line:

header: Location: http://en.wikipedia.org/wiki/tuticorin_port_trust 

You just need to follow the redirect.


When you're redirected the response object is going to have a code of 302 and the geturl() method will report the redirect URL. Python's standard HTTP libraries make it non-trivial to handled redirects by default. Do yourself a favor, don't hassle with this stuff and use the 3rd party mechanize library, which is a drop-in replacement for urllib2.

Using mechanize, your code would look like this:

import httplib
import mechanize

host = 'en.wikipedia.org'
path = '/wiki/Special:Random'
url = 'http://' + host + path # We have to pass a http:// url

# It still uses httplib.HTTPConnection, so we can debug
httplib.HTTPConnection.debuglevel = 1

request = mechanize.Request(url, headers={'User-Agent': 'Python-mechanize'}) 
response = mechanize.urlopen(request)

print response.code
# => 200
print response.geturl()
# => 'http://en.wikipedia.org/wiki/Faliszowice,_Lesser_Poland_Voivodeship'
data = response.read()


HTTP code 302 means you are being redirected. If you look at the Location header, you will see where you should make your new request. Then you can make the request to that URL and you'll hopefully get a 200 on that page.

To clarify: You are being requested to retry the request elsewhere. That's why your client needs to make another request when it receives a 302. Wikipedia's random page apparently works by choosing a random page in its database, then returning a 302 response with the new page as the Location field. If you look at other 302 responses, I'm sure you'll see a different page in the Location field.


Look at the location header:

header: Location: http://en.wikipedia.org/wiki/Tuticorin_Port_Trust

It says you should redirected to that page. Read that header and do another request to that page.

0

精彩评论

暂无评论...
验证码 换一张
取 消