Script always gets a 302 response when pulling random pages from Wikipedia_问答_开发者

I can pull a any page from wikipedia with

import httplib
conn = httplib.HTTPConnecti开发者_如何转开发on("en.wikipedia.org")
conn.debuglevel = 1
conn.request("GET","/wiki/Normal_Distribution",headers={'User-Agent':'Python httplib'})
r1 = conn.getresponse()
r1.read()

The normal response will be

reply: 'HTTP/1.0 200 OK\r\n'
header: Date: Sun, 03 Apr 2011 23:49:36 GMT
header: Server: Apache
header: Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
header: Content-Language: en
header: Vary: Accept-Encoding,Cookie
header: Last-Modified: Sun, 03 Apr 2011 17:23:50 GMT
header: Content-Length: 263638
header: Content-Type: text/html; charset=UTF-8
header: Age: 1280309
header: X-Cache: HIT from sq77.wikimedia.org
header: X-Cache-Lookup: HIT from sq77.wikimedia.org:3128
header: X-Cache: MISS from sq66.wikimedia.org
header: X-Cache-Lookup: MISS from sq66.wikimedia.org:80
header: Connection: close

But if I try to pull a random page with /wiki/Special:Random I get a 302 response and an empty page

reply: 'HTTP/1.0 302 Moved Temporarily\r\n'
header: Date: Mon, 18 Apr 2011 19:25:52 GMT
header: Server: Apache
header: Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
header: Vary: Accept-Encoding,Cookie
header: Expires: Thu, 01 Jan 1970 00:00:00 GMT
header: Location: http://en.wikipedia.org/wiki/Tuticorin_Port_Trust
header: Content-Length: 0
header: Content-Type: text/html; charset=utf-8
header: X-Cache: MISS from sq60.wikimedia.org
header: X-Cache-Lookup: MISS from sq60.wikimedia.org:3128
header: X-Cache: MISS from sq62.wikimedia.org
header: X-Cache-Lookup: MISS from sq62.wikimedia.org:80
header: Connection: close

How do I get a non-empty random page?

The 302 is a redirect. It's telling you where to go in the following line:

header: Location: http://en.wikipedia.org/wiki/tuticorin_port_trust

You just need to follow the redirect.

When you're redirected the response object is going to have a code of 302 and the geturl() method will report the redirect URL. Python's standard HTTP libraries make it non-trivial to handled redirects by default. Do yourself a favor, don't hassle with this stuff and use the 3rd party mechanize library, which is a drop-in replacement for urllib2.

Using mechanize, your code would look like this:

import httplib
import mechanize

host = 'en.wikipedia.org'
path = '/wiki/Special:Random'
url = 'http://' + host + path # We have to pass a http:// url

# It still uses httplib.HTTPConnection, so we can debug
httplib.HTTPConnection.debuglevel = 1

request = mechanize.Request(url, headers={'User-Agent': 'Python-mechanize'}) 
response = mechanize.urlopen(request)

print response.code
# => 200
print response.geturl()
# => 'http://en.wikipedia.org/wiki/Faliszowice,_Lesser_Poland_Voivodeship'
data = response.read()

HTTP code 302 means you are being redirected. If you look at the Location header, you will see where you should make your new request. Then you can make the request to that URL and you'll hopefully get a 200 on that page.

To clarify: You are being requested to retry the request elsewhere. That's why your client needs to make another request when it receives a 302. Wikipedia's random page apparently works by choosing a random page in its database, then returning a 302 response with the new page as the Location field. If you look at other 302 responses, I'm sure you'll see a different page in the Location field.

Look at the location header: