I made a little parser using HTMLParser and I would like to know where a link redirects to. I don't know how to explain this, so please look at this example:
On my page I have a link in the source, http://www.myweb.com?out=147, which redirects to http://www.mylink.com. I can parse http://www.myweb.com?out=147 without any problems, but I don't know how to get http://www.mylink.com.
You can use urllib2 (urllib.request in Python 3) and its HTTPRedirectHandler to find out where a URL will redirect you. Here's a function that does that:
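    # Python 2 (urllib2 became urllib.request in Python 3)
    import urllib2

    def get_redirected_url(url):
        # The opener follows redirects; the response records the final URL.
        opener = urllib2.build_opener(urllib2.HTTPRedirectHandler)
        response = opener.open(url)
        return response.url

    print get_redirected_url("http://google.com/")
    # prints "http://www.google.com/"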
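For Python 3, the same idea can be sketched with urllib.request. This is a minimal sketch, not a drop-in replacement: it assumes the default opener is acceptable (it already includes a redirect handler), and it reuses the function name from the snippet above.

```python
import urllib.request


def get_redirected_url(url):
    """Return the URL that `url` finally resolves to after redirects."""
    # urlopen() follows HTTP redirects by default; geturl() reports the
    # URL of the resource that was actually retrieved.
    with urllib.request.urlopen(url) as response:
        return response.geturl()
```

Calling get_redirected_url("http://www.myweb.com?out=147") would then return the address the link ends up at, provided the server answers with an HTTP redirect.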
You cannot get hold of the redirect URL by parsing the HTML source code. HTTP redirects are issued by the server, not by the client. You need to perform an HTTP request to the URL in question and check the server's HTTP response, in particular for a 3xx redirect status code (such as 301 or 302; 304 means "Not Modified" and is not a redirect) and the new URL, which the server sends in the Location header.
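To look at the server's response yourself instead of following the redirect, something like the following might work. It is a rough Python 3 sketch using http.client, which does not follow redirects on its own. The function name get_redirect_target and the REDIRECT_CODES set are made up for this illustration, and it assumes a plain http or https URL without credentials in it.

```python
import http.client
from urllib.parse import urljoin, urlsplit

# Status codes that tell the client to go to another URL.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def get_redirect_target(url):
    """Return the URL that `url` redirects to, or None if it does not redirect."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        # Rebuild the request target; an empty path must be sent as "/".
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
        conn.request("GET", target)
        response = conn.getresponse()
        if response.status not in REDIRECT_CODES:
            return None
        location = response.getheader("Location")
        # The Location header may be relative, so resolve it against `url`.
        return urljoin(url, location) if location else None
    finally:
        conn.close()
```

This only inspects the first hop. If the target redirects again, you would have to call the function on the returned URL until it gives back None.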