URL redirection problem

开发者 (Developer) https://www.devze.com 2022-12-20 16:16 Source: the web

I have the URL below:

http://bit.ly/cDdh1c

When you enter this URL in a browser, it redirects to http://www.kennystopproducts.info/Top/?hop=arnishad

However, when I try to find the base URL (the final URL once all redirects have been followed) for the same http://bit.ly/cDdh1c with a Python program (code below), I get http://www.cbtrends.com/ as the base URL. Please see the log file below.

Why does the same URL behave differently in a browser and in a Python program? What should I change in the Python program so that it redirects to the proper URL? I am wondering how this strange behaviour can happen.

Another URL for which I observe similar behaviour is:

http://bit.ly/bEKyOx ====> http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&item=150413977509 (via browser)

http://bit.ly/bEKyOx ====> http://www.ebay.com (via Python program)

    maxattempts = 5
    turl = url
    while maxattempts > 0:
        host, path = urlparse.urlsplit(turl)[1:3]
        if len(host.strip()) == 0:
            return None

        try:
            connection = httplib.HTTPConnection(host, timeout=10)
            connection.request("HEAD", path)
            resp = connection.getresponse()
        except:
            return None
        maxattempts = maxattempts - 1
        if (resp.status >= 300) and (resp.status <= 399):
            self.logger.debug("The present %s is a redirection one" % turl)
            turl = resp.getheader('location')
        elif (resp.status >= 200) and (resp.status <= 299):
            self.logger.debug("The present url %s is a proper one" % turl)
            return turl
        else:
            # some problem with this url
            return None
    return None

Log file for your reference

2010-02-14 10:29:43,260 - paypallistener.views.MrCrawler - INFO - Bringing down the base URL
2010-02-14 10:29:43,261 - paypallistener.views.MrCrawler - DEBUG - what is the url=http://bit.ly/cDdh1c
2010-02-14 10:29:43,994 - paypallistener.views.MrCrawler - DEBUG - The present http://bit.ly/cDdh1c is a redirection one
2010-02-14 10:29:43,995 - paypallistener.views.MrCrawler - DEBUG - what is the url=http://www.cbtrends.com/get-product.html?productid=reFfJcmpgGt95hoiavbXUAYIMP7OfiQn0qBA8BC7%252BV8%253D&affid=arnishad&tid=arnishad&utm_source=twitterfeed&utm_medium=twitter
2010-02-14 10:29:44,606 - paypallistener.views.MrCrawler - DEBUG - The present http://www.cbtrends.com/get-product.html?productid=reFfJcmpgGt95hoiavbXUAYIMP7OfiQn0qBA8BC7%252BV8%253D&affid=arnishad&tid=arnishad&utm_source=twitterfeed&utm_medium=twitter is a redirection one
2010-02-14 10:29:44,607 - paypallistener.views.MrCrawler - DEBUG - what is the url=http://www.cbtrends.com/
2010-02-14 10:29:45,547 - paypallistener.views.MrCrawler - DEBUG - The present url http://www.cbtrends.com/ is a proper one
http://www.cbtrends.com/

Your problem is that when you call urlsplit, your path variable only contains the path and is missing the query.

So, instead try:

import httplib
import urlparse

def getUrl(url):
    maxattempts = 10
    turl = url
    while maxattempts > 0:
        # Take the query along with the host and path so it is not dropped.
        host, path, query = urlparse.urlsplit(turl)[1:4]
        if len(host.strip()) == 0:
            return None
        # Request "/" for an empty path; append "?" only when there is a query.
        target = path or '/'
        if query:
            target = target + '?' + query
        try:
            connection = httplib.HTTPConnection(host, timeout=10)
            connection.request("GET", target)
            resp = connection.getresponse()
        except Exception:
            return None
        maxattempts = maxattempts - 1
        if 300 <= resp.status <= 399:
            # Resolve a relative Location header against the current URL.
            turl = urlparse.urljoin(turl, resp.getheader('location'))
        elif 200 <= resp.status <= 299:
            return turl
        else:
            # some problem with this url
            return None
    return None

print getUrl('http://bit.ly/cDdh1c')

Your problem comes from this line:

host,path = urlparse.urlsplit(turl)[1:3]

You are leaving out the query string. In the log you provided, the second HEAD request therefore goes to http://www.cbtrends.com/get-product.html without the GET parameters. Open that URL in a browser and you will see that it redirects to http://www.cbtrends.com/.
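To make the slicing issue concrete, here is a small sketch of what urlsplit returns for a URL shaped like the one in the log. It assumes Python 3, where the urlparse module used in the question is available as urllib.parse, and the productid value is a shortened, hypothetical stand-in for the real one.

```python
# Python 3: the Python 2 "urlparse" module is available as urllib.parse.
from urllib.parse import urlsplit

# Hypothetical, shortened stand-in for the cbtrends.com URL in the log.
url = "http://www.cbtrends.com/get-product.html?productid=abc&affid=arnishad"

parts = urlsplit(url)          # (scheme, netloc, path, query, fragment)
host, path = parts[1:3]        # the slice used in the question: no query
query = parts[3]               # the element that the slice leaves out

print(host)
print(path)
print(query)
```

The slice [1:3] stops before index 3, so the request in the question is sent for the bare path only.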

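You have to build the request path from both the path and the query returned by urlsplit. The fragment (the part after #) is only used by the client and is not sent to the server, so leave it out:

parts = urlparse.urlsplit(turl)
host = parts[1]
path = parts[2]
if parts[3]:
    path = "%s?%s" % (parts[2], parts[3])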
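Putting the pieces together, the redirect loop could look roughly like the sketch below. It is written for Python 3 (http.client and urllib.parse instead of httplib and urlparse), and the names request_target, head and resolve are hypothetical, not from the question. Like the original code it assumes plain HTTP only. It leaves the fragment out of the request, since that part is not sent to the server, and it resolves relative Location headers with urljoin.

```python
import http.client
from urllib.parse import urljoin, urlsplit


def request_target(url):
    """Return the path plus query for the request line (fragment omitted)."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


def head(url, timeout=10):
    """Send one HEAD request over plain HTTP; return (status, Location or None)."""
    conn = http.client.HTTPConnection(urlsplit(url).netloc, timeout=timeout)
    try:
        conn.request("HEAD", request_target(url))
        resp = conn.getresponse()
        return resp.status, resp.getheader("location")
    finally:
        conn.close()


def resolve(url, fetch=head, max_hops=5):
    """Follow up to max_hops redirects; return the final URL or None."""
    for _ in range(max_hops):
        if not urlsplit(url).netloc:
            return None
        try:
            status, location = fetch(url)
        except (OSError, http.client.HTTPException):
            return None
        if 300 <= status <= 399 and location:
            url = urljoin(url, location)  # also handles relative redirects
        elif 200 <= status <= 299:
            return url
        else:
            return None
    return None
```

Calling resolve('http://bit.ly/cDdh1c') would use the real network. The fetch parameter exists only so the loop can be exercised with a stub that returns (status, location) pairs.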
