I need to parse a URL. I'm currently using urlparse.urlparse() and urlparse.urlsplit().
The problem is that I can't get the "netloc" (host) from the URL when the scheme is missing. For example, given the following URL:
www.amazon.com/Programming-Python-Mark-Lutz/dp/0596158106/ref=sr_1_1?ie=UTF8&qid=1308060974&sr=8-1
I can't get the netloc: www.amazon.com
According to the Python docs:
Following the syntax specifications in RFC 1808, urlparse recognizes a netloc only if it is properly introduced by ‘//’. Otherwise the input is presumed to be a relative URL and thus to start with a path component.
So it works this way on purpose, but I still don't know how to get the netloc from that URL.
I think I could check whether the scheme is present, add it if it's not, and then parse the result. But this solution doesn't seem very good.
Do you have a better idea?
EDIT: Thanks for all the answers. However, I can't use the "startswith" approach proposed by Corey and others, because a URL with another protocol/scheme would get mangled. See:
If I get this URL:
ftp://something.com
With the proposed code, I would prepend "http://" to it and break it.
The solution I found:
if not urlparse.urlparse(url).scheme:
    url = "http://" + url
return urlparse.urlparse(url)
Something to note:
I do some validation first, and if no scheme is given, I assume it is http://.
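The snippet above can be wrapped into a small function, sketched here under a few assumptions: the name parse_with_default_scheme and its default_scheme parameter are made up for illustration, and the import is the Python 3 location of the same function (on Python 2 it would be "from urlparse import urlparse").

```python
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse


def parse_with_default_scheme(url, default_scheme="http"):
    """Parse url, prepending default_scheme + '://' when no scheme is present.

    Hypothetical helper: the name and the default are illustrative only.
    """
    if not urlparse(url).scheme:
        url = default_scheme + "://" + url
    return urlparse(url)


print(parse_with_default_scheme("www.amazon.com/Programming-Python-Mark-Lutz").netloc)
print(parse_with_default_scheme("ftp://something.com").netloc)
```

Because the check looks at the parsed scheme rather than at a literal "http" prefix, a URL such as ftp://something.com is left untouched.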
It looks like you need to specify the protocol to get the netloc.
Adding it when it's not present might look like this:
import urlparse
url = 'www.amazon.com/Programming-Python-Mark-Lutz'
if '//' not in url:
    url = '%s%s' % ('http://', url)
p = urlparse.urlparse(url)
print p.netloc
More about the issue: https://bugs.python.org/issue754016
The documentation has this exact example, just below the text you pasted. Adding '//' if it's not there will get what you want. If you don't know whether it'll have the protocol and '//' you can use a regex (or even just see if it already contains '//') to determine whether or not you need to add it.
Your other option would be to use split('/') and take the first element of the list it returns, which will ONLY work when the url has no protocol or '//'.
EDIT (adding for future readers): a regex for detecting the protocol would be something like re.match('(?:http|ftp|https)://', url)
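As a rough sketch of the regex idea, the pattern can be widened from a fixed list of protocols to any scheme-like prefix. The pattern follows the scheme grammar in RFC 3986 (a letter, then letters, digits, "+", "-" or "."); the names SCHEME_RE and has_scheme are assumptions made for this example.

```python
import re

# A letter followed by letters, digits, "+", "-" or ".", then "://".
# re.match only matches at the start of the string.
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*://")


def has_scheme(url):
    """Return True when url starts with something like 'http://' or 'ftp://'."""
    return SCHEME_RE.match(url) is not None


print(has_scheme("ftp://something.com"))                 # True
print(has_scheme("www.amazon.com/Programming-Python"))   # False
print(has_scheme("www.amazon.com/r?u=http://other.com")) # False
```

The last call shows one difference from a plain "'//' in url" test: a "//" that appears later in the string, for instance inside a query parameter, does not count as a scheme.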
If the protocol is always http you can use only one line:
return "http://" + url.split("://")[-1]
A better option is to keep the protocol if one was passed:
return url if "://" in url else "http://" + url
From the docs:
Following the syntax specifications in RFC 1808, urlparse recognizes a netloc only if it is properly introduced by ‘//’. Otherwise the input is presumed to be a relative URL and thus to start with a path component.
So you can just do:
In [1]: from urlparse import urlparse
In [2]: def get_netloc(u):
   ...:     if not u.startswith('http'):
   ...:         u = '//' + u
   ...:     return urlparse(u).netloc
   ...:
In [3]: get_netloc('www.amazon.com/Programming-Python-Mark-Lutz/dp/0596158106/ref=sr_1_1?ie=UTF8&qid=1308060974&sr=8-1')
Out[3]: 'www.amazon.com'
In [4]: get_netloc('http://www.amazon.com/Programming-Python-Mark-Lutz/dp/0596158106/ref=sr_1_1?ie=UTF8&qid=1308060974&sr=8-1')
Out[4]: 'www.amazon.com'
In [5]: get_netloc('https://www.amazon.com/Programming-Python-Mark-Lutz/dp/0596158106/ref=sr_1_1?ie=UTF8&qid=1308060974&sr=8-1')
Out[5]: 'www.amazon.com'
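Note that the startswith('http') test would also prepend '//' to ftp://something.com, and would skip it for a bare host such as httpbin.org/get. One possible variant, sketched below, parses the input as-is first and only retries with '//' when no netloc was found. The name netloc_of is hypothetical, the import is the Python 3 location (Python 2: "from urlparse import urlparse"), and the sketch assumes that a scheme-less input really does begin with a host.

```python
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse


def netloc_of(u):
    """Return the netloc of u, retrying with a leading '//' if none is found."""
    netloc = urlparse(u).netloc
    if netloc:
        return netloc
    # No netloc was recognized, so treat the start of u as the host.
    return urlparse("//" + u).netloc


print(netloc_of("www.amazon.com/Programming-Python-Mark-Lutz/dp/0596158106"))
print(netloc_of("ftp://something.com"))
print(netloc_of("httpbin.org/get"))
```

This avoids guessing from the first few characters, at the cost of parsing some inputs twice.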
Have you considered just checking for "http://" at the start of the URL and adding it if it's not there? Another solution, assuming the first part really is the netloc and not part of a relative URL, is to grab everything up to the first "/" and use that as the netloc.
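The "everything up to the first /" idea could look roughly like this. The function name first_segment is made up for the example, and the sketch deliberately assumes the input has no scheme.

```python
def first_segment(url):
    """Return the text before the first '/', assuming url has no scheme."""
    # str.partition returns (head, separator, tail); head is the whole
    # string when the separator is absent.
    return url.partition("/")[0]


print(first_segment("www.amazon.com/Programming-Python-Mark-Lutz"))  # www.amazon.com
print(first_segment("www.amazon.com"))                               # www.amazon.com
print(first_segment("ftp://something.com"))                          # ftp:
```

The last call shows why this only suits inputs known to lack a scheme: with one present, the first segment is the scheme and its colon rather than the host.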
This one-liner would do it:
netloc = urlparse('//' + ''.join(urlparse(url)[1:])).netloc
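To see what the one-liner does, it can be unpacked into steps. This is only an illustrative breakdown: the function name one_liner_netloc is an assumption, and the import is the Python 3 location (Python 2: "from urlparse import urlparse").

```python
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse


def one_liner_netloc(url):
    parts = urlparse(url)
    # parts[1:] is (netloc, path, params, query, fragment). Joining with ""
    # drops the scheme, and also drops the ";", "?" and "#" separators.
    rest = "".join(parts[1:])
    # With "//" in front, the leading host is now recognized as the netloc.
    return urlparse("//" + rest).netloc


print(one_liner_netloc("www.amazon.com/Programming-Python-Mark-Lutz?ie=UTF8"))
print(one_liner_netloc("ftp://something.com/pub"))
print(one_liner_netloc("www.amazon.com?ie=UTF8"))
```

The first two calls print www.amazon.com and something.com. The third prints www.amazon.comie=UTF8, because the "?" separator is lost when the parts are joined, so the approach seems safest when the host is followed by a "/".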