I'm trying to create a regular expression that will match the third instance of a / in a URL, so that only the website's name itself is kept and nothing else.
So http://www.stackoverflow.com/questions/answers/help/ after being put through the regex will be http://www.stackoverflow.com
I've been playing about with them myself and come up with:
base_url = re.sub(r'[/].*', r'', url)
but all this does is reduce a link to http: - so it's obvious I need to match the third instance of / - can anyone explain how I would do this?
Thanks!
I suggest you use urlparse (urllib.parse in Python 3) for parsing URLs:
In [1]: from urlparse import urlparse
In [2]: urlparse('http://www.stackoverflow.com/questions/answers/help/').netloc
Out[2]: 'www.stackoverflow.com'
.netloc includes the port number if present (e.g. www.stackoverflow.com:80); if you don't want the port number, use .hostname instead.
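Since the question wants the scheme kept as well (http://www.stackoverflow.com rather than just the host), the parsed pieces can be put back together. Here is a minimal sketch, assuming Python 3, where the module lives at urllib.parse; the helper name base_url and its keep_port flag are made up for illustration and are not part of the library:

```python
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse


def base_url(url, keep_port=True):
    """Return scheme://host for url (base_url is a hypothetical helper)."""
    parts = urlparse(url)
    # .netloc keeps any port; .hostname drops it (and lowercases the host)
    host = parts.netloc if keep_port else parts.hostname
    return '{}://{}'.format(parts.scheme, host)


print(base_url('http://www.stackoverflow.com/questions/answers/help/'))
print(base_url('http://www.stackoverflow.com:80/questions/', keep_port=False))
```

Both calls should print http://www.stackoverflow.com. Note that this sketch does not handle strings without a scheme, where urlparse leaves .netloc empty.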
urlparse would work, but since you originally asked about regexes, try a positive match instead of a negative substitution:
import re
match = re.match(r'.+?://[^/]+', url)  # lazy .+? stops at the first ://
base_url = match.group()
This will grab the http:// (or https://, or ftp://), and everything after it until the first /.
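As a quick check of that pattern, here is a small sketch wrapped in a function. The names BASE_RE and base_url_re are hypothetical, and the lazy .+? is a variation chosen so that a second :// later in the URL (say, inside a query string) is not swallowed by the scheme part; a string containing no :// at all comes back unchanged.

```python
import re

# scheme, then ://, then everything up to (not including) the next /
BASE_RE = re.compile(r'.+?://[^/]+')


def base_url_re(url):
    """Hypothetical helper: return scheme://host, or url itself if no match."""
    match = BASE_RE.match(url)
    return match.group() if match else url


print(base_url_re('http://www.stackoverflow.com/questions/answers/help/'))
print(base_url_re('https://example.com/go?to=http://other.example/x'))
```

The first call should give http://www.stackoverflow.com and the second https://example.com. Unlike urlparse, this does no validation, so it is only as good as the input it is fed.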
http://www.tutorialspoint.com/python/python_reg_expressions.htm