I'm having trouble with this regex and I think I'm almost there.
import re
m = re.findall(r'[a-z]{6}\.[a-z]{3}\.[a-z]{2} (?=\" target)', 'http://domain.com.uy " target')
This gives me the "exact" output that I want, that is, domain.com.uy,
but obviously this only works for this example, since [a-z]{6}
matches exactly six characters, and that is not what I want.
I want it to return domain.com.uy
So basically the instruction would be: match any character until "/" is encountered (backwards).
Edit:
m = re.findall(r'\w+\.[a-z]{3}\.[a-z]{2} (?=\" target)', 'http://domain.com.uy " target')
This is very close to what I want, but it won't match "-" (\w does cover "_", but not the hyphen).
For the sake of completeness I do not need the http://
I hope the question is clear enough; if I left anything open to interpretation, please ask for clarification!
Thanks in advance!
Another option is to use a positive lookbehind such as (?<=//):
>>> re.search(r'(?<=//).+(?= \" target)',
... 'http://domain.com.uy " target').group(0)
'domain.com.uy'
Note that this will match slashes within the url itself, if that's desired:
>>> re.search(r'(?<=//).+(?= \" target)',
... 'http://example.com/path/to/whatever " target').group(0)
'example.com/path/to/whatever'
If you just wanted the bare domain, without any path or query parameters, you could use r'(?<=//)([^/]+)(/.*)?(?= \" target)' and capture group 1:
>>> re.search(r'(?<=//)([^/]+)(/.*)?(?= \" target)',
... 'http://example.com/path/to/whatever " target').groups()
('example.com', '/path/to/whatever')
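As a minimal sketch, the same pattern can be wrapped in a small helper that returns only group 1. The function name bare_domain and the None fallback are illustrative assumptions, not part of the answer above:

```python
import re

# Same pattern as above: lookbehind for "//", capture everything up to the
# first "/", optionally consume a path, and require ' " target' right after.
BARE_DOMAIN = re.compile(r'(?<=//)([^/]+)(/.*)?(?= " target)')

def bare_domain(text):
    """Return the host part before ' " target', or None if nothing matches.

    Hypothetical helper for illustration only.
    """
    match = BARE_DOMAIN.search(text)
    return match.group(1) if match else None

print(bare_domain('http://example.com/path/to/whatever " target'))  # example.com
print(bare_domain('http://domain.com.uy " target'))                 # domain.com.uy
print(bare_domain('no url here'))                                   # None
```

Because [^/]+ has no slash to stop at when the URL has no path, the match relies on backtracking to the point where ' " target' follows.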
Try this (in Python's re module, / does not need to be escaped):
/([^/]*)$
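A short sketch of how this pattern behaves on the sample string from the question. Since it anchors at the end of the string, group 1 holds everything after the last slash, including the trailing ' " target'; the split() step to trim it is an assumption added here for illustration:

```python
import re

sample = 'http://domain.com.uy " target'

# "/" is an ordinary character in Python regexes, so it is written unescaped.
tail = re.search(r'/([^/]*)$', sample).group(1)
print(tail)             # domain.com.uy " target

# Assumed extra step: keep only the first whitespace-separated token.
domain = tail.split()[0]
print(domain)           # domain.com.uy
```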
If regular expressions are not a requirement and you simply wish to extract the FQDN from the URL in Python, use urlparse and str.split():
>>> from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
>>> url = 'http://domain.com.uy " target'
>>> urlparse(url)
ParseResult(scheme='http', netloc='domain.com.uy " target', path='', params='', query='', fragment='')
This has broken up the URL into its component parts. We want netloc:
>>> urlparse(url).netloc
'domain.com.uy " target'
Split on whitespace:
>>> urlparse(url).netloc.split()
['domain.com.uy', '"', 'target']
Just the first part:
>>> urlparse(url).netloc.split()[0]
'domain.com.uy'
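The steps above can be collected into one function. This is only a sketch, assuming Python 3 (where urlparse lives in urllib.parse); the name fqdn_from_text and the None fallback for input without a network location are assumptions for illustration:

```python
from urllib.parse import urlparse  # Python 3 location of urlparse

def fqdn_from_text(text):
    """Return the first whitespace-separated token of the URL's netloc.

    Hypothetical helper; returns None when urlparse finds no netloc.
    """
    netloc = urlparse(text).netloc
    parts = netloc.split()
    return parts[0] if parts else None

print(fqdn_from_text('http://domain.com.uy " target'))                 # domain.com.uy
print(fqdn_from_text('http://example.com/path/to/whatever " target'))  # example.com
print(fqdn_from_text('no url here'))                                   # None
```

When the URL has a path, netloc already stops at the first slash, so the split only matters for the trailing ' " target' case.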
It's as simple as this:
[^/]+(?= " target)
But be aware that a URL with a path, such as http://domain.com/folder/site.php, will not return the domain. And remember to escape the regex properly in a string.
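To make the caveat concrete, here is a small sketch running the pattern on both kinds of input. The second string, with ' " target' appended to the path URL, is an assumed test input:

```python
import re

# Raw string so the pattern reaches the regex engine unchanged.
pattern = re.compile(r'[^/]+(?= " target)')

no_path = pattern.search('http://domain.com.uy " target').group(0)
with_path = pattern.search('http://domain.com/folder/site.php " target').group(0)

print(no_path)    # domain.com.uy
print(with_path)  # site.php
```

With a path present, the only slash-free run directly followed by ' " target' is the last path segment, so that segment is returned instead of the domain.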