when I try to extract this video ID (AIiMa2Fe-ZQ) with a regex expression, I开发者_开发技巧 can't get the dash an all the letters after.
>>> id = re.search('(?<=\?v\=)\w+', 'http://www.youtube.com/watch?v=AIiMa2Fe-ZQ')
>>> print id.group(0)
>>> AIiMa2Fe
Intead of \w+ use below. Word character (\w) doesn't include a dash. It only includes [a-zA-Z_0-9].
[\w-]+
I don't know the pattern for youtube hashes, but just include the "-" in the possibilities as it is not considered an alpha:
import re
id = re.search('(?<=\?v\=)[\w-]+', 'http://www.youtube.com/watch?v=AIiMa2Fe-ZQ')
print id.group(0)
I have edited the above because as it turns out:
>>> re.search("[\w|-]", "|").group(0)
'|'
The "|" in the character definition does not act as a special character but does indeed match the "|" pipe. My apologies.
>>> re.search('(?<=v=)[\w-]+', 'http://www.youtube.com/watch?v=AIiMa2Fe-ZQ').group()
'AIiMa2Fe-ZQ'
\w
is a short-hand for [a-zA-Z0-9_]
in python2.x, you'll have to use re.A
flag in py3k. You quite clearly have additional character in that videoid, i.e., hyphen. I've also removed redundant escape backslashes from the lookbehind.
Use the urlparse module instead of regex for such kind of things.
import urlparse
parsed_url = urlparse.urlparse(url)
if parsed_url.netloc.find('youtube.com') != -1 and parsed_url.path == '/watch':
video = urlparse.parse_qs(parsed_url.query).get('v', None)
if video is None:
video = urlparse.parse_qs(parsed_url.fragment.strip('!')).get('v', None)
if video is not None:
print video[0]
EDIT: Updated for the upcoming new youtube url format.
/(?:/v/|/watch\?v=|/watch#!v=)([A-Za-z0-9_-]+)/
Explain the RE
There are three alternate YouTube formats: /v/[ID]
and watch?v=
and the new AJAX watch#!v=
This RE captures all three. There is also new YouTube URL for user pages that is of the form /user/[user]?content={complex URI} This is not captured here by any regex...
I'd try this:
>>> import re
>>> a = re.compile(r'.*(\-\w+)$')
>>> a.search('http://www.youtube.com/watch?v=AIiMa2Fe-ZQ').group(1)
'-ZQ'
精彩评论