Regex to ensure group match doesn't end with a specific character

Developer https://www.devze.com 2022-12-30 16:57 Source: Internet
I'm having trouble coming up with a regular expression to match a particular case. I have a list of tv shows in about 4 formats:

  • Name.Of.Show.S01E01
  • Name.Of.Show.0101
  • Name.Of.Show.01x01
  • Name.Of.Show.101

What I want to match is the show name. My main problem is that my regex matches the name of the show with a preceding '.'. My regex is the following:

"^([0-9a-zA-Z\.]+)(S[0-9]{2}E[0-9]{2}|[0-9]{4}|[0-9]{2}x[0-9]{2}|[0-9]{3})"

Some Examples:

>>> import re

>>> SHOW_INFO = re.compile(r"^([0-9a-zA-Z\.]+)(S[0-9]{2}E[0-9]{2}|[0-9]{4}|[0-9]{2}x[0-9]{2}|[0-9]{3})")
>>> match = SHOW_INFO.match("Name.Of.Show.S01E01")
>>> match.groups()
('Name.Of.Show.', 'S01E01')
>>> match = SHOW_INFO.match("Name.Of.Show.0101")
>>> match.groups()
('Name.Of.Show.0', '101')
>>> match = SHOW_INFO.match("Name.Of.Show.01x01")
>>> match.groups()
('Name.Of.Show.', '01x01')
>>> match = SHOW_INFO.match("Name.Of.Show.101")
>>> match.groups()
('Name.Of.Show.', '101')

So the question is how do I avoid the first group ending with a period? I realize I could simply do:

var.strip(".")

However, that doesn't handle the case of "Name.Of.Show.0101". Is there a way I could improve the regex to handle that case better?

Thanks in advance.


I think this will do:

>>> regex = re.compile(r'^([0-9a-z.]+)\.(S[0-9]{2}E[0-9]{2}|[0-9]{3,4}|[0-9]{2}x[0-9]{2})$', re.I)
>>> regex.match('Name.Of.Show.01x01').groups()
('Name.Of.Show', '01x01')
>>> regex.match('Name.Of.Show.101').groups()
('Name.Of.Show', '101')
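
To check this pattern against all four formats from the question rather than just two, a short loop is enough. This is only a sketch: the names SHOW_INFO, samples and results are illustrative choices, not part of the original answer.

```python
import re

# Pattern from the answer above: the dot before the episode part is matched
# outside the first group, and re.I lets [a-z] cover uppercase letters too.
SHOW_INFO = re.compile(
    r'^([0-9a-z.]+)\.(S[0-9]{2}E[0-9]{2}|[0-9]{3,4}|[0-9]{2}x[0-9]{2})$',
    re.I,
)

# The four filename formats listed in the question.
samples = [
    "Name.Of.Show.S01E01",
    "Name.Of.Show.0101",
    "Name.Of.Show.01x01",
    "Name.Of.Show.101",
]

results = [SHOW_INFO.match(s).groups() for s in samples]
for name, episode in results:
    print(name, episode)
```

Because the literal dot must sit directly before the episode part and the pattern is anchored with $, the first group cannot swallow a leading digit of "0101".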

Edited to add: Of course, if you're just trying to extract the different parts from trusted strings, you could use string methods instead:

>>> 'Name.Of.Show.101'.rpartition('.')
('Name.Of.Show', '.', '101')


So the only real restriction on the last group is that it doesn’t contain a dot? Easy:

^(.*?)(\.[^.]+)$

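The first group matches anything, non-greedily. The important part is the second group, which starts with a dot and then matches non-dot characters up to the end of the string, so it can only be the final dot-separated segment. Note that the dot itself is captured as part of the second group.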

This works with all your test cases.
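
As a rough illustration of how the two groups come out, here is the pattern applied to the troublesome "Name.Of.Show.0101" case. The names LAST_PART, name and tail are made up for this sketch.

```python
import re

# Pattern from the answer above: a lazy first group, then a dot followed by
# a run of non-dot characters that must reach the end of the string.
LAST_PART = re.compile(r'^(.*?)(\.[^.]+)$')

name, tail = LAST_PART.match("Name.Of.Show.0101").groups()
print(name)              # Name.Of.Show
print(tail)              # .0101 (the dot is inside the second group)
print(tail.lstrip("."))  # 0101
```

If the leading dot in the second group is unwanted, it can be stripped afterwards as shown, or the dot can be moved outside the parentheses.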


It seems like the problem is that you haven't specified that the period before the last group is required, so something like ^([0-9a-zA-Z\.]+)\.(S[0-9]{2}E[0-9]{2}|[0-9]{4}|[0-9]{2}x[0-9]{2}|[0-9]{3}) might work.
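
A small before-and-after comparison on the "0101" case suggests the required dot is indeed what makes the difference. The names EPISODE, original and with_dot are assumptions introduced for this sketch only.

```python
import re

# Episode alternatives exactly as in the question's regex.
EPISODE = r'(S[0-9]{2}E[0-9]{2}|[0-9]{4}|[0-9]{2}x[0-9]{2}|[0-9]{3})'

# The question's pattern, and the same pattern with a required dot
# between the show name and the episode part.
original = re.compile(r'^([0-9a-zA-Z\.]+)' + EPISODE)
with_dot = re.compile(r'^([0-9a-zA-Z\.]+)\.' + EPISODE)

title = "Name.Of.Show.0101"
before = original.match(title).groups()
after = with_dot.match(title).groups()
print(before)  # ('Name.Of.Show.0', '101')
print(after)   # ('Name.Of.Show', '0101')
```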


I believe this will do what you want, provided case-insensitive matching is enabled (the character class lists only lowercase letters):

^([0-9a-z\.]+)\.(?:S[0-9]{2}E[0-9]{2}|[0-9]{3,4}|[0-9]{2}(?:x[0-9]+)?)$

I tested this against the following list of shows:

  • 30.Rock.S01E01
  • The.Office.0101
  • Lost.01x01
  • How.I.Met.Your.Mother.101

If those 4 cases are representative of the types of files you have, then that regex should place the show title in its own capture group and toss away the rest. This filter is, perhaps, a bit more restrictive than some others, but I'm a big fan of matching exactly what you need.
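
For anyone wanting to reproduce that test in Python, something along these lines should work. It assumes the pattern is compiled with re.IGNORECASE, since the character class only lists lowercase letters; TITLE_ONLY, shows and titles are names chosen for this sketch.

```python
import re

# Pattern from the answer above. The episode part is a non-capturing group,
# so the show title is the only captured value.
TITLE_ONLY = re.compile(
    r'^([0-9a-z\.]+)\.(?:S[0-9]{2}E[0-9]{2}|[0-9]{3,4}|[0-9]{2}(?:x[0-9]+)?)$',
    re.IGNORECASE,  # assumption: needed for "Rock", "The", "S01E01", etc.
)

# The four shows listed in the answer.
shows = [
    "30.Rock.S01E01",
    "The.Office.0101",
    "Lost.01x01",
    "How.I.Met.Your.Mother.101",
]

titles = [TITLE_ONLY.match(s).group(1) for s in shows]
print(titles)
```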


If the last part never contains a dot: ^(.*)\.([^\.]+)$
