C# Regex for Movie Filename

Developer https://www.devze.com 2023-02-10 08:39 Source: Internet
I have been trying, unsuccessfully, to use a C# Regex to remove certain strings from a movie name.

Examples of the file names I'm working with are:

EuroTrip (2004) [SD]

Event Horizon (1997) [720]

Fast & Furious (2009) [1080p]

Star Trek (2009) [Unknown]

I'd like to remove anything in square brackets or parentheses (including the brackets themselves).

So far I'm using:

movieTitleToFetch = Regex.Replace(movieTitleToFetch, "([*\\(\\d{4}\\)])", "");

This seems to remove the year and parentheses OK, but I can't figure out how to remove the square brackets and their content without affecting other parts. I've had mixed results; the closest attempt has been:

movieTitleToFetch = Regex.Replace(movieTitleToFetch, "([?\\[+A-Z+\\]])", "");

Which left me with:

urorip (2004)

Instead of:

EuroTrip (2004) [SD]

Any whitespace left at the ends is OK, as I will just call

movieTitleToFetch = movieTitleToFetch.Trim();

at the end.

Thanks in advance,

Alex


This regex pattern should work OK, though it may need a bit of tweaking:

"[\[\(].+?[\]\)]"

Regex.Replace(movieTitleToFetch, @"[\[\(].+?[\]\)]", "");

This should match anything from either "[" or "(" until the next occurrence of "]" or ")".

If that does not work, try removing the escape characters for the parentheses (inside a character class they are literal either way), like so:

Regex.Replace(movieTitleToFetch, @"[\[(].+?[\])]", "");
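To see what this pattern does with the file names from the question, here is a minimal sketch. The class name BracketStripper and the method name Strip are hypothetical, chosen only for illustration; the pattern, Regex.Replace and Trim are the ones already mentioned in this thread.

```csharp
using System.Text.RegularExpressions;

public static class BracketStripper
{
    // "[" or "(", then as few characters as possible, then "]" or ")".
    private static readonly Regex Bracketed = new Regex(@"[\[\(].+?[\]\)]");

    public static string Strip(string name)
    {
        // Remove every bracketed group, then trim the whitespace left behind.
        return Bracketed.Replace(name, "").Trim();
    }
}
```

With this sketch, BracketStripper.Strip("EuroTrip (2004) [SD]") gives "EuroTrip" and BracketStripper.Strip("Fast & Furious (2009) [1080p]") gives "Fast & Furious".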


@Craigt is pretty much spot on, but it's possibly cleaner to ensure that the brackets are matched, so that "[" only closes with "]" and "(" only with ")":

([\[].*?[\]]|[\(].*?[\)]) 
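The difference only shows up on names with mismatched brackets. Below is a small sketch of this paired pattern in use; PairedBracketStripper and Strip are made-up names for the example, not part of any library.

```csharp
using System.Text.RegularExpressions;

public static class PairedBracketStripper
{
    // Either "[...]" or "(...)": each alternative closes with its own bracket type.
    private static readonly Regex Paired = new Regex(@"([\[].*?[\]]|[\(].*?[\)])");

    public static string Strip(string name)
    {
        // Remove each properly paired group, then trim leftover whitespace.
        return Paired.Replace(name, "").Trim();
    }
}
```

PairedBracketStripper.Strip("Star Trek (2009) [Unknown]") gives "Star Trek", while a name such as "Odd [name) here" is left untouched because no pair closes with its own bracket type. The any-bracket pattern "[\[\(].+?[\]\)]" would remove "[name)" from that string.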


I know I'm late to this thread, but I wrote a simple algorithm to sanitize the filenames of downloaded movies.

It runs these steps:

  1. Removes everything in brackets (if it finds a year there, it keeps that information).
  2. Removes a list of commonly used words (720p, bdrip, h264 and so on).
  3. Assumes there may be language information in the title and removes it when it appears at the end of the remaining string (before the special words).
  4. If no year was found in parentheses, looks for one at the end of the remaining string (as for languages).

Along the way it replaces dots and underscores with spaces, so the title is ready to be used, for example, as a query for a search API.

Here's the test in xUnit (I used mostly Italian titles to test it):

using Grappachu.Movideo.Core.Helpers.TitleCleaner;
using SharpTestsEx;
using Xunit;

namespace Grappachu.MoVideo.Test
{
    public class TitleCleanerTest
    {
        [Theory]
        [InlineData("Avengers.Confidential.La.Vedova.Nera.E.Punisher.2014.iTALiAN.Bluray.720p.x264 - BG.mkv",
            "Avengers Confidential La Vedova Nera E Punisher", 2014)]
        [InlineData("Fuck You, Prof! (2013) BDRip 720p HEVC ITA GER AC3 Multi Sub PirateMKV.mkv",
            "Fuck You, Prof!", 2013)]
        [InlineData("Il Libro della Giungla(2016)(BDrip1080p_H264_AC3 5.1 Ita Eng_Sub Ita Eng)by siste82.avi",
            "Il Libro della Giungla", 2016)]
        [InlineData("Il primo dei bugiardi (2009) [Mux by Little-Boy]", "Il primo dei bugiardi", 2009)]
        [InlineData("Il.Viaggio.Di.Arlo-The.Good.Dinosaur.2015.DTS.ITA.ENG.1080p.BluRay.x264-BLUWORLD",
            "il viaggio di arlo", 2015)]
        [InlineData("La Mafia Uccide Solo D'estate 2013 .avi",
            "La Mafia Uccide Solo D'estate", 2013)]
        [InlineData("Ip.Man.3.2015.iTA.AC3.5.1.448.Chi.Aac.BluRay.m1080p.x264.Sub.[scambiofile.info].mkv",
            "Ip Man 3", 2015)]
        [InlineData("Inferno.2016.BluRay.1080p.AC3.ITA.AC3.ENG.Subs.x264-WGZ.mkv",
            "Inferno", 2016)]
        [InlineData("Ghostbusters.2016.iTALiAN.BDRiP.EXTENDED.XviD-HDi.mp4",
            "Ghostbusters", 2016)]
        [InlineData("Transcendence.mkv", "Transcendence", null)]
        [InlineData("Being Human (Forsyth, 1994).mkv", "Being Human", 1994)]
        public void Clean_should_return_title_and_year_when_possible(string filename, string title, int? year)
        {
            var res = MovieTitleCleaner.Clean(filename);

            res.Title.ToLowerInvariant().Should().Be.EqualTo(title.ToLowerInvariant());
            res.Year.Should().Be.EqualTo(year);
        }
    }
}

and the first version of the code:

using System;
using System.Globalization;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions; 

namespace Grappachu.Movideo.Core.Helpers.TitleCleaner
{
    public class MovieTitleCleanerResult
    {
        public string Title { get; set; }
        public int? Year { get; set; }
        public string SubTitle { get; set; }
    }

    public class MovieTitleCleaner
    {
        private const string SpecialMarker = "§=§";
        private static readonly string[] ReservedWords;
        private static readonly string[] SpaceChars;
        private static readonly string[] Languages;

        static MovieTitleCleaner()
        {
            ReservedWords = new[]
            {
                SpecialMarker, "hevc", "bdrip", "Bluray", "x264", "h264", "AC3", "DTS", "480p", "720p", "1080p"
            };
            var cultures = CultureInfo.GetCultures(CultureTypes.AllCultures);
            var l = cultures.Select(x => x.EnglishName).ToList();
            l.AddRange(cultures.Select(x => x.ThreeLetterISOLanguageName));
            Languages = l.Distinct().ToArray();


            SpaceChars = new[] {".", "_", " "};
        }


        public static MovieTitleCleanerResult Clean(string filename)
        {
            var temp = Path.GetFileNameWithoutExtension(filename);
            int? maybeYear = null;

            // Remove what's inside brackets trying to keep year info.
            temp = RemoveBrackets(temp, '{', '}', ref maybeYear);
            temp = RemoveBrackets(temp, '[', ']', ref maybeYear);
            temp = RemoveBrackets(temp, '(', ')', ref maybeYear);

            // Remove special markers (codecs, formats, etc.)
            var tokens = temp.Split(SpaceChars, StringSplitOptions.RemoveEmptyEntries);
            var title = string.Empty;
            for (var i = 0; i < tokens.Length; i++)
            {
                var tok = tokens[i];
                if (ReservedWords.Any(x => string.Equals(x, tok, StringComparison.OrdinalIgnoreCase)))
                {
                    if (title.Length > 0)
                        break;
                }
                else
                {
                    title = string.Join(" ", title, tok).Trim();
                }
            }
            temp = title;

            // Remove language info found just before the special markers (should not remove "English" if it's inside the title)
            tokens = temp.Split(SpaceChars, StringSplitOptions.RemoveEmptyEntries);
            for (var i = tokens.Length - 1; i >= 0; i--)
            {
                var tok = tokens[i];
                if (Languages.Any(x => string.Equals(x, tok, StringComparison.OrdinalIgnoreCase)))
                    tokens[i] = string.Empty;
                else
                    break;
            }
            title = string.Join(" ", tokens).Trim();


            // If year is not found inside parenthesis try to catch at the end, just after the title
            if (!maybeYear.HasValue)
            {
                var resplit = title.Split(SpaceChars, StringSplitOptions.RemoveEmptyEntries);
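                // LastOrDefault avoids an exception when nothing is left of the title.
                var last = resplit.LastOrDefault();
                if (last != null && LooksLikeYear(last))
                {
                    maybeYear = int.Parse(last);
                    // Drop only the trailing token; Replace would also remove the same digits elsewhere in the title.
                    title = string.Join(" ", resplit.Take(resplit.Length - 1));
                }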
            }


            // TODO: review this. when there's one dash separates main title from subtitle 
            var res = new MovieTitleCleanerResult();
            res.Year = maybeYear;
            if (title.Count(x => x == '-') == 1)
            {
                var sp = title.Split('-');
                res.Title = sp[0];
                res.SubTitle = sp[1];
            }
            else
            {
                res.Title = title;
            }


            return res;
        }

        private static string RemoveBrackets(string inputString, char openChar, char closeChar, ref int? maybeYear)
        {
            var str = inputString;
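            // Continue only while an opening bracket is followed by a closing one. The original
            // "> 0" checks skipped a bracket at index 0 and could loop forever on input like "a) (b".
            int open;
            while ((open = str.IndexOf(openChar)) >= 0 && str.IndexOf(closeChar, open) > open)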
            {
                var dataGraph = str.GetBetween(openChar.ToString(), closeChar.ToString());
                if (LooksLikeYear(dataGraph))
                {
                    maybeYear = int.Parse(dataGraph);
                }
                else
                {
                    var parts = dataGraph.Split(SpaceChars, StringSplitOptions.RemoveEmptyEntries);
                    foreach (var part in parts)
                        if (LooksLikeYear(part))
                        {
                            maybeYear = int.Parse(part);
                            break;
                        }
                }
                str = str.ReplaceBetween(openChar, closeChar, string.Format(" {0} ", SpecialMarker));
            }
            return str;
        }

        private static bool LooksLikeYear(string dataRound)
        {
            // Anchor both ends so that only a bare four-digit year matches and int.Parse cannot fail.
            return Regex.IsMatch(dataRound, "^(19|20)[0-9][0-9]$");
        }
    }


    public static class StringUtils
    {
        public static string GetBetween(this string src, string a, string b,
            StringComparison comparison = StringComparison.Ordinal)
        {
            var idxStr = src.IndexOf(a, comparison);
            if (idxStr < 0)
                return src;
            // Look for the closing token only after the opening one, so a stray
            // closing token earlier in the string is ignored.
            var idxEnd = src.IndexOf(b, idxStr + a.Length, comparison);
            if (idxEnd < 0)
                return src;
            return src.Substring(idxStr + a.Length, idxEnd - idxStr - a.Length);
        }

        private static void Swap<T>(ref T idxStr, ref T idxEnd)
        {
            var temp = idxEnd;
            idxEnd = idxStr;
            idxStr = temp;
        }

        public static string ReplaceBetween(this string s, char begin, char end, string replacement = null)
        {
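            // Regex.Escape is safer than prefixing a backslash, which would turn a letter or digit into an escape sequence.
            var regex = new Regex(Regex.Escape(begin.ToString()) + ".*?" + Regex.Escape(end.ToString()));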
            return regex.Replace(s, replacement ?? string.Empty);
        }
    }
}


This does the trick:

@"(\[[^\]]*\])|(\([^\)]*\))"

It removes anything from "[" to the next "]" and anything from "(" to the next ")".


Can you just use:

string MovieTitle="Star Trek (2009) [Unknown]";
movieTitleToFetch= MovieTitle.IndexOf('(')>MovieTitle.IndexOf('[')?
                    MovieTitle.Substring(0,MovieTitle.IndexOf('[')):
                    MovieTitle.Substring(0,MovieTitle.IndexOf('('));
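Note that the conditional above assumes both bracket types are present: IndexOf returns -1 for a missing character, and Substring(0, -1) throws an ArgumentOutOfRangeException. A possible way around that, sketched here with the hypothetical names TitleCutter and CutAtFirstBracket, is to look for whichever bracket comes first with IndexOfAny:

```csharp
public static class TitleCutter
{
    public static string CutAtFirstBracket(string name)
    {
        // Index of the first '(' or '[', or -1 when neither is present.
        int cut = name.IndexOfAny(new[] { '(', '[' });

        // Keep the whole name when there is no bracket; trim trailing whitespace either way.
        return (cut >= 0 ? name.Substring(0, cut) : name).Trim();
    }
}
```

TitleCutter.CutAtFirstBracket("Star Trek (2009) [Unknown]") gives "Star Trek", and a name with only one bracket type, such as "EuroTrip [SD]", gives "EuroTrip". Like the snippet above, it drops everything after the first bracket, so it only fits names where the title comes first.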


Can't we use this instead?

if(movieTitleToFetch.Contains("("))
         movieTitleToFetch=movieTitleToFetch.Substring(0,movieTitleToFetch.IndexOf("("));

The code above returns the movie title (with a trailing space, which Trim() removes) for these strings:

EuroTrip (2004) [SD]

Event Horizon (1997) [720]

Fast & Furious (2009) [1080p]

Star Trek (2009) [Unknown]

If there is a case where you have no year but only the type, i.e.:

EuroTrip [SD]

Event Horizon [720]

Fast & Furious [1080p]

Star Trek [Unknown]

then use this

if(movieTitleToFetch.Contains("("))
         movieTitleToFetch=movieTitleToFetch.Substring(0,movieTitleToFetch.IndexOf("("));
else if(movieTitleToFetch.Contains("["))
         movieTitleToFetch=movieTitleToFetch.Substring(0,movieTitleToFetch.IndexOf("["));


I came up with .+\s(?<year>\(\d{4}\))\s(?<format>\[\w+\]), which matches all of your examples and captures the year and format as named groups to help you replace them.

This pattern translates as:

Any character, one or more repetitions
Whitespace
Literal '(' followed by 4 digits followed by literal ')' (year)
Whitespace
Literal '[' followed by alphanumeric, one or more repetitions, followed by literal ']' (format)
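As a rough sketch of how the named groups could be used, the helper below pulls the title, year and format out of a name. MovieNameParser and TryParse are hypothetical names for this example; it assumes the name has the "Title (year) [format]" shape and reports failure otherwise.

```csharp
using System.Text.RegularExpressions;

public static class MovieNameParser
{
    private static readonly Regex Pattern =
        new Regex(@".+\s(?<year>\(\d{4}\))\s(?<format>\[\w+\])");

    // Returns false when the name does not contain "(year) [format]" after some text.
    public static bool TryParse(string name, out string title, out string year, out string format)
    {
        title = year = format = null;
        var match = Pattern.Match(name);
        if (!match.Success)
            return false;

        // Everything before the year group is taken as the title.
        title = name.Substring(0, match.Groups["year"].Index).Trim();
        // The groups capture their brackets too, so strip them off.
        year = match.Groups["year"].Value.Trim('(', ')');
        format = match.Groups["format"].Value.Trim('[', ']');
        return true;
    }
}
```

For "Event Horizon (1997) [720]" this yields the title "Event Horizon", the year "1997" and the format "720". A name without a year, such as "EuroTrip [SD]", does not match, so the method returns false.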
