Regex battle between maximum and minimum munge

开发者 (https://www.devze.com), 2022-12-16 00:07, source: web
Greetings, I have a file with the following strings:

string.Format("{0},{1}", "Having \"Two\" On The Same Line".Localize(), "Is Tricky For regex".Localize());

My goal is to get a match set with the two strings:

  • Having \"Two\" On The Same Line
  • Is Tricky For regex

My current regex looks like this:

private Regex CSharpShortRegex = new Regex("\"(?<constant>[^\"]+?)\".Localize\\(\\)");

My problem is the escaped quotes in the first string: the match stops at the escaped quote, so I get:

  • On The Same Line
  • Is Tricky For regex

However, attempting to ignore the escaped quotes does not work either, because it makes the regex greedy and I get:

  • Having \"Two\" On The Same Line".Localize(), "Is Tricky For regex"

We seem to be caught between maximum and minimum munge. Is there any hope? I have some backup plans. Can you run a regex backwards? That would make it easier, because I could start with the "()ezilacoL."

EDIT: To clarify, this is my lone edge case. Most of the time the string sits alone, like:

var myString = "Hot Patootie".Localize()


This one works for me:

\"((?:[^\\"]|(?:\\\"))*)\"\.Localize\(\)

Tested on http://www.regexplanet.com/simple/index.html against a number of strings with various escaped quotes.

Looks like most of us who answered this one had the same rough idea, so let me explain the approach (comments after #s):

\"             # We're looking for a string delimited by quotation marks
(              # Capture the contents of the quotation marks
  (?:          #   Start a non-capturing group
    [^\\"]     #     Either read a character that isn't a quote or a backslash
    |(?:\\\")  #     Or read in a backslash followed by a quote.
  )*           #   Keep reading
)              # End the capturing group
\"             # The string literal ends in a quotation mark
\.Localize\(\) # and ends with the literal '.Localize()', escaping ., ( and )

In a regular C# string literal the backslashes and quotes need a second level of escaping (messy):

\"((?:[^\\\\\"]|(?:\\\\\"))*)\"\\.Localize\\(\\)

Mark correctly points out that this one doesn't match escaped characters other than quotation marks. So here's a better version:

\"((?:[^\\"]|(?:\\")|(?:\\.))*)\"\.Localize\(\)

And its slashed-up equivalent:

\"((?:[^\\\\\"]|(?:\\\\\")|(?:\\\\.))*)\"\\.Localize\\(\\)

It works the same way, with one special case: if it encounters a backslash that is not part of \", it consumes the backslash and the following character and moves on.


Thinking about it, it's better to just consume two characters at every backslash, which is effectively Mark's answer, so I won't repeat it.
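
As a side note, the commented layout used in the explanation above can be handed to .NET almost unchanged by passing RegexOptions.IgnorePatternWhitespace. The sketch below is only an illustration of that idea, using the "consume two characters at every backslash" form. The class name CommentedPatternDemo and the helper Extract are made up for this example, and the sample line is the one from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class CommentedPatternDemo
{
    // Verbatim string: a doubled quote is one literal quote, and backslashes
    // reach the regex engine unchanged. With IgnorePatternWhitespace,
    // unescaped whitespace is skipped and # starts a comment.
    const string Pattern = @"
        ""                # opening quotation mark
        (                 # capture the contents of the literal
          (?:
            \\.           #   a backslash plus whatever character follows it
            | [^\\""]     #   or a character that is neither backslash nor quote
          )*
        )
        ""                # closing quotation mark
        \.Localize\(\)    # the literal text .Localize()
    ";

    // Hypothetical helper: returns the contents of every localized literal.
    public static List<string> Extract(string line)
    {
        var regex = new Regex(Pattern, RegexOptions.IgnorePatternWhitespace);
        var found = new List<string>();
        foreach (Match m in regex.Matches(line))
            found.Add(m.Groups[1].Value);
        return found;
    }

    static void Main()
    {
        string line = "string.Format(\"{0},{1}\", \"Having \\\"Two\\\" On The Same Line\".Localize(), \"Is Tricky For regex\".Localize());";
        foreach (string s in Extract(line))
            Console.WriteLine(s);
    }
}
```

Run against the line from the question, this should print the two strings the asker wanted, with the escaped quotes still in their escaped form. Note that whitespace inside a character class is still significant in this mode, which is why the class is written without spaces.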


Here's the regular expression you need:

@"""(?<constant>(\\.|[^""])*)""\.Localize\(\)"

A test program:

using System;
using System.IO;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        Regex CSharpShortRegex =
            new Regex(@"""(?<constant>(\\.|[^""])*)""\.Localize\(\)");

        foreach (string line in File.ReadAllLines("input.txt"))
            foreach (Match match in CSharpShortRegex.Matches(line))
                Console.WriteLine(match.Groups["constant"].Value);
    }
}

Output:

Having \"Two\" On The Same Line
Is Tricky For regex
Hot Patootie

Notice that I have used @"..." to avoid having to escape backslashes inside the regular expression. I think this makes it easier to read.


Update:

My original answer (below the horizontal rule) has a bug: regular-expression matchers attempt alternatives in left-to-right order. Having [^"] as the first alternative allows it to consume the backslash, but then the next character to be matched is a quote, which prevents the match from proceeding.

Incompatibility note: Given the pattern below, perl backtracks to the other alternative (the escaped quote) and successfully finds a match for the Having \"Two\" On The Same Line case.

The fix is to try an escaped quote first and then a non-quote:

var CSharpShortRegex =
  new Regex("\"(?<constant>(\\\\\"|[^\"])*)\"\\.Localize\\(\\)");

or if you prefer the at-string form:

var CSharpShortRegex =
  new Regex(@"""(?<constant>(\\""|[^""])*)""\.Localize\(\)");

--------

Allow for escapes:

private Regex CSharpShortRegex =
  new Regex("\"(?<constant>([^\"]|\\\\\")*)\"\\.Localize\\(\\)");

Applying one level of escaping to make the pattern easier to read, we get

"(?<constant>([^"]|\\")*)"\.Localize\(\)

That is, a string starts and ends with " characters, and everything between is either a non-quote or an escaped quote.


Looks like you're trying to parse code so one approach might be to evaluate the code on the fly:

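// Needs: using Microsoft.CSharp; using System.CodeDom.Compiler;
// Assumes `input` is a single C# expression with no trailing semicolon.
var cr = new CSharpCodeProvider().CompileAssemblyFromSource(
    new CompilerParameters { GenerateInMemory = true },
    "class x { public static string e() { return " + input + "; } }");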

var result = cr.CompiledAssembly.GetType("x")
    .GetMethod("e").Invoke(null, null) as string;

This way you could handle all kinds of other special cases (e.g. concatenated or verbatim strings) that would be extremely difficult to handle with regex.


new Regex(@"((([^@]|^|\n)""(?<constant>((\\.)|[^""])*)"")|(@""(?<constant>(""""|[^""])*)""))\s*\.\s*Localize\s*\(\s*\)", RegexOptions.Compiled);

takes care of both simple and @"" strings. It also takes into account escape sequences.
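
To see what that pattern captures in each case, here is a small sketch that runs it over one regular and one verbatim literal. The pattern is copied from above as-is. The class name LocalizeScanner, the helper Extract, and the two sample lines are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LocalizeScanner
{
    // Both alternatives capture into the same named group, "constant".
    // .NET allows a group name to be reused within one pattern.
    static readonly Regex Combined = new Regex(
        @"((([^@]|^|\n)""(?<constant>((\\.)|[^""])*)"")|(@""(?<constant>(""""|[^""])*)""))\s*\.\s*Localize\s*\(\s*\)",
        RegexOptions.Compiled);

    // Hypothetical helper: collects the raw contents of each matched literal.
    public static List<string> Extract(string source)
    {
        var found = new List<string>();
        foreach (Match m in Combined.Matches(source))
            found.Add(m.Groups["constant"].Value);
        return found;
    }

    static void Main()
    {
        // Two made-up source lines: a regular literal and a verbatim literal
        // with doubled quotes and extra whitespace around .Localize().
        string source =
            "var a = \"Hot Patootie\".Localize();\n" +
            "var b = @\"Say \"\"hi\"\"\" . Localize ( );";
        foreach (string s in Extract(source))
            Console.WriteLine(s);
    }
}
```

The captured text is the raw source form of the literal, so the verbatim example comes back as Say ""hi"" with its quotes still doubled. Turning that into the actual string value would be a separate unescaping step.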
