
Regular expressions - C# behaves differently than Perl / Python

Source: https://www.devze.com, 2023-04-01 23:38 (reposted from the web)

Under Python:

ttsiod@elrond:~$ python
>>> import re
>>> a='This is a test'
>>> re.sub(r'(.*)', 'George', a)
'George'

Under Perl:

ttsiod@elrond:~$ perl
$a="This is a test";
$a=~s/(.*)/George/;
print $a;
(Ctrl-D)

George

Under C#:

using System;
using System.Collections.Generic;
using System.Text;
using System.Threading;
using System.Text.RegularExpressions;

namespace IsThisACsharpBug
{
  class Program
  {
    static void Main(string[] args)
    {
        var matchPattern = "(.*)";
        var replacePattern = "George";
        var newValue = Regex.Replace("This is nice", matchPattern, replacePattern);
        Console.WriteLine(newValue);
    }
  }
}

Unfortunately, C# prints:

$ csc regexp.cs
Microsoft (R) Visual C# 2008 Compiler version 3.5.30729.5420
for Microsoft (R) .NET Framework version 3.5
Copyright (C) Microsoft Corporation. All rights reserved.

$ ./regexp.exe 
GeorgeGeorge

Is this a bug in the regular expression library of C#? Why does it print "George" twice, when Perl and Python print it only once?


In your example the difference seems to be in the semantics of the 'replace' function rather than in the regular expression processing itself.

.NET performs a "global" replace, i.e. it replaces all matches rather than just the first one.

Global Replace in Perl

(notice the small 'g' at the end of the =~s line)

$a="This is a test";
$a=~s/(.*)/George/g;
print $a;

which produces

GeorgeGeorge

Single Replace in .NET

var re = new Regex("(.*)");
var replacePattern = "George";
var newValue = re.Replace("This is nice", replacePattern, 1);
Console.WriteLine(newValue);

which produces

George

since it stops after the first replacement.
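
The same global-versus-single distinction can be sketched in Python, the language of the question's first example. This is only an illustration, assuming the standard re module: re.sub takes an optional count argument that caps the number of replacements, much like the third argument to Replace above. The uncapped result depends on the Python version (the question's interpreter printed 'George' once; as far as I know, Python 3.7 and later also replace the trailing empty match), so no output is claimed for it here.

```python
import re

text = "This is nice"

# count=1 stops after the first replacement, whatever the engine does
# with the empty match at the end of the string.
single = re.sub(r"(.*)", "George", text, count=1)
print(single)

# Without count every match is replaced; whether the trailing empty
# match is counted depends on the Python version in use.
uncapped = re.sub(r"(.*)", "George", text)
print(uncapped)
```

Capping the count sidesteps the question of how a given engine treats the empty match, at the cost of hiding it rather than fixing the pattern.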


It's not clear to me whether it's a bug or not, but if you change the .* to .+ it does what you want. I suspect it's the fact that (.*) matches an empty string which is confusing things.

This is backed up by the following code:

using System;
using System.Text.RegularExpressions;

class Test
{
    static void Main()
    {
        var match = Regex.Match("abc", "(.*)");
        while (match.Success)
        {
            Console.WriteLine(match.Length);
            match = match.NextMatch();
        }
    }
}

That prints out 3 then 0. Changing the pattern to "(.+)" makes it just print out 3.
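
For comparison, a rough Python counterpart of the loop above, assuming the standard re module; match_lengths is a hypothetical helper name introduced only for this sketch. It lists the length of every match the engine reports, which makes the extra empty match visible.

```python
import re

def match_lengths(pattern, text):
    """Return the length of every match the engine reports, in order."""
    return [m.end() - m.start() for m in re.finditer(pattern, text)]

star_lengths = match_lengths(r"(.*)", "abc")
plus_lengths = match_lengths(r"(.+)", "abc")

print(star_lengths)  # [3, 0]: the whole string, then an empty match at the end
print(plus_lengths)  # [3]: .+ cannot match the empty string
```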

One point to note is that this has nothing to do with C# as a language - only the .NET standard libraries. It's worth distinguishing between language and libraries - for example, you'll get exactly the same behaviour if you use the .NET standard library from F#, VB, C++/CLI etc.


Is this a bug in the regular expression library of C#

Maybe, but that doesn't really answer your question:

Regular expressions - C# behaves differently than Perl / Python

Different regular expression engines and implementations do behave differently. Sometimes this is explicit (and includes supporting different regular expression elements and syntax, e.g. using \( and \) for grouping rather than plain parentheses).

The book Mastering Regular Expressions (Jeffrey E.F. Friedl, O'Reilly) spends a lot of time explaining these differences (on top of the more fundamental differences between non-deterministic finite automata (NFA) and deterministic finite automata (DFA) approaches).

PS. As others note, .* also matches the empty string, so first all of your input string is matched and replaced, then the empty string at the end of the input is matched and replaced. If you want to match the whole, but possibly empty, input, include anchors for the beginning and end: ^(.*)$.
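
A minimal sketch of the anchored pattern, written in Python with the standard re module; replace_whole is a hypothetical name used only here. It assumes single-line input, since neither re.MULTILINE nor trailing newlines are considered.

```python
import re

def replace_whole(text):
    # Without re.MULTILINE, ^ matches only at the very start of the string,
    # so the empty match at the end cannot trigger a second replacement.
    return re.sub(r"^(.*)$", "George", text)

non_empty = replace_whole("This is a test")
empty = replace_whole("")

print(non_empty)  # George
print(repr(empty))  # 'George': the empty input is still matched once
```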


The replacement of "" is "George" (.* matches "")

and

"This is a start" == "This is a start" + "" 

So the regex matches "This is a start" and replaces it with "George". Its "cursor" is now at the end of the string, where it tries again to match the remaining string ("") against the pattern. That is a match too, so it adds a second "George". I don't know whether this is correct or not.

I'll add that the JavaScript engine seems to do the same thing (tested here: http://www.regular-expressions.info/javascriptexample.html ) under IE and Chrome.
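
The cursor walk described in this answer can be modeled with a small hand-written loop. This is a simplified Python sketch of the idea, not the actual .NET (or JavaScript) implementation; global_replace is a hypothetical helper, and the rule of stepping one character past an empty match is an assumption made to keep the loop from stalling.

```python
import re

def global_replace(pattern, replacement, text):
    """Replace every match, scanning left to right with an explicit cursor."""
    regex = re.compile(pattern)
    pieces = []
    pos = 0
    while pos <= len(text):
        m = regex.search(text, pos)
        if m is None:
            break
        pieces.append(text[pos:m.start()])  # unmatched text before the match
        pieces.append(replacement)
        if m.end() == m.start():
            # Empty match: copy one character and step past it so the
            # cursor always moves forward.
            if m.start() < len(text):
                pieces.append(text[m.start()])
            pos = m.start() + 1
        else:
            pos = m.end()
    pieces.append(text[pos:])  # whatever is left after the last match
    return "".join(pieces)

star_result = global_replace(r"(.*)", "George", "This is a start")
plus_result = global_replace(r"(.+)", "George", "This is a start")

print(star_result)  # GeorgeGeorge: full match, then the empty match at the end
print(plus_result)  # George: .+ finds nothing left to match at the end
```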
