Regex to strip line comments from C#_问答_开发者

I'm working on a routine to strip block or line comments from some C# code. I have looked at the other examples on the site, but haven't found the exact answer that I'm looking for.

I can match block comments (/* comment */) in their entirety using this regular expression with RegexOptions.Singleline:

(/\*[\w\W]*\*/)

And I can match line comments (// comment) in their entirety using this regular expression with RegexOptions.Multiline:

(//((?!\*/).)*)(?!\*/)[^\r\n]

Note: I'm using [^\r\n] instead of $ because $ is including \r in the match, too.

However, this doesn't quite work the way I want it to.

Here is my test code that I'm matching against:

// remove whole line comments
bool broken = false; // remove partial line comments
if (broken == true)
{
    return "BROKEN";
}
/* remove block comments
else
{
    return "FIXED";
} //开发者_开发技巧 do not remove nested comments */ bool working = !broken;
return "NO COMMENT";

The block expression matches

/* remove block comments
else
{
    return "FIXED";
} // do not remove nested comments */

which is fine and good, but the line expression matches

// remove whole line comments
// remove partial line comments

and

// do not remove nested comments

Also, if I do not have the */ positive lookahead in the line expression twice, it matches

// do not remove nested comments *

which I really don't want.

What I want is an expression that will match characters, starting with //, to the end of line, but does not contain */ between the // and end of line.

Also, just to satisfy my curiosity, can anyone explain why I need the lookahead twice? (//((?!\*/).)*)[^\r\n] and (//(.)*)(?!\*/)[^\r\n] will both include the *, but (//((?!\*/).)*)(?!\*/)[^\r\n] and (//((?!\*/).)*(?!\*/))[^\r\n] won't.

Both of your regular expressions (for block and line comments) have bugs. If you want I can describe the bugs, but I felt it’s perhaps more productive if I write new ones, especially because I’m intending to write a single one that matches both.

The thing is, every time you have /* and // and literal strings “interfering” with each other, it is always the one that starts first that takes precedence. That’s very convenient because that’s exactly how regular expressions work: find the first match first.

So let’s define a regular expression that matches each of those four tokens:

var blockComments = @"/\*(.*?)\*/";
var lineComments = @"//(.*?)\r?\n";
var strings = @"""((\\[^\n]|[^""\n])*)""";
var verbatimStrings = @"@(""[^""]*"")+";

To answer the question in the title (strip comments), we need to:

Replace the block comments with nothing
Replace the line comments with a newline (because the regex eats the newline)
Keep the literal strings where they are.

Regex.Replace can do this easily using a MatchEvaluator function:

string noComments = Regex.Replace(input,
    blockComments + "|" + lineComments + "|" + strings + "|" + verbatimStrings,
    me => {
        if (me.Value.StartsWith("/*") || me.Value.StartsWith("//"))
            return me.Value.StartsWith("//") ? Environment.NewLine : "";
        // Keep the literal strings
        return me.Value;
    },
    RegexOptions.Singleline);

I ran this code on all the examples that Holystream provided and various other cases that I could think of, and it works like a charm. If you can provide an example where it fails, I am happy to adjust the code for you.

You could tokenize the code with an expression like:

@(?:"[^"]*")+|"(?:[^"\n\\]+|\\.)*"|'(?:[^'\n\\]+|\\.)*'|//.*|/\*(?s:.*?)\*/

It would also match some invalid escapes/structures (eg. 'foo'), but will probably match all valid tokens of interest (unless I forgot something), thus working well for valid code.

Using it in a replace and capturing the parts you want to keep will give you the desired result. I.e:

static string StripComments(string code)
{
    var re = @"(@(?:""[^""]*"")+|""(?:[^""\n\\]+|\\.)*""|'(?:[^'\n\\]+|\\.)*')|//.*|/\*(?s:.*?)\*/";
    return Regex.Replace(code, re, "$1");
}

Example app:

using System;
using System.Text.RegularExpressions;

namespace Regex01
{
    class Program
    {
        static string StripComments(string code)
        {
            var re = @"(@(?:""[^""]*"")+|""(?:[^""\n\\]+|\\.)*""|'(?:[^'\n\\]+|\\.)*')|//.*|/\*(?s:.*?)\*/";
            return Regex.Replace(code, re, "$1");
        }

        static void Main(string[] args)
        {
            var input = "hello /* world */ oh \" '\\\" // ha/*i*/\" and // bai";
            Console.WriteLine(input);

            var noComments = StripComments(input);
            Console.WriteLine(noComments);
        }
    }
}

Output:

hello /* world */ oh " '\" // ha/*i*/" and // bai
hello  oh " '\" // ha/*i*/" and

Before you implement this, you will need to create test cases for it first

Simple comments /* */, //, ///
Multi line comments /* This\nis\na\ntest*/
Comments after line of code var a = "apple"; // test or /* test */
Comments within comments /* This // is a test /, or // This / is a test */
Simple non comments that look like comments, and appears in quotes var comment= "/* This is a test*/", or var url = "http://stackoverflow.com";
Complex non comments taht look like comments: var abc = @" this /* \n is a comment in quote\n*/", with or without spaces between " and /* or */ and "

There are probably more cases out there.

Once you have all of them, then you can create a parsing rule for each of them, or group some of them.

Solving this with regular expression alone probably will be very hard and error-prone, hard to test, and hard to maintain by you and other programmers.

I found this one at http://gskinner.com/RegExr/ (named ".Net Comments aspx")

(//[\t|\s|\w|\d|\.]*[\r\n|\n])|([\s|\t]*/\*[\t|\s|\w|\W|\d|\.|\r|\n]*\*/)|(\<[!%][ \r\n\t]*(--([^\-]|[\r\n]|-[^\-])*--[ \r\n\t%]*)\>)

When I test it it seems to remove all // comments and /* comments */ as it should, leaving those inside quotes behind.

Haven't tested it a lot, but seems to work pretty well (even though its a horrific monstrous line of regex).

for block Comments (/* ... */) you can use this exp:

/\*([^\*/])*\*/

it will work with multiline comments also.

Also see my project for C# code minification: CSharp-Minifier

Aside of removing of comments, spaces and and line breaks from code, at present time it's able to compress local variable names and do another minifications.