I'm writing a program to automate generating some C code (specifically, code that parses strings into enumerations with the same name). C's handling of strings is not that great, so some people have been nagging me to try Python.
I made a function that is supposed to remove C-style /* COMMENT */ and //COMMENT comments from a string. Here is the code:
def removeComments(string):
    re.sub(re.compile("/\*.*?\*/", re.DOTALL), "", string)  # remove all occurrences of streamed comments (/* COMMENT */) from string
    re.sub(re.compile("//.*?\n"), "", string)  # remove all occurrences of single-line comments (// COMMENT\n) from string
So I tried this code out.
str="/* spam * spam */ eggs"
removeComments(str)
print str
And it apparently did nothing.
Any suggestions as to what I've done wrong?
There's a saying I've heard a couple of times:
If you have a problem and you try to solve it with regex, you end up with two problems.
EDIT: Looking back at this years later (after a fair bit more parsing experience), I think regex may have been the right solution after all, and the simple regex used here "good enough".
I may not have emphasized this enough in the question: this was for a single, specific file that had no tricky situations.
I think it would be a lot less maintenance to keep the file being parsed simple enough for the regex (e.g. require that the file only use // single-line comments) than to complicate the regex into an unreadable symbol soup.
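For the record, a regex for that restricted case is tiny. This is only a sketch, assuming the file really does stick to // comments and never puts "//" inside a string literal:
import re

def strip_line_comments(source):
    # Good enough only because the input file is kept simple:
    # no block comments, no "//" inside string literals.
    return re.sub(r"//[^\n]*", "", source)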
What about "//comment-like strings inside quotes"?
OP is asking how to do it using regular expressions; so:
import re

def remove_comments(string):
    pattern = r"(\".*?\"|\'.*?\')|(/\*.*?\*/|//[^\r\n]*$)"
    # first group captures quoted strings (double or single)
    # second group captures comments (// single-line or /* multi-line */)
    regex = re.compile(pattern, re.MULTILINE | re.DOTALL)

    def _replacer(match):
        # if the 2nd group (capturing comments) is not None,
        # it means we have captured a non-quoted (real) comment string
        if match.group(2) is not None:
            return ""  # so we return empty to remove the comment
        else:  # otherwise, return the 1st group
            return match.group(1)  # the captured quoted string

    return regex.sub(_replacer, string)
This WILL remove:
/* multi-line comments */
// single-line comments
Will NOT remove:
String var1 = "this is /* not a comment. */";
char *var2 = "this is // not a comment, either.";
url = 'http://not.comment.com';
Note: This will also work for JavaScript source.
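For example, a quick check (assuming the remove_comments function above and import re):
code = 'String var1 = "this is /* not a comment. */"; /* but this is */'
print(remove_comments(code))  # the real comment is stripped; the quoted one survives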
re.sub returns a string, so changing your code to the following will give results:
def removeComments(string):
    string = re.sub(re.compile("/\*.*?\*/", re.DOTALL), "", string)  # remove all occurrences of streamed comments (/* COMMENT */) from string
    string = re.sub(re.compile("//.*?\n"), "", string)  # remove all occurrences of single-line comments (// COMMENT\n) from string
    return string
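With that change, the original test behaves as expected:
s = "/* spam * spam */ eggs"
print(removeComments(s))  # prints " eggs" (note the leading space left where the comment was)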
I would suggest using a REAL parser like SimpleParse or PyParsing. SimpleParse requires that you actually know EBNF, but is very fast. PyParsing has its own EBNF-like syntax, but it is adapted for Python and makes it a breeze to build powerfully accurate parsers.
Edit:
Here is an example of how easy it is to use PyParsing in this context:
>>> test = '/* spam * spam */ eggs'
>>> import pyparsing
>>> comment = pyparsing.nestedExpr("/*", "*/").suppress()
>>> print comment.transformString(test)
 eggs
Here is a more complex example using single and multi-line comments.
Before:
/*
* multiline comments
* abc 2323jklj
* this is the worst C code ever!!
*/
void
do_stuff ( int shoe, short foot ) {
/* this is a comment
* multiline again!
*/
exciting_function(whee);
} /* extraneous comment */
After:
>>> print comment.transformString(code)
void
do_stuff ( int shoe, short foot ) {
exciting_function(whee);
}
It leaves an extra newline wherever it stripped comments, but that could be addressed.
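One way to address that, as a rough sketch: make a second pass over the transformed text and collapse the blank lines the stripped comments leave behind (the exact cleanup is a matter of taste):
import re

stripped = comment.transformString(code)
# collapse runs of blank/whitespace-only lines left where comments used to be
print(re.sub(r"\n\s*\n", "\n", stripped))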
Following Jathanism's answer, here is another solution with pyparsing.
import pyparsing
test = """
/* Code my code
xx to remove comments in C++
or C or python */
include <iostream> // Some comment
int main (){
cout << "hello world" << std::endl; // comment
}
"""
commentFilter = pyparsing.cppStyleComment.suppress()
# To filter python style comment, use
# commentFilter = pyparsing.pythonStyleComment.suppress()
# To filter C style comment, use
# commentFilter = pyparsing.cStyleComment.suppress()
newtest = commentFilter.transformString(test)
print(newtest)
Produces the following output:
include <iostream>
int main (){
cout << "hello world" << std::endl;
}
You can also use pythonStyleComment, javaStyleComment, or cppStyleComment. I found it pretty useful.
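For instance, the same one-liner pattern works for Python source too (a small illustrative example):
import pyparsing

py_src = 'x = 1  # set x\nprint(x)  # show it\n'
print(pyparsing.pythonStyleComment.suppress().transformString(py_src))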
I would recommend you read this page, which has a quite detailed analysis of the problem and gives a good understanding of why your approach doesn't work: http://ostermiller.org/findcomment.html
Short version: The regex you are looking for is this:
(/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/)|(//.*)
This should match both types of comment blocks. If you are having trouble following it, read the page I linked.
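In Python, that pattern can be used like this (just a sketch wrapping the regex above in a raw string):
import re

COMMENT_RE = re.compile(r"(/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/)|(//.*)")

def strip_comments(text):
    return COMMENT_RE.sub("", text)

print(strip_comments("/* spam * spam */ eggs  // trailing"))  # -> " eggs  "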
I see several things you might want to revise.
First, strings in Python are immutable, so any change you make to a string inside a function cannot affect the string the caller passed in; you should return a new string instead. Furthermore, within the removeComments() function, you need to assign the value returned by re.sub() to a variable -- like any function that takes a string as an argument, re.sub() will not modify the string in place.
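To illustrate the point (hypothetical helper names):
import re

def broken(s):
    re.sub(r"/\*.*?\*/", "", s, flags=re.DOTALL)  # result is computed, then thrown away

def fixed(s):
    return re.sub(r"/\*.*?\*/", "", s, flags=re.DOTALL)

text = "/* spam */ eggs"
broken(text)
print(text)   # still "/* spam */ eggs": the caller's string is untouched
text = fixed(text)
print(text)   # " eggs"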
Second, I would echo what others have said about parsing C code. Regular expressions are not the best way to go here.
You are doing it wrong.
Regex is for Regular Languages, which C isn't.
mystring="""
blah1 /* comments with
multiline */
blah2
blah3
// double slashes comments
blah4 // some junk comments
"""
for s in mystring.split("*/"):
s=s[:s.find("/*")]
print s[:s.find("//")]
output
$ ./python.py
blah1
blah2
blah3
As noted in one of my other comments, comment nesting isn't really the problem (in C, comments don't nest, though a few compilers do support nested comments anyway). The problem is with things like string literals, which can contain the exact same character sequence as a comment delimiter without actually being one.
As Mike Graham said, the right tool for the job is a lexer. A parser is unnecessary and would be overkill, but a lexer is exactly the right thing. As it happens, I posted a (partial) lexer for C (and C++) earlier this morning. It doesn't attempt to correctly identify all lexical elements (i.e. all keywords and operators) but it's entirely sufficient for stripping comments. It won't do any good on the "using Python" front though, as it's written entirely in C (it predates my using C++ for much more than experimental code).
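To give a rough idea of the same approach in Python (only a sketch, not a port of that lexer and not a complete C lexer): walk the source one character at a time, copy string and character literals through untouched, and skip comment bodies.
def strip_comments(src):
    # Sketch only: copy string/character literals through verbatim (including
    # escape sequences) and drop the bodies of /* ... */ and // comments.
    out = []
    i, n = 0, len(src)
    while i < n:
        c = src[i]
        if c == '"' or c == "'":                  # string or character literal
            quote = c
            out.append(c)
            i += 1
            while i < n:
                out.append(src[i])
                if src[i] == '\\' and i + 1 < n:  # copy the escaped character too
                    out.append(src[i + 1])
                    i += 2
                    continue
                if src[i] == quote:               # closing quote: literal is done
                    i += 1
                    break
                i += 1
        elif src.startswith('/*', i):             # block comment
            end = src.find('*/', i + 2)
            i = n if end == -1 else end + 2
            out.append(' ')                       # a comment separates tokens
        elif src.startswith('//', i):             # line comment
            end = src.find('\n', i)
            i = n if end == -1 else end           # keep the newline itself
        else:
            out.append(c)
            i += 1
    return ''.join(out)

print(strip_comments('s = "a /* not a comment */ b"; /* real */ x;  // tail'))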
Just to add another regex, for the case where we have to remove anything between * and ; in Python:
data = re.sub(re.compile(r"\*.*?;", re.DOTALL), ' ', data)
The backslash before * escapes the meta character.
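For example (a toy illustration with made-up data):
import re

data = "keep this *remove this part; and keep the rest"
data = re.sub(re.compile(r"\*.*?;", re.DOTALL), ' ', data)
print(data)  # the "*...;" span has been replaced by a single space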
This program removes comments with // and /* */ from the given file:
#! /usr/bin/python3
import sys
import re

if len(sys.argv) != 2:
    exit("Syntax:python3 exe18.py inputfile.cc ")
else:
    print('The following files are given by you:', sys.argv[0], sys.argv[1])

with open(sys.argv[1], 'r') as ifile:
    newstring = re.sub(r'/\*.*?\*/', ' ', ifile.read(), flags=re.S)
with open(sys.argv[1], 'w') as ifile:
    ifile.write(newstring)
print('/* */ have been removed from the inputfile')

with open(sys.argv[1], 'r') as ifile:
    newstring1 = re.sub(r'//.*', ' ', ifile.read())
with open(sys.argv[1], 'w') as ifile:
    ifile.write(newstring1)
print('// have been removed from the inputfile')