I am trying to parse C++/Java style source files and would like to isolate comments, string literals, and whitespace as tokens.
For whitespace and comments, the commonly suggested solution is (using an ANTLR grammar):
// WS comments*****************************
WS: (' '|'\n'| '\r'|'\t'|'\f' )+ {$channel=HIDDEN;};
ML_COMMENT: '/*' (options {greedy=false;}: .)* '*/' {$channel=HIDDEN;};
SL_COMMENT: '//' (options {greedy=false;}: .)* '\r'? '\n' {$channel=HIDDEN;};
But, the problem is that my source files also consist of string literals e.g.
printf(" /* something looks like comment and whitespace \n");
printf(" something looks like comment and whitespace */ \n");
Everything inside each pair of "" should be a single token, but my ANTLR lexer rules will obviously treat the following span as an ML_COMMENT token:
/* something looks like comment and whitespace \n");
printf(" something looks like comment and whitespace */
But I cannot create another lexer rule to define a token as something inside a pair of " (assuming the \" escape sequence is handled properly), because this would be considered as a string token erroneously:
/* comment...."comment that looks */ /*like a string literal"...more comment */
In short, the 2 pairs /**/ and "" will interfere with one another because each can contain the start of the other as its valid content. So how should we define a lexer grammar to handle both cases?
JavaMan wrote:
I am trying to parse C++/Java style source files and would like to isolate comment, string literal, and whitespace as tokens.
Shouldn't you match char literals as well? Consider:
char c = '"';
The double quote should not be considered as the start of a string literal!
JavaMan wrote:
In short, the 2 pairs /**/ and "" will interfere with one another.
Err, no. If a /* is "seen" first, it would consume all the way to the first */. For input like:
/* comment...."comment that looks like a string literal"...more comment */
this would mean the double quotes are also consumed. The same goes for string literals: when a double quote is seen first, any /* and/or */ would be consumed until the next (un-escaped) " is encountered.
Or did I misunderstand?
Note that you can drop the options {greedy=false;}: from your grammar before .* or .+, which are ungreedy by default.
Here's a way:
grammar T;
parse
: (t=.
{
if($t.type != OTHER) {
System.out.printf("\%-10s >\%s<\n", tokenNames[$t.type], $t.text);
}
}
)+
EOF
;
ML_COMMENT
: '/*' .* '*/'
;
SL_COMMENT
: '//' ~('\r' | '\n')*
;
STRING
: '"' (STR_ESC | ~('\\' | '"' | '\r' | '\n'))* '"'
;
CHAR
: '\'' (CH_ESC | ~('\\' | '\'' | '\r' | '\n')) '\''
;
SPACE
: (' ' | '\t' | '\r' | '\n')+
;
OTHER
: . // fall-through rule: matches any char if none of the above matched
;
fragment STR_ESC
: '\\' ('\\' | '"' | 't' | 'n' | 'r') // add more: Unicode escapes, ...
;
fragment CH_ESC
: '\\' ('\\' | '\'' | 't' | 'n' | 'r') // add more: Unicode escapes, Octal, ...
;
which can be tested with:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
String source =
"String s = \" foo \\t /* bar */ baz\";\n" +
"char c = '\"'; // comment /* here\n" +
"/* multi \"no string\"\n" +
" line */";
System.out.println(source + "\n-------------------------");
TLexer lexer = new TLexer(new ANTLRStringStream(source));
TParser parser = new TParser(new CommonTokenStream(lexer));
parser.parse();
}
}
If you run the class above, the following is printed to the console:
String s = " foo \t /* bar */ baz";
char c = '"'; // comment /* here
/* multi "no string"
line */
-------------------------
SPACE > <
SPACE > <
SPACE > <
STRING >" foo \t /* bar */ baz"<
SPACE >
<
SPACE > <
SPACE > <
SPACE > <
CHAR >'"'<
SPACE > <
SL_COMMENT >// comment /* here<
SPACE >
<
ML_COMMENT >/* multi "no string"
line */<
Basically your problem is: inside a string literal, comment markers (/* and //) must be ignored, and vice versa. IMO this can only be tackled by reading sequentially. When walking through the source file character by character, you can approach this as a state machine with the states Text, BlockComment, LineComment, and StringLiteral.
This is a hard problem to try to solve with regex, or even a grammar.
Mind you, any C/C++/C#/Java lexer also needs to handle this exact same problem. I'm quite sure it employs a state machine-like solution. So what I'm suggesting is, if you can, customize your lexer in this way.
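To make the state machine idea concrete, here is a minimal hand-written sketch in Java. It is only an illustration, not how ANTLR or any real compiler implements it: the class name CommentStringScanner, the scan method, and the "KIND:text" output format are made up for this example. It also adds a CharLiteral state so that '"' does not open a string, and it assumes a backslash always escapes exactly the next character.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; none of these names come from ANTLR or the question.
final class CommentStringScanner {

    // The four states named above, plus CHAR_LITERAL so that '"' does not start a string.
    enum State { TEXT, BLOCK_COMMENT, LINE_COMMENT, STRING_LITERAL, CHAR_LITERAL }

    // Returns comments and literals as "KIND:text"; ordinary code and whitespace are skipped.
    // Assumption: constructs left unterminated at end of input are dropped, except line comments.
    static List<String> scan(String src) {
        List<String> tokens = new ArrayList<>();
        State state = State.TEXT;
        int start = 0; // index where the current comment or literal began
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            char next = i + 1 < src.length() ? src.charAt(i + 1) : '\0';
            switch (state) {
                case TEXT:
                    // Only in TEXT can a comment or literal begin: whichever opener is seen first wins.
                    if (c == '/' && next == '*') { state = State.BLOCK_COMMENT; start = i; i += 2; }
                    else if (c == '/' && next == '/') { state = State.LINE_COMMENT; start = i; i += 2; }
                    else if (c == '"') { state = State.STRING_LITERAL; start = i; i++; }
                    else if (c == '\'') { state = State.CHAR_LITERAL; start = i; i++; }
                    else { i++; }
                    break;
                case BLOCK_COMMENT:
                    // Quotes are ordinary characters here; only */ ends the comment.
                    if (c == '*' && next == '/') {
                        i += 2;
                        tokens.add("ML_COMMENT:" + src.substring(start, i));
                        state = State.TEXT;
                    } else { i++; }
                    break;
                case LINE_COMMENT:
                    if (c == '\r' || c == '\n') {
                        // The line break is not part of the comment; TEXT consumes it next.
                        tokens.add("SL_COMMENT:" + src.substring(start, i));
                        state = State.TEXT;
                    } else { i++; }
                    break;
                case STRING_LITERAL:
                case CHAR_LITERAL: {
                    // Comment markers are ordinary characters here; only the matching quote ends it.
                    char quote = (state == State.STRING_LITERAL) ? '"' : '\'';
                    if (c == '\\') { i += 2; } // skip the escaped character
                    else if (c == quote) {
                        i++;
                        String kind = (state == State.STRING_LITERAL) ? "STRING:" : "CHAR:";
                        tokens.add(kind + src.substring(start, i));
                        state = State.TEXT;
                    } else { i++; }
                    break;
                }
            }
        }
        // A line comment may legitimately end at end of input without a line break.
        if (state == State.LINE_COMMENT) tokens.add("SL_COMMENT:" + src.substring(start));
        return tokens;
    }

    public static void main(String[] args) {
        String source =
            "printf(\" /* looks like a comment \\n\");\n" +
            "/* comment \"looks like */ /* a string\" more */\n";
        for (String token : scan(source)) {
            System.out.println(token);
        }
    }
}
```

For the two lines in main, this reports one STRING token followed by two ML_COMMENT tokens, which is exactly the split the question asks for. The design point is that openers are only recognized in the TEXT state, so a /* inside a string or a " inside a comment never triggers a state change.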