Grammar To Parse Multiline Comments and String Literals Simultaneously

I am trying to parse C++/Java-style source files and would like to isolate comments, string literals, and whitespace as tokens.

For whitespace and comments, the commonly suggested solution (as an ANTLR grammar) is:

// WS comments*****************************
WS: (' '|'\n'| '\r'|'\t'|'\f' )+ {$channel=HIDDEN;};
ML_COMMENT: '/*' (options {greedy=false;}: .)* '*/' {$channel=HIDDEN;};
SL_COMMENT: '//' (options {greedy=false;}: .)* '\r'? '\n' {$channel=HIDDEN;};

But, the problem is that my source files also consist of string literals e.g.

printf("   /* something looks like comment and whitespace \n");
printf("    something looks like comment and whitespace */ \n");

Everything inside each "" pair should be a single token, but my ANTLR lexer rules will instead match the following span as one ML_COMMENT token:

    /* something looks like comment and whitespace \n");
printf("    something looks like comment and whitespace */

But I cannot create another lexer rule to define a token as something inside a pair of " (assuming the \" escape sequence is handled properly), because this would be considered as a string token erroneously:

/*  comment...."comment that looks */   /*like a string literal"...more comment */

In short, the 2 pairs /**/ and "" will interfere with one another because each can contain the start of the other as its valid content. So how should we define a lexer grammar to handle both cases?

JavaMan wrote:

I am trying to parse C++/Java style source files and would like to isolate comment, string literal, and whitespace as tokens.

Shouldn't you match char literals as well? Consider:

char c = '"';

The double quote should not be considered as the start of a string literal!

JavaMan wrote:

In short, the 2 pairs /**/ and "" will interfere with one another.

Err, no. If a /* is "seen" first, it would consume all the way to the first */. For input like:

/*  comment...."comment that looks like a string literal"...more comment */

this would mean the double quotes are also consumed. The same for string literals: when a double quote is seen first, the /* and/or */ would be consumed until the next (un-escaped) " is encountered.

Or did I misunderstand?
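The "whichever opener is seen first wins" behaviour can be illustrated outside ANTLR as well. The following is only a rough sketch using java.util.regex, not the ANTLR lexer itself: the class name FirstOpenerWins, the method tokens, and the group names are made up for this illustration. The matcher scans left to right, and once one alternative has matched, its whole span is consumed before the next token is looked for.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: not part of ANTLR or of the grammar in this thread.
final class FirstOpenerWins {
    // One alternative per token kind. DOTALL lets '.' cross line breaks
    // inside a block comment.
    private static final Pattern TOKEN = Pattern.compile(
          "(?<ML>/\\*.*?\\*/)"                          // /* ... */ (reluctant)
        + "|(?<SL>//[^\\r\\n]*)"                        // // up to end of line
        + "|(?<STR>\"(?:\\\\.|[^\"\\\\\\r\\n])*\")"     // "..." with \-escapes
        + "|(?<CH>'(?:\\\\.|[^'\\\\\\r\\n])')",         // 'x' or '\x'
        Pattern.DOTALL);

    // Returns entries such as "STRING:\"abc\"" in order of appearance.
    static List<String> tokens(String src) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(src);
        while (m.find()) {
            String kind = m.group("ML") != null ? "ML_COMMENT"
                        : m.group("SL") != null ? "SL_COMMENT"
                        : m.group("STR") != null ? "STRING" : "CHAR";
            out.add(kind + ":" + m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // A comment opener inside a string literal stays part of the string.
        tokens("printf(\"   /* looks like a comment \\n\");").forEach(System.out::println);
        // A double quote inside a comment stays part of the comment.
        tokens("/* comment \"not a string */ /* either\" */").forEach(System.out::println);
    }
}
```

In both sample inputs the token that opens first swallows the other delimiter, which is the same thing the ANTLR lexer rules do.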

Note that you can drop the options {greedy=false;}: from your grammar, because .* and .+ are ungreedy by default.

Here's a way:

grammar T;

parse
  :  (t=. 
       {
         if($t.type != OTHER) {
           System.out.printf("\%-10s >\%s<\n", tokenNames[$t.type], $t.text);
         }
       }
     )+
     EOF
  ;

ML_COMMENT
  :  '/*' .* '*/'
  ;

SL_COMMENT
  :  '//' ~('\r' | '\n')*
  ;

STRING
  :  '"' (STR_ESC | ~('\\' | '"' | '\r' | '\n'))* '"'
  ;

CHAR
  :  '\'' (CH_ESC | ~('\\' | '\'' | '\r' | '\n')) '\''
  ;

SPACE
  :  (' ' | '\t' | '\r' | '\n')+
  ;

OTHER
  :  . // fall-through rule: matches any char if none of the above matched
  ;

fragment STR_ESC
  :  '\\' ('\\' | '"' | 't' | 'n' | 'r') // add more: Unicode escapes, ...
  ;

fragment CH_ESC
  :  '\\' ('\\' | '\'' | 't' | 'n' | 'r') // add more: Unicode escapes, Octal, ...
  ;

which can be tested with:

import org.antlr.runtime.*;

public class Main {
  public static void main(String[] args) throws Exception {
    String source = 
        "String s = \" foo \\t /* bar */ baz\";\n" +
        "char c = '\"'; // comment /* here\n" +
        "/* multi \"no string\"\n" +
        "   line */";
    System.out.println(source + "\n-------------------------");
    TLexer lexer = new TLexer(new ANTLRStringStream(source));
    TParser parser = new TParser(new CommonTokenStream(lexer));
    parser.parse();
  }
}

If you run the class above, the following is printed to the console:

String s = " foo \t /* bar */ baz";
char c = '"'; // comment /* here
/* multi "no string"
   line */
-------------------------

SPACE      > <
SPACE      > <
SPACE      > <
STRING     >" foo \t /* bar */ baz"<
SPACE      >
<
SPACE      > <
SPACE      > <
SPACE      > <
CHAR       >'"'<
SPACE      > <
SL_COMMENT >// comment /* here<
SPACE      >
<
ML_COMMENT >/* multi "no string"
   line */<

Basically your problem is: inside a string literal, comment openers (/* and //) must be ignored, and vice versa. IMO this can only be tackled by sequential reading. When walking through the source file character by character, you can treat this as a state machine with the states Text, BlockComment, LineComment, and StringLiteral.

This is a hard problem to try to solve with regex, or even a grammar.

Mind you, any C/C++/C#/Java lexer also needs to handle this exact same problem. I'm quite sure it employs a state machine-like solution. So what I'm suggesting is, if you can, customize your lexer in this way.
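A minimal sketch of such a state machine in Java might look like the following. Everything here is an assumption made for illustration: the class name StateScanner, the scan method, and the "KIND:text" output format are hypothetical, and a CharLiteral state is added to the four states named above so that '"' does not open a string. Unterminated block comments and literals are silently dropped in this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written scanner; not generated by or related to ANTLR.
final class StateScanner {
    enum State { TEXT, BLOCK_COMMENT, LINE_COMMENT, STRING_LITERAL, CHAR_LITERAL }

    // Returns one "KIND:text" entry per comment or literal, in source order.
    static List<String> scan(String src) {
        List<String> tokens = new ArrayList<>();
        State state = State.TEXT;
        int start = 0; // index where the current comment or literal began
        for (int i = 0; i < src.length(); i++) {
            char c = src.charAt(i);
            char next = i + 1 < src.length() ? src.charAt(i + 1) : '\0';
            switch (state) {
                case TEXT: // only in TEXT can a comment or literal start
                    if (c == '/' && next == '*') { state = State.BLOCK_COMMENT; start = i; i++; }
                    else if (c == '/' && next == '/') { state = State.LINE_COMMENT; start = i; i++; }
                    else if (c == '"') { state = State.STRING_LITERAL; start = i; }
                    else if (c == '\'') { state = State.CHAR_LITERAL; start = i; }
                    break;
                case BLOCK_COMMENT: // quotes are ignored until */
                    if (c == '*' && next == '/') {
                        i++;
                        tokens.add(state + ":" + src.substring(start, i + 1));
                        state = State.TEXT;
                    }
                    break;
                case LINE_COMMENT: // ends at the line break, which is not included
                    if (c == '\r' || c == '\n') {
                        tokens.add(state + ":" + src.substring(start, i));
                        state = State.TEXT;
                    }
                    break;
                case STRING_LITERAL:
                case CHAR_LITERAL: { // comment openers are ignored until the closing quote
                    char quote = state == State.STRING_LITERAL ? '"' : '\'';
                    if (c == '\\') {
                        i++; // skip the escaped character, e.g. \" or \\
                    } else if (c == quote) {
                        tokens.add(state + ":" + src.substring(start, i + 1));
                        state = State.TEXT;
                    }
                    break;
                }
            }
        }
        // A line comment may end at end of input without a trailing line break.
        if (state == State.LINE_COMMENT) tokens.add(state + ":" + src.substring(start));
        return tokens;
    }

    public static void main(String[] args) {
        String source =
            "String s = \" foo \\t /* bar */ baz\";\n" +
            "char c = '\"'; // comment /* here\n" +
            "/* multi \"no string\"\n" +
            "   line */";
        scan(source).forEach(System.out::println);
    }
}
```

Because a comment or literal can only begin in the TEXT state, a /* inside a string or a " inside a comment never triggers a transition, which is the whole point of tracking the state explicitly.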