How do I lex this input?

I currently have a working, simple language implemented in Java using ANTLR. What I want to do is embed it in plain text, in a similar fashion to PHP.

For example:

Lorem ipsum dolor sit amet
<% print('consectetur adipiscing elit'); %>
Phasellus volutpat dignissim sapien.

I anticipate that the resulting token stream would look something like:

CDATA OPEN PRINT OPAREN APOS STRING APOS CPAREN SEMI CLOSE CDATA

How can I achieve this, or is there a better way?

There is no restriction on what might be outside the <% block. I assumed something like <% print('%>'); %>, as per Michael Mrozek's answer, would be possible, but outside of a situation like that, <% would always indicate the start of a code block.

Sample Implementation

I developed a solution based on ideas given in Michael Mrozek's answer, simulating Flex's start conditions using ANTLR's gated semantic predicates:

lexer grammar Lexer;

@members {
    boolean codeMode = false;
}

OPEN    : {!codeMode}?=> '<%' { codeMode = true; } ;
CLOSE   : {codeMode}?=> '%>' { codeMode = false;} ;
LPAREN  : {codeMode}?=> '(';
//etc.

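// '~' can only negate single characters, not the two-character literal '<%'.
// OPEN is listed first, so it takes precedence when the input is '<%'.
CHAR    : {!codeMode}?=> . ;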


parser grammar Parser;

options {
    tokenVocab = Lexer;
    output = AST;
}

tokens {
    VERBATIM;
}

program :
    (code | verbatim)+
    ;   

code :
    OPEN statement+ CLOSE -> statement+
    ;

verbatim :
    CHAR -> ^(VERBATIM CHAR)
    ;

"... but outside of a situation like that, <% would always indicate the start of a code block."

In that case, first scan the file for the embedded code blocks; once you have them, parse each block with a dedicated parser (without the noise before the <% and after the %> tags).

ANTLR has an option that lets the lexer match just a (small) part of an input file and ignore the rest. Note that you cannot create a "combined grammar" (parser and lexer in one) in that case. Here's how you can create such a "partial lexer":

// file EmbeddedCodeLexer.g
lexer grammar EmbeddedCodeLexer;

options{filter=true;} // <- enables the partial lexing!

EmbeddedCode
  :  '<%'                            // match an open tag
     (  String                       // ( match a string literal
     |  ~('%' | '\'')                //   OR match any char except `%` and `'`
     |  {input.LT(2) != '>'}?=> '%'  //   OR only match a `%` if `>` is not ahead of it
     )*                              // ) <- zero or more times
     '%>'                            // match a close tag
  ;

fragment
String
  :  '\'' ('\\' . | ~('\'' | '\\'))* '\''
  ;

If you now create a lexer from it:

java -cp antlr-3.2.jar org.antlr.Tool EmbeddedCodeLexer.g

and create a little test harness:

import org.antlr.runtime.*;

public class Main {
    public static void main(String[] args) throws Exception {
        String source = "Lorem ipsum dolor sit amet       \n"+
                "<%                                       \n"+
                "a = 2 > 1 && 10 % 3;                     \n"+
                "print('consectetur %> adipiscing elit'); \n"+
                "%>                                       \n"+
                "Phasellus volutpat dignissim sapien.     \n"+
                "foo <% more code! %> bar                 \n";
        ANTLRStringStream in = new ANTLRStringStream(source);
        EmbeddedCodeLexer lexer = new EmbeddedCodeLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        for(Object o : tokens.getTokens()) {
            System.out.println("=======================================\n"+
                    "EmbeddedCode = "+((Token)o).getText());
        }
    }
}

compile it all:

javac -cp antlr-3.2.jar *.java

and finally run the Main class by doing:

// *nix/MacOS
java -cp .:antlr-3.2.jar Main

// Windows
java -cp .;antlr-3.2.jar Main

it will produce the following output:

=======================================
EmbeddedCode = <%                                       
a = 2 > 1 && 10 % 3;                     
print('consectetur %> adipiscing elit'); 
%>
=======================================
EmbeddedCode = <% more code! %>
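
For comparison, the same "scan first, parse later" idea can be sketched in plain Java without ANTLR. This is only an illustration under assumptions: the class name EmbeddedBlockScanner and its scan method are hypothetical, and it only knows about single-quoted strings with backslash escapes, like the String fragment in the grammar above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of ANTLR: extracts <% ... %> blocks by hand.
class EmbeddedBlockScanner {

    // Returns every <% ... %> block found in source, tags included.
    // A %> inside a single-quoted string literal does not close the block.
    static List<String> scan(String source) {
        List<String> blocks = new ArrayList<>();
        int i = 0;
        while (true) {
            int start = source.indexOf("<%", i);
            if (start < 0) break;                  // no more open tags
            int j = start + 2;
            int end = -1;
            boolean inString = false;
            while (j < source.length()) {
                char c = source.charAt(j);
                if (inString) {
                    if (c == '\\') { j += 2; continue; }   // skip the escaped char
                    if (c == '\'') inString = false;
                } else if (c == '\'') {
                    inString = true;
                } else if (source.startsWith("%>", j)) {
                    end = j + 2;                   // position just past the close tag
                    break;
                }
                j++;
            }
            if (end < 0) break;                    // unterminated block: stop scanning
            blocks.add(source.substring(start, end));
            i = end;
        }
        return blocks;
    }

    public static void main(String[] args) {
        String source = "Lorem <% print('a %> b'); %> ipsum <% x = 10 % 3; %> dolor";
        for (String block : scan(source)) {
            System.out.println(block);
        }
    }
}
```

Each returned block could then be handed to the dedicated parser for the embedded language.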

The actual concept looks fine, although it's unlikely you'd have a PRINT token; the lexer would probably emit something like IDENTIFIER, and the parser would be responsible for figuring out that it's a function call (e.g. by looking for IDENTIFIER OPAREN ... CPAREN) and doing the appropriate thing.

As for how to do it, I don't know anything about ANTLR, but it probably has something like flex's start conditions. If so, you can have the INITIAL start condition do nothing but look for <%, which would switch to the CODE state where all the actual tokens are defined; then '%>' would switch back. In flex it would be:

%s CODE

%%

<INITIAL>{
    "<%"      {BEGIN(CODE);}
    .         {}
}

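 /* These rules have no start condition. Because CODE was declared with %s
    (inclusive), they are active in CODE, and in INITIAL as well; to restrict
    them to CODE only, declare it with %x and wrap them in <CODE>{}
  */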
"%>"          {BEGIN(INITIAL);}
"("           {return OPAREN;}
"'"           {return APOS;}
...

You need to be careful about things like matching %> in a context where it's not a closing marker, such as within a string. It's up to you whether to allow <% print('%>'); %>, but most likely you do want to.
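
To make the two-state idea concrete, here is a minimal hand-written sketch in Java rather than flex or ANTLR. Everything in it is an assumption for illustration: the class name ModeLexer is hypothetical, it returns token names as plain strings, and it only covers the handful of token types from the question. Unlike the APOS STRING APOS stream in the question, it consumes a whole quoted string as a single STRING token, which is what keeps a %> inside a string from closing the block.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written lexer; token names follow the question's example.
class ModeLexer {

    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        boolean codeMode = false;              // false = INITIAL, true = CODE
        int i = 0;
        while (i < src.length()) {
            if (!codeMode) {
                // INITIAL: everything up to the next <% is one CDATA token
                int open = src.indexOf("<%", i);
                int end = open < 0 ? src.length() : open;
                if (end > i) tokens.add("CDATA");
                if (open >= 0) {
                    tokens.add("OPEN");
                    codeMode = true;
                    i = open + 2;
                } else {
                    i = end;
                }
                continue;
            }
            // CODE: the real tokens of the embedded language
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (src.startsWith("%>", i)) {
                tokens.add("CLOSE");
                codeMode = false;
                i += 2;
            } else if (c == '(') {
                tokens.add("OPAREN"); i++;
            } else if (c == ')') {
                tokens.add("CPAREN"); i++;
            } else if (c == ';') {
                tokens.add("SEMI"); i++;
            } else if (c == '\'') {
                // Consume the whole literal so a %> inside it is never seen as CLOSE
                i++;
                while (i < src.length() && src.charAt(i) != '\'') {
                    i += (src.charAt(i) == '\\') ? 2 : 1;   // skip escaped chars
                }
                i++;                                        // closing quote
                tokens.add("STRING");
            } else if (Character.isLetter(c)) {
                while (i < src.length() && Character.isLetterOrDigit(src.charAt(i))) i++;
                tokens.add("IDENTIFIER");
            } else {
                throw new IllegalArgumentException("Unexpected character '" + c + "' at " + i);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", tokenize("Lorem <% print('%>'); %> ipsum")));
    }
}
```

Note that print comes out as IDENTIFIER here, leaving it to the parser to recognize the call.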