I currently have a working, simple language implemented in Java using ANTLR. What I want to do is embed it in plain text, in a similar fashion to PHP.
For example:
Lorem ipsum dolor sit amet
<% print('consectetur adipiscing elit'); %>
Phasellus vol开发者_StackOverflow社区utpat dignissim sapien.
I anticipate that the resulting token stream would look something like:
CDATA OPEN PRINT OPAREN APOS STRING APOS CPAREN SEMI CLOSE CDATA
How can I achieve this, or is there a better way?
There is no restriction on what might be outside the <%
block. I assumed something like <% print('%>'); %>
, as per Michael Mrozek's answer, would be possible, but outside of a situation like that, <%
would always indicate the start of a code block.
Sample Implementation
I developed a solution based on ideas given in Michael Mrozek's answer, simulating Flex's start conditions using ANTLR's gated semantic predicates:
lexer grammar Lexer;
@members {
boolean codeMode = false;
}
OPEN : {!codeMode}?=> '<%' { codeMode = true; } ;
CLOSE : {codeMode}?=> '%>' { codeMode = false;} ;
LPAREN : {codeMode}?=> '(';
//etc.
CHAR : {!codeMode}?=> ~('<%');
parser grammar Parser;
options {
tokenVocab = Lexer;
output = AST;
}
tokens {
VERBATIM;
}
program :
(code | verbatim)+
;
code :
OPEN statement+ CLOSE -> statement+
;
verbatim :
CHAR -> ^(VERBATIM CHAR)
;
but outside of a situation like that, <% would always indicate the start of a code block.
In that case, first scan the file for your embedded code, and once you have those, parse your embedded code with a dedicated parser (without the noise before the <%
and after the %>
tags).
ANTLR has the option to let the lexer parse just a (small) part of an input file and ignore the rest. Note that you cannot create a "combined grammar" (parser and lexer in one) in that case. Here's how you can create such a "partial lexer":
// file EmbeddedCodeLexer.g
lexer grammar EmbeddedCodeLexer;
options{filter=true;} // <- enables the partial lexing!
EmbeddedCode
: '<%' // match an open tag
( String // ( match a string literal
| ~('%' | '\'') // OR match any char except `%` and `'`
| {input.LT(2) != '>'}?=> '%' // OR only match a `%` if `>` is not ahead of it
)* // ) <- zero or more times
'%>' // match a close tag
;
fragment
String
: '\'' ('\\' . | ~('\'' | '\\'))* '\''
;
If you now create a lexer from it:
java -cp antlr-3.2.jar org.antlr.Tool EmbeddedCodeLexer.g
and create a little test harness:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
String source = "Lorem ipsum dolor sit amet \n"+
"<% \n"+
"a = 2 > 1 && 10 % 3; \n"+
"print('consectetur %> adipiscing elit'); \n"+
"%> \n"+
"Phasellus volutpat dignissim sapien. \n"+
"foo <% more code! %> bar \n";
ANTLRStringStream in = new ANTLRStringStream(source);
EmbeddedCodeLexer lexer = new EmbeddedCodeLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
for(Object o : tokens.getTokens()) {
System.out.println("=======================================\n"+
"EmbeddedCode = "+((Token)o).getText());
}
}
}
compile it all:
javac -cp antlr-3.2.jar *.java
and finally run the Main class by doing:
// *nix/MacOS
java -cp .:antlr-3.2.jar Main
// Windows
java -cp .;antlr-3.2.jar Main
it will produce the following output:
=======================================
EmbeddedCode = <%
a = 2 > 1 && 10 % 3;
print('consectetur %> adipiscing elit');
%>
=======================================
EmbeddedCode = <% more code! %>
The actual concept looks fine, although it's unlikely you'd have a PRINT token; the lexer would probably emit something like IDENTIFIER, and the parser would be responsible for figuring out that it's a function call (e.g. by looking for IDENTIFIER OPAREN ... CPAREN
) and doing the appropriate thing.
As for how to do it, I don't know anything about ANTLR, but it probably has something like flex's start conditions. If so, you can have the INITIAL
start condition do nothing but look for <%
, which would switch to the CODE
state where all the actual tokens are defined; then '%>' would switch back. In flex it would be:
%s CODE
%%
<INITIAL>{
"<%" {BEGIN(CODE);}
. {}
}
/* All these are implicitly in CODE because it was declared %s,
but you could wrap it in <CODE>{} too
*/
"%>" {BEGIN(INITIAL);}
"(" {return OPAREN;}
"'" {return APOS;}
...
You need to be careful about things like matching %>
in a context where it's not a closing marker, like within a string; it's up to you if you want to allow <% print('%>'); %>
, but most likely you do
精彩评论