A few days ago I posted this question on the ANTLR mailing list, but didn't receive any support. So I'm hoping you guys here can help me out:
I am currently trying to dig into ANTLR, as I find this tool very helpful. The last time I used it, I generated something based on a finished grammar. This time I want to build my own grammar and really start understanding what's happening.
For this I decided to build a parser for some wiki-notation-like text.
Here is an example (without the Start and End rows):
------------ Start ---------------
before
More before
And yet even more ...
[Lineup]
[Floor:Main Floor]
Test1
Test2
[Floor:Classics Floor]
Test3
Test4
Test5
Test6
[/Lineup]
after
more After
..
And even more.
------------ End ---------------
If the text contains a "Lineup" block, then this should be parsed. Its content is at least one "Floor", each followed by a number of names and then either a new "Floor" or the closing "Lineup". I managed to get my parser to parse the text if I change both my grammar and the text I am trying to parse to "[Floor:]" (one block), but I really need a name in there :(
As soon as I change my grammar to support the floor name, nothing works anymore. Could you please help me with this? I'm not looking for someone who fixes it for me without a comment. I would really like to know why my grammar doesn't work. I'm really stuck, and I've been working on this for days now (OK ... I admit, it's just my spare time after work ... but at least all of that).
Here comes my grammar. If I try to parse the full text, I always get EarlyExitExceptions while parsing the :( :
grammar CalendarEventsJava;
/*------------------------------------------------------------------
* PARSER RULES
*------------------------------------------------------------------*/
event : (
(LINE_CONTENT | NEWLINE)*
(lineup (LINE_CONTENT | NEWLINE)*)?
);
lineup : (LINEUP_OPEN NEWLINE floor+ LINEUP_CLOSE);
floor : (FLOOR_OPEN LINE_CONTENT FLOOR_CLOSE NEWLINE lineupEntry+);
lineupEntry
: (LINE_CONTENT? NEWLINE);
artist : LINE_CONTENT;
/*------------------------------------------------------------------
* LEXER RULES
*------------------------------------------------------------------*/
LINEUP_OPEN
: '[Lineup]';
LINEUP_CLOSE
: '[/Lineup]';
FLOOR_OPEN
: '[Floor:';
FLOOR_CLOSE
: ']';
BLANKS : ( ' ' | '\t' )+;
NONBREAKING
: ~('\r' | '\n' | ']');
NEWLINE : '\r'? '\n';
// the content of a line consists of at least one non-breaking character.
LINE_CONTENT
: (NONBREAKING | ']')+ ;
I really hope you can help me, as I'm really anxious to really get started with ANTLR, cause I think it really rocks :)
Chris
The problem
If you examine the token stream after tokenizing your source, you'll see that the following tokens are fed to the parser:
LINEUP_OPEN :: [Lineup]
NEWLINE :: \n
LINE_CONTENT :: [Floor:Main Floor]
NEWLINE :: \n
LINE_CONTENT :: Test1
NEWLINE :: \n
LINE_CONTENT :: Test2
NEWLINE :: \n
LINE_CONTENT :: [Floor:Classics Floor]
NEWLINE :: \n
LINE_CONTENT :: Test3
NEWLINE :: \n
LINE_CONTENT :: Test4
NEWLINE :: \n
LINE_CONTENT :: Test5
NEWLINE :: \n
LINE_CONTENT :: Test6
NEWLINE :: \n
LINEUP_CLOSE :: [/Lineup]
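As you can see, no FLOOR_OPEN token is ever created: each floor header is tokenized as a single LINE_CONTENT token instead. LINE_CONTENT can match the whole line, which is longer than the '[Floor:' prefix that FLOOR_OPEN matches, so the lexer picks LINE_CONTENT.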
Here's how you can manually debug your token stream:
String source =
"[Lineup]\n" +
"[Floor:Main Floor]\n" +
"Test1\n" +
"Test2\n" +
"[Floor:Classics Floor]\n" +
"Test3\n" +
"Test4\n" +
"Test5\n" +
"Test6\n" +
"[/Lineup]";
ANTLRStringStream in = new ANTLRStringStream(source);
CalendarEventsJavaLexer lexer = new CalendarEventsJavaLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
CalendarEventsJavaParser parser = new CalendarEventsJavaParser(tokens);
for(Object o : tokens.getTokens()) {
CommonToken t = (CommonToken)o;
System.out.println(parser.tokenNames[t.getType()] + " :: " + t.getText().replace("\n", "\\n"));
}
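To see why the floor headers end up as LINE_CONTENT without running ANTLR at all, here is a rough sketch that imitates a "longest match wins, earlier rule wins ties" lexer using plain java.util.regex. It is only an approximation of what the generated lexer does, not ANTLR's actual algorithm. The class name LongestMatchDemo and the regex translations of the rules are my own assumptions, and BLANKS and NONBREAKING are left out for brevity:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ANTLR: approximates the original lexer rules with regexes.
public class LongestMatchDemo {
    // Rules in grammar order; LinkedHashMap keeps that order for tie-breaking.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("LINEUP_OPEN", Pattern.compile("\\[Lineup\\]"));
        RULES.put("LINEUP_CLOSE", Pattern.compile("\\[/Lineup\\]"));
        RULES.put("FLOOR_OPEN", Pattern.compile("\\[Floor:"));
        RULES.put("FLOOR_CLOSE", Pattern.compile("\\]"));
        RULES.put("NEWLINE", Pattern.compile("\\r?\\n"));
        // (NONBREAKING | ']')+ is effectively "anything except a line break"
        RULES.put("LINE_CONTENT", Pattern.compile("[^\\r\\n]+"));
    }

    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // A strictly longer match replaces the current best; on ties the earlier rule stays.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestName = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestName == null) {
                throw new IllegalStateException("no rule matches at offset " + pos);
            }
            out.add(bestName + " :: " + input.substring(pos, bestEnd).replace("\n", "\\n"));
            pos = bestEnd;
        }
        return out;
    }

    public static void main(String[] args) {
        String source = "[Lineup]\n[Floor:Main Floor]\nTest1\n[/Lineup]";
        for (String token : tokenize(source)) {
            System.out.println(token);
        }
    }
}
```

With this approximation, "[Floor:Main Floor]" comes out as one LINE_CONTENT token, because that rule matches 18 characters while FLOOR_OPEN only matches 7. "[Lineup]" still becomes LINEUP_OPEN, since both rules match the same length there and LINEUP_OPEN is listed first.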
The solution
Changing:
FLOOR_OPEN
: '[Floor:';
to
FLOOR_OPEN : '[Floor:' ~']'* ']';
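(FLOOR_CLOSE can then be removed, and the floor parser rule no longer needs LINE_CONTENT FLOOR_CLOSE after FLOOR_OPEN, since FLOOR_OPEN now covers the whole header)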
and changing:
NONBREAKING
: ~('\r' | '\n');
to:
NONBREAKING : ~('\r' | '\n' | '[' | ']');
will result in the following parse tree:
Comments
Note that the lexer rules NONBREAKING and LINE_CONTENT are very similar; you probably don't want NONBREAKING to ever appear in the token stream. It would be better to make NONBREAKING a fragment rule. Fragment rules are only used by other lexer rules and will therefore never be used to create a "real" token:
fragment NONBREAKING : ~('\r' | '\n' | '[' | ']');
LINE_CONTENT : NONBREAKING+;
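As a quick sanity check of the revised rules outside ANTLR, the sketch below translates FLOOR_OPEN and LINE_CONTENT into java.util.regex patterns and measures how much of a line each one can match from its start. The class FixedRulesDemo, the helper prefixLength, and the regex translations are assumptions of mine for illustration, not anything ANTLR generates:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: regex approximations of the revised lexer rules.
public class FixedRulesDemo {
    // FLOOR_OPEN : '[Floor:' ~']'* ']';
    static final Pattern FLOOR_OPEN = Pattern.compile("\\[Floor:[^\\]]*\\]");
    // LINE_CONTENT : NONBREAKING+;  with NONBREAKING excluding '\r', '\n', '[' and ']'
    static final Pattern LINE_CONTENT = Pattern.compile("[^\\r\\n\\[\\]]+");

    // Length of the prefix of input matched by rule, or 0 if the rule cannot start there.
    static int prefixLength(Pattern rule, String input) {
        Matcher m = rule.matcher(input);
        return m.lookingAt() ? m.end() : 0;
    }

    public static void main(String[] args) {
        String header = "[Floor:Main Floor]";
        System.out.println(prefixLength(FLOOR_OPEN, header));    // prints 18
        System.out.println(prefixLength(LINE_CONTENT, header));  // prints 0
        System.out.println(prefixLength(LINE_CONTENT, "Test1")); // prints 5
    }
}
```

Because LINE_CONTENT can no longer start with '[', it does not compete with FLOOR_OPEN on the header lines, while ordinary lines such as "Test1" are still matched as content.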
It looks like
NONBREAKING
: ~('\r' | '\n');
is consuming the floor close. It will consume all characters up to the end of the line. Try excluding the floor close character from it.
Kate.