Suppose we have input that looks like a sequence of simple English statements, each on a separate line, like these:
Alice checks
Bob bets 100
Charlie raises 100
Alice folds
Let's try parsing it with this grammar:
actions: action* EOF;
action: player=name (check | call | raise | fold) NEWLINE;
check: 'checks';
call: 'calls' amount;
raise: 'raises' amount;
fold: 'folds';
name: /* The subject of this question */;
amount: '$'? INT;
INT: ('0'..'9')+;
NEWLINE: '\r'? '\n';
The number of different verbs is fixed, but what's interesting is that the name we are trying to match can contain spaces, and verbs can be part of it, too! So the following input is valid:
Guy who always bets 100 checks
Guy who always checks bets 100
Guy who always calls folds
Guy who always folds raises 100
Guy who always checks and then raises bets by others calls $100
So the question is: how do we define name so that it is just greedy enough to eat spaces and words that we usually treat as verbs, but not so greedy that the verbs can no longer be matched by the action rule?
My first attempt at solving this looked like this:
name: WORD (S WORD)*;
WORD: ('a'..'z'|'A'..'Z'|'0'..'9')+; // Yes, 1234 is a WORD, too...
S: ' '; // We have to keep spaces in names
Unfortunately, this will not match 'Guy who always bets', since bets is not a WORD but a different token, defined by a literal in the bets rule. I wanted to get around that by creating a rule like keyword[String word] and making other rules match, say, keyword["bets"] instead of a literal, but that's where I got stuck. (I guess I could just list all my verbs as valid alternatives inside a name, but that just feels wrong.)
What's more: all the names are declared before they are used, so I can read them before I start parsing actions. And they can't be longer than MAX_NAME_LENGTH characters. Could that be of any help here?
Maybe I'm doing it wrong, anyway. ANTLR gurus, can I hear from you?
The easy way out would be to enable global backtracking on your entire grammar. This is normally not recommended, but I guess your grammar will stay relatively small, in which case it won't matter much for the run time of your parser. If you do find it becomes slow, you could uncomment the memoize option, which will make your parser faster at the cost of some memory consumption.
A demo:
in.txt
Guy who always bets 100 checks
Guy who always checks bets 100
Guy who always calls folds
Guy who always folds raises 100
Guy who always checks and then raises bets by others calls $100
Poker.g
grammar Poker;
options {
  backtrack=true;
  // memoize=true;
}
actions
  : action* EOF
  ;
action
  : name SPACES (bets | calls | raises | CHECKS | FOLDS) SPACES? (NEWLINE | EOF)
    {
      System.out.println($name.text);
    }
  ;
bets : BETS SPACES amount;
calls : CALLS SPACES amount;
raises : RAISES SPACES amount;
name : anyWord (SPACES anyWord)*;
amount : '$'? INT;
anyWord : BETS | FOLDS | CHECKS | CALLS | RAISES | INT | WORD;
BETS : 'bets';
FOLDS : 'folds';
CHECKS : 'checks';
CALLS : 'calls';
RAISES : 'raises';
WORD : ('a'..'z' | 'A'..'Z')+;
INT : '0'..'9'+;
SPACES : ' '+;
NEWLINE : '\r'? '\n';
Main.java
import org.antlr.runtime.*;

public class Main {
  public static void main(String[] args) throws Exception {
    PokerLexer lexer = new PokerLexer(new ANTLRFileStream("in.txt"));
    PokerParser parser = new PokerParser(new CommonTokenStream(lexer));
    parser.actions();
  }
}
Running the Main class produces:
bart@hades:~/Programming/ANTLR/Demos/Poker$ java -cp antlr-3.3.jar org.antlr.Tool Poker.g
bart@hades:~/Programming/ANTLR/Demos/Poker$ javac -cp antlr-3.3.jar *.java
bart@hades:~/Programming/ANTLR/Demos/Poker$ java -cp .:antlr-3.3.jar Main
Guy who always bets 100
Guy who always checks
Guy who always calls
Guy who always folds
Guy who always checks and then raises bets by others
EDIT
You could do it the other way around: negate the tokens that you don't want anyWord to match:
// other parser rules
anyWord : ~(SPACES | NEWLINE | DOLLAR);
BETS : 'bets';
FOLDS : 'folds';
CHECKS : 'checks';
CALLS : 'calls';
RAISES : 'raises';
WORD : ('a'..'z' | 'A'..'Z')+;
INT : '0'..'9'+;
DOLLAR : '$';
SPACES : ' '+;
NEWLINE : '\r'? '\n';
anyWord now matches any token except SPACES, NEWLINE and DOLLAR. Note the difference between ~ inside lexer rules (negates characters) and parser rules (negates tokens!).
Simple solution: split on whitespace, reverse the input word-by-word, then parse from the right instead of from the left. (This requires rewriting your grammar, of course.)
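To make the idea concrete, here is a minimal hand-written sketch in plain Java rather than an ANTLR grammar. Instead of physically reversing the words, it inspects them from the right-hand end, which amounts to the same thing: the verb is either the last word (checks, folds) or the second-to-last word followed by an amount (bets, calls, raises), and everything before it is the name. The class ReverseActionParser, the Action holder and the use of -1 for "no amount" are hypothetical names and conventions chosen for illustration. Note that splitting on whitespace collapses runs of spaces inside a name.

```java
import java.util.Arrays;
import java.util.List;

public class ReverseActionParser {
    // Verbs that stand alone, and verbs that must be followed by an amount.
    private static final List<String> PLAIN_VERBS = Arrays.asList("checks", "folds");
    private static final List<String> AMOUNT_VERBS = Arrays.asList("bets", "calls", "raises");

    /** Hypothetical result holder; amount is -1 when the verb takes none. */
    public static final class Action {
        public final String name;
        public final String verb;
        public final int amount;

        Action(String name, String verb, int amount) {
            this.name = name;
            this.verb = verb;
            this.amount = amount;
        }
    }

    /** Parses one line by looking at its last one or two words. */
    public static Action parse(String line) {
        String[] words = line.trim().split("\\s+");
        int last = words.length - 1;

        // Case 1: "<name> checks" or "<name> folds" (name needs at least one word).
        if (last >= 1 && PLAIN_VERBS.contains(words[last])) {
            return new Action(join(words, last), words[last], -1);
        }

        // Case 2: "<name> <verb> <amount>", where the amount may start with '$'.
        if (last >= 2 && AMOUNT_VERBS.contains(words[last - 1])) {
            String raw = words[last];
            String digits = raw.startsWith("$") ? raw.substring(1) : raw;
            if (digits.matches("[0-9]+")) {
                return new Action(join(words, last - 1), words[last - 1],
                        Integer.parseInt(digits));
            }
        }
        throw new IllegalArgumentException("Not a valid action: " + line);
    }

    // Joins words[0 .. end-1] with single spaces to rebuild the name.
    private static String join(String[] words, int end) {
        return String.join(" ", Arrays.copyOfRange(words, 0, end));
    }

    public static void main(String[] args) {
        String[] input = {
            "Guy who always bets 100 checks",
            "Guy who always checks bets 100",
            "Guy who always checks and then raises bets by others calls $100"
        };
        for (String line : input) {
            Action a = parse(line);
            System.out.println(a.name + " | " + a.verb + " | " + a.amount);
        }
    }
}
```

Because the verb position is fixed relative to the end of the line, verbs that appear earlier are simply treated as part of the name, so no backtracking is needed.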