How to specify 'greedy identifiers with a space' in ANTLR?

Suppose we have the input that looks like the sequence of simple English statements, each on a separate line, like these:

Alice checks
Bob bets 100
Charlie raises 100
Alice folds

Let's try parsing it with this grammar:

actions: action* EOF;
action: player=name (check | bet | call | raise | fold) NEWLINE;
check: 'checks';
bet: 'bets' amount;
call: 'calls' amount;
raise: 'raises' amount;
fold: 'folds';

name: /* The subject of this question */;
amount: '$'? INT;

INT: ('0'..'9')+;
NEWLINE: '\r'? '\n';

The number of different verbs is fixed, but the interesting part is that the name we are trying to match can contain spaces, and verbs can be part of it, too! So the following input is valid:

Guy who always bets 100 checks
Guy who always checks bets 100
Guy who always calls folds
Guy who always folds raises 100
Guy who always checks and then raises bets by others calls $100

So the question is: how do we define name so that it is greedy enough to consume spaces and words that we usually treat as verbs, but not so greedy that the action rule can no longer match the final verb?

My first attempt looked like this:

name: WORD (S WORD)*;
WORD: ('a'..'z'|'A'..'Z'|'0'..'9')+; // Yes, 1234 is a WORD, too...
S: ' '; // We have to keep spaces in names

Unfortunately, this will not match 'Guy who always bets', since bets is not a WORD but a different token, defined by a literal in one of the verb rules. I wanted to get around that by creating a rule like keyword[String word] and making the other rules match, say, keyword["bets"] instead of a literal, but that's where I got stuck. (I guess I could just list all my verbs as valid alternatives within a name, but that feels wrong.)

One more thing: all the names are declared before they are used, so I can read them before I start parsing actions. And they can't be longer than MAX_NAME_LENGTH characters. Could that be of any help here?

Maybe I'm doing it wrong, anyway. ANTLR gurus, can I hear from you?

The easy way out would be to enable global backtracking on your entire grammar. This is normally not recommended, but I guess your grammar will stay relatively small, in which case it won't have much effect on the run time of your parser. If you do find it becomes slow, you could uncomment the memoize option, which makes your parser faster at the cost of some memory consumption.

A demo:

in.txt

Guy who always bets 100 checks
Guy who always checks bets 100
Guy who always calls folds
Guy who always folds raises 100
Guy who always checks and then raises bets by others calls $100

Poker.g

grammar Poker;

options {
  backtrack=true;
  // memoize=true;
}

actions
  :  action* EOF
  ;

action
  :  name SPACES (bets | calls | raises | CHECKS | FOLDS) SPACES? (NEWLINE | EOF)
     {
       System.out.println($name.text);
     }
  ;

bets    : BETS SPACES amount;
calls   : CALLS SPACES amount;
raises  : RAISES SPACES amount;
name    : anyWord (SPACES anyWord)*;
amount  : '$'? INT;
anyWord : BETS | FOLDS | CHECKS | CALLS | RAISES | INT | WORD; 

BETS    : 'bets';
FOLDS   : 'folds';
CHECKS  : 'checks';
CALLS   : 'calls';
RAISES  : 'raises';
WORD    : ('a'..'z' | 'A'..'Z')+;
INT     : '0'..'9'+;
SPACES  : ' '+;
NEWLINE : '\r'? '\n';

Main.java

import org.antlr.runtime.*;

public class Main {
  public static void main(String[] args) throws Exception {
    PokerLexer lexer = new PokerLexer(new ANTLRFileStream("in.txt"));
    PokerParser parser = new PokerParser(new CommonTokenStream(lexer));
    parser.actions();
  }
}

Generating the parser, compiling everything and running the Main class produces:

bart@hades:~/Programming/ANTLR/Demos/Poker$ java -cp antlr-3.3.jar org.antlr.Tool Poker.g 
bart@hades:~/Programming/ANTLR/Demos/Poker$ javac -cp antlr-3.3.jar *.java
bart@hades:~/Programming/ANTLR/Demos/Poker$ java -cp .:antlr-3.3.jar Main
Guy who always bets 100
Guy who always checks
Guy who always calls
Guy who always folds
Guy who always checks and then raises bets by others
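
To see which split between name and action the output above corresponds to, here is a small hand-written Java sketch. It is not how ANTLR implements backtracking internally, and the class and method names (BacktrackSketch, isAction, nameOf) are made up for illustration. It simply tries candidate split points, longest name first, until the rest of the line forms a complete action:

```java
import java.util.Arrays;
import java.util.List;

class BacktrackSketch {
  // Verbs that need an amount, and verbs that stand alone (same split as in Poker.g).
  static final List<String> WITH_AMOUNT = Arrays.asList("bets", "calls", "raises");
  static final List<String> ALONE = Arrays.asList("checks", "folds");

  // True if words[from..end] is exactly one action: a verb, plus an amount where required.
  static boolean isAction(String[] words, int from) {
    int left = words.length - from;
    if (left == 1) {
      return ALONE.contains(words[from]);
    }
    if (left == 2) {
      return WITH_AMOUNT.contains(words[from]) && words[from + 1].matches("\\$?[0-9]+");
    }
    return false;
  }

  // Tries the longest possible name first and backs off one word at a time.
  // Returns null if no split leaves a valid action at the end of the line.
  static String nameOf(String line) {
    String[] words = line.trim().split(" +");
    for (int n = words.length - 1; n >= 1; n--) {
      if (isAction(words, n)) {
        return String.join(" ", Arrays.copyOfRange(words, 0, n));
      }
    }
    return null;
  }

  public static void main(String[] args) {
    System.out.println(nameOf("Guy who always bets 100 checks"));
    System.out.println(nameOf("Guy who always checks bets 100"));
  }
}
```

Because an action has to reach the end of the line, at most one split can succeed, which is why a verb inside the name does no harm.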

EDIT

You could do it the other way around: negate the tokens that you don't want anyWord to match:

// other parser rules
anyWord : ~(SPACES | NEWLINE | DOLLAR); 

BETS    : 'bets';
FOLDS   : 'folds';
CHECKS  : 'checks';
CALLS   : 'calls';
RAISES  : 'raises';
WORD    : ('a'..'z' | 'A'..'Z')+;
INT     : '0'..'9'+;
DOLLAR  : '$';
SPACES  : ' '+;
NEWLINE : '\r'? '\n';

anyWord now matches any token except SPACES, NEWLINE and DOLLAR. Note the difference between ~ inside lexer rules (where it negates characters) and inside parser rules (where it negates tokens!).

Simple solution: split on whitespace, reverse the input word-by-word, then parse from the right instead of from the left. (This requires rewriting your grammar, of course.)
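
A minimal sketch of that idea in plain Java, without ANTLR. The class name ReverseParseSketch and the {name, verb, amount} return shape are assumptions made for this example, and the verb lists are taken from the question. After reversing the words, the optional amount and the verb come first, and whatever is left is the name:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class ReverseParseSketch {
  static final List<String> WITH_AMOUNT = Arrays.asList("bets", "calls", "raises");
  static final List<String> ALONE = Arrays.asList("checks", "folds");

  // Returns {name, verb, amount}; amount is null for checks/folds.
  // Returns null if the line does not end in a valid action or has no name.
  static String[] parse(String line) {
    List<String> words = new ArrayList<>(Arrays.asList(line.trim().split(" +")));
    Collections.reverse(words); // the action now comes first

    int i = 0;
    String amount = null;
    if (words.get(i).matches("\\$?[0-9]+")) {
      amount = words.get(i++); // optional amount, e.g. 100 or $100
    }
    if (i >= words.size()) {
      return null; // the line was only a number
    }
    String verb = words.get(i++);
    boolean verbOk = (amount == null) ? ALONE.contains(verb) : WITH_AMOUNT.contains(verb);
    if (!verbOk || i >= words.size()) {
      return null; // unknown verb, or nothing left for the name
    }

    // Everything that remains is the name; put it back in reading order.
    List<String> name = new ArrayList<>(words.subList(i, words.size()));
    Collections.reverse(name);
    return new String[] { String.join(" ", name), verb, amount };
  }

  public static void main(String[] args) {
    String[] r = parse("Guy who always folds raises 100");
    System.out.println(r[0] + " | " + r[1] + " | " + r[2]);
  }
}
```

Read from the right, the action is unambiguous, so no backtracking is needed: verbs inside the name are never inspected.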