Tracking down problems with text being ignored by ANTLR parser

开发者 https://www.devze.com 2023-04-02 17:11 Source: web

I'm working on a parser that will split a string containing a person's full name into components (first, middle, last, title, suffix, ...). When I try a basic example "J. A. Doe" in ANTLRWorks, it matches the fname and lname rules, but ignores the "A.". How do I troubleshoot this type of problem?

Stripped-down grammar:

grammar PersonNamesMinimal;

fullname returns [Name name]
 : (directory_style[name] | standard[name] | proper_initials[name]);

fullname_only returns [Name name]: f=fullname EOF;

standard[Name name]
 : fname[name] ' ' (mname[name] ' ')* lname[name] ;

proper_initials[Name name]: a=INITIAL ' '? b=INITIAL lname[name];

sep: ',' | ', ' | ' ';
dir_sep: ',' | ', ' | ' , ';

directory_style[Name name]
 : lname[name] dir_sep fname[name] (' ' mname[name])*;

fname[Name name] : (f=NAME | f=INITIAL);

mname[Name name] : (m=NAME | m=INITIAL); // Weird bug when mname is "F."

lname[Name name] : a=single_lname (b='-' c=single_lname)?;
single_lname returns [String s]
 : (p=LNAME_PREFIX r=NAME)
 | r=NAME;
LNAME_PREFIX : (V O N | V A N ' ' D E R | V A N ' ' D E N | V A N | D E ' ' L A | D E | B I N) ' ';

O_APOS: ('O'|'o') '\'';
NAME: (O_APOS? LETTER LETTER+) | LETTER;
INITIAL: LETTER '.';

AND: ( ' '+ A N D ' '+ ) | (' '* '&' ' '*);
fragment WORD : LETTER+;
COMMA : ',';
//WS : ( '\t' | ' ' );

fragment DIGIT : '0' .. '9';
fragment LETTER : 'A' .. 'Z' | 'a' .. 'z';

//{{{ fragments for each letter of alphabet
fragment A : 'A' | 'a';
fragment B : 'B' | 'b';
fragment C : 'C' | 'c';
fragment D : 'D' | 'd';
fragment E : 'E' | 'e';
fragment F : 'F' | 'f';
fragment G : 'G' | 'g';
fragment H : 'H' | 'h';
fragment I : 'I' | 'i';
fragment J : 'J' | 'j';
fragment K : 'K' | 'k';
fragment L : 'L' | 'l';
fragment M : 'M' | 'm';
fragment N : 'N' | 'n';
fragment O : 'O' | 'o';
fragment P : 'P' | 'p';
fragment Q : 'Q' | 'q';
fragment R : 'R' | 'r';
fragment S : 'S' | 's';
fragment T : 'T' | 't';
fragment U : 'U' | 'u';
fragment V : 'V' | 'v';
fragment W : 'W' | 'w';
fragment X : 'X' | 'x';
fragment Y : 'Y' | 'y';
fragment Z : 'Z' | 'z';
//}}}

In creating this stripped-down version I discovered that removing either the directory_style rule or the LNAME_PREFIX rule causes the mname rule to work as expected, but I'm not sure why.


The problem is not with your parser rules (at least, not the problem you're facing at the moment :)). Something is going wrong in the lexer.

The "A." from the input "J. A. Doe" is not being tokenized as an INITIAL. Instead, the lexer tries to create an AND token starting at the space in front of it (note the space before the 'A'!), because the first alternative of AND begins with one or more spaces followed by an A. You can see this by first parsing the input "J. X. Doe", which contains no A after a space, with this even more trimmed grammar:

grammar PersonNamesMinimal;

// just parse zero or more tokens (no matter what) and print their type and text
parse
  :  (t=. {System.out.printf("\%-25s \%s\n", tokenNames[$t.type], $t.text);})* EOF
  ;


LNAME_PREFIX : (V O N | V A N ' ' D E R | V A N ' ' D E N | V A N | D E ' ' L A | D E | B I N) ' ';
O_APOS       : ('O'|'o') '\'';
NAME         : (O_APOS? LETTER LETTER+) | LETTER;
INITIAL      : LETTER '.';
AND          : ( ' '+ A N D ' '+ ) | (' '* '&' ' '*);
COMMA        : ',';

fragment LETTER : 'A' .. 'Z' | 'a' .. 'z';

fragment A : 'A' | 'a';
fragment B : 'B' | 'b';
fragment C : 'C' | 'c';
fragment D : 'D' | 'd';
fragment E : 'E' | 'e';
fragment F : 'F' | 'f';
fragment G : 'G' | 'g';
fragment H : 'H' | 'h';
fragment I : 'I' | 'i';
fragment J : 'J' | 'j';
fragment K : 'K' | 'k';
fragment L : 'L' | 'l';
fragment M : 'M' | 'm';
fragment N : 'N' | 'n';
fragment O : 'O' | 'o';
fragment P : 'P' | 'p';
fragment Q : 'Q' | 'q';
fragment R : 'R' | 'r';
fragment S : 'S' | 's';
fragment T : 'T' | 't';
fragment U : 'U' | 'u';
fragment V : 'V' | 'v';
fragment W : 'W' | 'w';
fragment X : 'X' | 'x';
fragment Y : 'Y' | 'y';
fragment Z : 'Z' | 'z';

SPACE : ' ';

with the class:

import org.antlr.runtime.*;

public class Main {
  public static void main(String[] args) throws Exception {
    PersonNamesMinimalLexer lexer = new PersonNamesMinimalLexer(new ANTLRStringStream(args[0]));
    PersonNamesMinimalParser parser = new PersonNamesMinimalParser(new CommonTokenStream(lexer));
    parser.parse();
  }
}

Then generate the lexer and parser, compile everything, and run Main with "J. X. Doe" as a command-line parameter:

java -cp antlr-3.3.jar org.antlr.Tool PersonNamesMinimal.g 
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main "J. X. Doe"

which prints the following on your console:

INITIAL                   J.
SPACE                      
INITIAL                   X.
SPACE                      
NAME                      Doe

(i.e. the expected output)

But now provide "J. A. Doe":

java -cp .:antlr-3.3.jar Main "J. A. Doe"

and the following output is produced:

line 1:4 mismatched character '.' expecting set null
INITIAL                   J.
SPACE                      
NAME                      Doe

If you now comment out the AND rule in your lexer:

...
INITIAL      : LETTER '.';
//AND          : ( ' '+ A N D ' '+ ) | (' '* '&' ' '*);
COMMA        : ',';
...

and test "J. A. Doe" again:

java -cp antlr-3.3.jar org.antlr.Tool PersonNamesMinimal.g 
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main "J. A. Doe"

you will see this:

INITIAL                   J.
SPACE                      
INITIAL                   A.
SPACE                      
NAME                      Doe

(i.e. all goes well!)
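
To make the column in the error message ("line 1:4") concrete, here is a minimal Java sketch that hand-matches the first alternative of AND, ' '+ A N D ' '+, against the input. It is only an illustration: the class AndMismatchDemo and the method firstMismatch are hypothetical names, and this is not the code ANTLR generates.

```java
// Hand-rolled model of the first AND alternative: ' '+ A N D ' '+
// Illustration only; ANTLR's generated lexer works differently internally.
class AndMismatchDemo {

    /**
     * Tries to match ' '+ A N D ' '+ starting at index start.
     * Returns -1 on success, otherwise the index of the first
     * character that does not fit the pattern.
     */
    static int firstMismatch(String input, int start) {
        int i = start;
        // ' '+ : at least one space
        if (i >= input.length() || input.charAt(i) != ' ') return i;
        while (i < input.length() && input.charAt(i) == ' ') i++;
        // A N D, case-insensitive like the letter fragments
        for (char expected : new char[] {'a', 'n', 'd'}) {
            if (i >= input.length()
                    || Character.toLowerCase(input.charAt(i)) != expected) {
                return i;
            }
            i++;
        }
        // trailing ' '+ : one space is enough to call it a match here
        if (i >= input.length() || input.charAt(i) != ' ') return i;
        return -1;
    }

    public static void main(String[] args) {
        // Index 2 is the space after "J." in all three inputs.
        System.out.println(firstMismatch("J. A. Doe", 2));  // 4: 'A' fits, '.' does not
        System.out.println(firstMismatch("J. X. Doe", 2));  // 3: 'X' rules AND out at once
        System.out.println(firstMismatch("J. and Doe", 2)); // -1: full match
    }
}
```

With "J. A. Doe" the mismatch shows up at index 4, the '.', which is the position the lexer complains about. With "J. X. Doe" the attempt already fails on the 'X', one character after the space, which fits the observation that this input tokenizes fine.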


How to fix it? If I were you, I'd first clean up the lexer by removing all the literal spaces from the other rules and matching whitespace in a single rule that puts it on the HIDDEN channel, so you won't have to account for spaces inside other parser and lexer rules:

SPACE
  :  (' ' | '\t' | '\r' | '\n')+ {$channel=HIDDEN;}
  ; 

That will at least solve this current problem you're facing. But there will probably be more...
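
As a rough picture of what the HIDDEN channel buys you, the following self-contained Java sketch models a token list where every token carries a channel and the parser only looks at the default one. The class HiddenChannelDemo, the Tok type and the channel constants are hypothetical stand-ins, not the ANTLR runtime classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for a token stream; names and values are hypothetical.
class HiddenChannelDemo {
    static final int DEFAULT = 0;
    static final int HIDDEN = 1;

    static final class Tok {
        final String type;
        final String text;
        final int channel;

        Tok(String type, String text, int channel) {
            this.type = type;
            this.text = text;
            this.channel = channel;
        }
    }

    /** Returns the token types a parser would see: everything not hidden. */
    static List<String> visibleTypes(List<Tok> tokens) {
        List<String> visible = new ArrayList<>();
        for (Tok t : tokens) {
            if (t.channel == DEFAULT) {
                visible.add(t.type);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        // Tokens as the lexer would produce them for "J. A. Doe".
        List<Tok> lexed = Arrays.asList(
            new Tok("INITIAL", "J.", DEFAULT),
            new Tok("SPACE", " ", HIDDEN),
            new Tok("INITIAL", "A.", DEFAULT),
            new Tok("SPACE", " ", HIDDEN),
            new Tok("NAME", "Doe", DEFAULT));
        System.out.println(visibleTypes(lexed)); // [INITIAL, INITIAL, NAME]
    }
}
```

Since the parser in this model only sees INITIAL INITIAL NAME, the parser rules no longer need ' ' literals between the name parts.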


EDIT

bemace wrote:

How would I modify the AND rule then so that it only matches whole words and not things like "stand"?

You don't need to do anything special for that to happen. As long as you have a rule that matches "stand", or even "andre", then they will not be tokenized as AND. In your case, NAME will match both of them, and because NAME matches more characters than AND for the input "stand" and "andre", they will become NAME tokens.

This is how ANTLR's lexer works: the longest match is chosen, and if two rules match the same number of characters, the rule that is defined first takes precedence over the other rule.

A small test:

grammar PersonNamesMinimal;

parse
  :  (t=. {System.out.printf("\%-25s \%s\n", tokenNames[$t.type], $t.text);})* EOF
  ;

AND
  :  A N D
  |  '&'
  ;

LNAME_PREFIX 
  :  V O N 
  |  V A N SPACES D E R 
  |  V A N SPACES D E N 
  |  V A N 
  |  D E SPACES L A 
  |  D E 
  |  B I N
  ;

INITIAL
  :  LETTER '.'
  ;

NAME
  :  (O '\'')? LETTER+ 
  ;

COMMA
  :  ','
  ;

SPACE 
  :  (' ' | '\t') {$channel=HIDDEN;}
  ;

fragment LETTER : 'A' .. 'Z' | 'a' .. 'z';
fragment SPACES : (' ' | '\t')+;
fragment A : 'A' | 'a';
fragment B : 'B' | 'b';
fragment C : 'C' | 'c';
fragment D : 'D' | 'd';
fragment E : 'E' | 'e';
fragment F : 'F' | 'f';
fragment G : 'G' | 'g';
fragment H : 'H' | 'h';
fragment I : 'I' | 'i';
fragment J : 'J' | 'j';
fragment K : 'K' | 'k';
fragment L : 'L' | 'l';
fragment M : 'M' | 'm';
fragment N : 'N' | 'n';
fragment O : 'O' | 'o';
fragment P : 'P' | 'p';
fragment Q : 'Q' | 'q';
fragment R : 'R' | 'r';
fragment S : 'S' | 's';
fragment T : 'T' | 't';
fragment U : 'U' | 'u';
fragment V : 'V' | 'v';
fragment W : 'W' | 'w';
fragment X : 'X' | 'x';
fragment Y : 'Y' | 'y';
fragment Z : 'Z' | 'z';

And if you now parse the input:

"Andre and stand van     der"

you will see the expected tokens being created:

java -cp .:antlr-3.3.jar Main "Andre and stand van     der"

NAME                      Andre
AND                       and
NAME                      stand
LNAME_PREFIX              van     der
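
The tie-breaking described above (longest match wins, and on a tie the rule defined first wins) can also be modelled in a few lines of plain Java. This is a simplified sketch using java.util.regex, with the lexer rules from the small test rewritten as hypothetical regular expressions; it is not how ANTLR's generated lexer is implemented.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified model: longest match wins, ties go to the rule defined first.
class LongestMatchDemo {
    // LinkedHashMap keeps the rules in definition order.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("AND", Pattern.compile("(?i)and|&"));
        // Longer alternatives are listed before their prefixes, as in the grammar.
        RULES.put("LNAME_PREFIX", Pattern.compile(
            "(?i)von|van[ \\t]+der|van[ \\t]+den|van|de[ \\t]+la|de|bin"));
        RULES.put("INITIAL", Pattern.compile("[A-Za-z]\\."));
        RULES.put("NAME", Pattern.compile("(?:[Oo]')?[A-Za-z]+"));
        RULES.put("COMMA", Pattern.compile(","));
        RULES.put("SPACE", Pattern.compile("[ \\t]"));
    }

    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestLen = 0;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Strictly greater: on a tie the earlier rule keeps the win.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestRule = rule.getKey();
                    bestLen = m.end() - pos;
                }
            }
            if (bestRule == null) {
                throw new IllegalArgumentException("no rule matches at index " + pos);
            }
            if (!bestRule.equals("SPACE")) { // SPACE is hidden, so it is not reported
                out.add(bestRule + " " + input.substring(pos, pos + bestLen));
            }
            pos += bestLen;
        }
        return out;
    }

    public static void main(String[] args) {
        for (String token : tokenize("Andre and stand van     der")) {
            System.out.println(token);
        }
    }
}
```

For "Andre", both AND (3 characters) and NAME (5 characters) match at the start, and NAME wins on length. For the standalone "and", AND and NAME both match 3 characters, and AND wins because it is defined first.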