JavaCC: How can one exclude a string from a token? (A.k.a. understanding token ambiguity.)

I have already had many problems understanding how ambiguous tokens can be handled elegantly (or at all) in JavaCC. Let's take this example:

I want to parse an XML processing instruction.

The format is: "<?" <target> <data> "?>": target is an XML name, data can be anything except ?>, because it's the closing tag.

So, let's define this in JavaCC:

(I use lexical states, in this case DEFAULT and PROC_INST)

TOKEN : {<#NAME : (very-long-definition-from-xml-1.1-goes-here) >}
TOKEN : {<WSS : (" " | "\t")+ >}   // WSS = whitespaces
<DEFAULT> TOKEN : {<PI_START : "<?" > : PROC_INST}
<PROC_INST> TOKEN : {<PI_TARGET : <NAME> >}
<PROC_INST> TOKEN : {<PI_DATA : ~[] >}   // accept everything
<PROC_INST> TOKEN : {<PI_END : "?>" > : DEFAULT}

Now the part which recognizes processing instructions:

void PROC_INSTR() : { Token t; } {
(
    <PI_START>
    (t=<PI_TARGET>){System.out.println("target: " + t.image);}
    <WSS>
    (t=<PI_DATA>){System.out.println("data: " + t.image);}
    <PI_END>
) {}
}

Let's test it with <?mytarget here-goes-some-data?>:

The target is recognized: "target: mytarget". But now I get my favorite JavaCC parsing error:

!!  procinstparser.ParseException: Encountered "" at line 1, column 15.
!!  Was expecting one of:
!!

Encountered nothing? Was expecting nothing? Or what? Thank you, JavaCC!

I know that I could use the MORE keyword of JavaCC, but this would give me the whole processing instruction as one token, so I'd have to parse/tokenize it further by myself. Why should I do that? Am I writing a parser that does not parse?

The problem is (I guess): since <PI_DATA> recognizes "everything", my definition is wrong. I should tell JavaCC to recognize "everything except ?>" as processing instruction data.

But how can it be done?

NOTE: I can only exclude single characters using ~["a","b","c"]; I can't exclude strings such as ~["abc"] or ~["?>"]. Another great anti-feature of JavaCC.

Thank you.

A word about the tokenizer

The tokenizer (*TokenManager) matches as many input characters as possible. PI_DATA is "~[]" (1 character), so it will match any single input character if it cannot find a longer match. PI_END is "?>" (2 characters), so it will always be matched instead of PI_DATA. This part of your grammar is correct.
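To make the longest-match rule concrete, here is a rough Java simulation of it for the PROC_INST state. It is not JavaCC code: the regular expression used for PI_TARGET is a simplified stand-in for the real NAME definition (an assumption, since that definition was not posted), and the class and method names are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LongestMatchDemo {
    // Token kinds of the PROC_INST state, in declaration order.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        // Assumption: a simplified approximation of the XML NAME definition.
        RULES.put("PI_TARGET", Pattern.compile("[A-Za-z_:][A-Za-z0-9_:.-]*"));
        // ~[] : any single character.
        RULES.put("PI_DATA", Pattern.compile(".", Pattern.DOTALL));
        RULES.put("PI_END", Pattern.compile("\\?>"));
    }

    // At each position, pick the rule with the longest match;
    // on a tie, the rule declared first wins.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestKind = null;
            int bestLen = 0;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestKind = rule.getKey();
                    bestLen = m.end() - pos;
                }
            }
            tokens.add(bestKind + "(" + input.substring(pos, pos + bestLen) + ")");
            pos += bestLen;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Input as seen after PI_START has switched the state to PROC_INST.
        System.out.println(tokenize("mytarget here-goes-some-data?>"));
    }
}
```

Under these assumptions "?>" does come out as PI_END, as described above, but the blank becomes a one-character PI_DATA and the data text is claimed by PI_TARGET, which is where the other suspects below come in.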

An unexpected suspect

The trouble may actually come from NAME. You didn't write the actual definition of that token, so I can only make assumptions about it. If the definition of NAME is too greedy, it will match too many input characters in the state PROC_INST, and you may never encounter PI_DATA or PI_END.

Watch out for a "(...)+" with white spaces, or the evil "(~[])*" that eats everything up to EOF.

Other suspects

A potential problem I see is that PI_TARGET will probably be matched several times, though you would expect PI_DATA to be matched. Once again, I can only guess because I don't have the definition of NAME.

Another point you might want to clarify is this: you define the WSS token, but you don't use it in the state PROC_INST. Should it be a part of PI_DATA? If not, you may want to SKIP it.

Don't abuse the tokenizer

If you find that you cannot make the tokenizer obey you, you may want to move the tricky part to the parser instead. In your case, it's probably difficult to tell PI_TARGET and PI_DATA apart (as mentioned above).

The parser can expect PI data after a PI target, while the tokenizer cannot (or can hardly) carry expectations from one token to the next.

Another advantage of the parser is that you can even write Java code that peeks at the next tokens and reacts accordingly. This should be considered a last resort, but it can be useful when you must do things such as concatenating multiple tokens up to a well-known one. This may be what you're looking for here (with PI_END as the terminator token).
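The concatenation idea can be sketched in plain Java as follows. The Tok class is a hypothetical stand-in for JavaCC's Token, and the list stands in for the token stream; in a generated parser you would typically look ahead with getToken(1) and consume with getNextToken() instead of indexing into a list.

```java
import java.util.Arrays;
import java.util.List;

public class PiDataCollector {
    // Hypothetical stand-in for a JavaCC Token: only kind and image matter here.
    static final class Tok {
        final String kind;
        final String image;
        Tok(String kind, String image) {
            this.kind = kind;
            this.image = image;
        }
    }

    // Concatenates token images from index 'start' up to,
    // but excluding, the first PI_END token.
    static String collectData(List<Tok> tokens, int start) {
        StringBuilder data = new StringBuilder();
        for (int i = start; i < tokens.size(); i++) {
            Tok t = tokens.get(i);
            if (t.kind.equals("PI_END")) {
                return data.toString();
            }
            data.append(t.image);
        }
        throw new IllegalStateException("PI_END not found");
    }

    public static void main(String[] args) {
        // A possible token stream for: mytarget a=1 b?>
        List<Tok> tokens = Arrays.asList(
            new Tok("PI_TARGET", "mytarget"),
            new Tok("PI_DATA", " "),
            new Tok("PI_TARGET", "a"),
            new Tok("PI_DATA", "="),
            new Tok("PI_DATA", "1"),
            new Tok("PI_DATA", " "),
            new Tok("PI_TARGET", "b"),
            new Tok("PI_END", "?>"));
        // Skip the target and the separating blank, then join the rest.
        System.out.println(collectData(tokens, 2));
    }
}
```

The point of the design is that the parser no longer cares whether a fragment was lexed as PI_TARGET or PI_DATA; everything before the terminator is treated as data.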

Finally, a trick

Here is a trick to simplify your grammar a bit:

Skip PI_START, but change the state to PROC_INST nevertheless
In PROC_INST, define PI_DATA as MORE (and rename it to PI_DATA_CHAR, or just don't name it at all)
In PROC_INST, when "?>" is matched, remove the last two characters from the token image, issue PI_DATA and change the state to DEFAULT
In your parser productions, define a processing instruction simply as <PI_DATA>, where the token image of PI_DATA is ready-to-use

Details about manipulating the token image in the tokenizer actions are provided in JavaCC's (sparse...) documentation. It's as easy as setting the length of a StringBuffer.
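The image manipulation itself is plain Java. The sketch below only shows the StringBuffer part in isolation; the method name is made up, and in a real grammar the same two statements would sit inside the lexical action of the rule that matches "?>", operating on the token manager's image buffer and assigning the result to the matched token's image.

```java
public class PiImageTrim {
    // Drops the trailing "?>" (two characters) from the accumulated image.
    static String stripTerminator(StringBuffer image) {
        image.setLength(image.length() - 2);
        return image.toString();
    }

    public static void main(String[] args) {
        // What MORE would have accumulated by the time "?>" is matched,
        // assuming PI_START was skipped.
        StringBuffer image = new StringBuffer("mytarget here-goes-some-data?>");
        System.out.println(stripTerminator(image));
    }
}
```

Note that with this trick the target and the data arrive together in one image, so splitting them at the first whitespace is left to the caller.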

One problem with your grammar is that WSS applies only in the default state. Rewrite as

<DEFAULT, PROC_INST> TOKEN : {< WSS: (" " | "\t")+ > }

The error message is that it was expecting a WSS but found a " ".

As to excluding whole strings, there are several ways to do this outlined in the FAQ.
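As an illustration of the general idea (not necessarily one of the FAQ's variants), "anything except ?>" can be built from single-character exclusions alone. The Java regex below is my own sketch; an untested JavaCC transcription would be ( ~["?"] | ("?")+ ~["?",">"] )* ("?")*.

```java
import java.util.regex.Pattern;

public class NoPiEnd {
    // Either a character other than '?', or a run of '?' followed by a
    // character that is neither '?' nor '>'; trailing '?' are allowed.
    static final Pattern NO_PI_END =
        Pattern.compile("(?:[^?]|\\?+[^?>])*\\?*");

    // True if the whole string contains no "?>" sequence.
    static boolean isValidData(String s) {
        return NO_PI_END.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidData("here-goes-some-data"));
        System.out.println(isValidData("a?b>c"));
        System.out.println(isValidData("a?>b"));
    }
}
```

Used directly as a token, the trailing ("?")* part would still swallow the "?" of the closing "?>" because of the longest-match rule, so the terminator needs separate care (for example by dropping that part and letting the closing token absorb a run of "?").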
