Parsing a code block with EBNF expression

I am using Coco/R to generate a Java-like scanner/parser.

I'm having some trouble creating an EBNF expression to match a code block.

I'm assuming a code block is surrounded by two well-known tokens, <& and &>. For example:

public method(int a, int b) <&  
various code  
&>

Suppose I define a nonterminal symbol:

codeblock = "<&" {ANY} "&>"

If the code between the two symbols contains a '<' character, the generated compiler cannot handle it and reports a syntax error.

Any hint?

Edit:

COMPILER JavaLike
CHARACTERS

nonZeroDigit  = "123456789".
digit         = '0' + nonZeroDigit .
letter        = 'A' .. 'Z' + 'a' .. 'z' + '_' + '$'.

TOKENS
ident = letter { letter | digit }.

PRODUCTIONS
JavaLike = {ClassDeclaration}.
ClassDeclaration = "class" ident ["extends" ident] "{" {VarDeclaration} {MethodDeclaration} "}".
MethodDeclaration = "public" Type ident "(" ParamList ")" CodeBlock.
CodeBlock = "<&" {ANY} "&>".

I have omitted some productions for the sake of simplicity.

This is my actual implementation of the grammar. The main bug is that it fails if the code in the block contains one of the symbols '>' or '&'.

Nick, late to the party here ...

A number of ways to do this:

Define tokens for <& and &> so the lexer knows about them.

You may be able to use a COMMENTS directive

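COMMENTS FROM "<&" TO "&>"

The delimiters are quoted, as Coco/R expects. Note that the scanner skips comments, so the block contents never reach the parser.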

Or hack NextToken() in your scanner.frame file. Do something like this (pseudo-code):

if (Peek() == CODE_START)
{
     while (NextToken() != CODE_END)
     {
        // eat tokens
     }
}
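
To make the pseudo-code above concrete, here is a minimal sketch of the same idea outside Coco/R, in Python. It is not the generated scanner: the token pattern and the names tokenize and skip_code_blocks are made up for illustration. It only shows that once <& and &> are real tokens, everything between them can be eaten regardless of stray '<', '>' or '&' characters.

```python
import re
from typing import List

CODE_START = "<&"
CODE_END = "&>"

# Hypothetical token pattern: the two delimiters first, then identifiers,
# then any other single non-space character as its own token.
TOKEN_RE = re.compile(r"<&|&>|[A-Za-z_$][A-Za-z_$0-9]*|\S")


def tokenize(text: str) -> List[str]:
    """Split the input into a flat list of token strings."""
    return TOKEN_RE.findall(text)


def skip_code_blocks(tokens: List[str]) -> List[str]:
    """Drop every token between CODE_START and CODE_END, keeping the delimiters."""
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        out.append(tok)
        i += 1
        if tok == CODE_START:
            # Eat tokens until the closing delimiter shows up.
            while i < len(tokens) and tokens[i] != CODE_END:
                i += 1
            if i == len(tokens):
                raise SyntaxError("unterminated code block")
            out.append(tokens[i])
            i += 1
    return out
```

With this, the parser only ever sees the two delimiter tokens, so the CodeBlock production can shrink to the pair "<&" "&>".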

Or you can override the Read() method in the Buffer and eat the block at the lowest level.
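
A rough sketch of what eating "at the lowest level" could look like, working on raw characters instead of tokens. This is Python, not the Coco/R Buffer class, and read_code_block is a hypothetical helper name. It assumes blocks do not nest and that "&>" never appears inside the block body.

```python
from typing import Tuple


def read_code_block(source: str, start: int) -> Tuple[str, int]:
    """Return (raw block body, index just past the closing "&>").

    `start` must point at the opening "<&" in `source`.
    """
    if not source.startswith("<&", start):
        raise ValueError('expected "<&" at the given position')
    body_start = start + 2
    # Search on raw characters, so '<', '>' and '&' inside the body are harmless.
    end = source.find("&>", body_start)
    if end == -1:
        raise SyntaxError("unterminated code block")
    return source[body_start:end], end + 2
```

The caller would hand the raw body to whatever processes the embedded code, and resume normal scanning at the returned index.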

HTH

~~You can expand the ANY term to include <&, &>, and another nonterminal (call it ANY_WITHIN_BLOCK say).~~

Then you just use

ANY = "<&" | {ANY_WITHIN_BLOCK} | "&>"
codeblock = "<&" {ANY_WITHIN_BLOCK} "&>"

~~And then the meaning of {ANY} is unchanged if you really need it later.~~

Okay, I didn't know anything about CocoR and gave you a useless answer, so let's try again.

As I started to say later in the comments, I feel that the real issue is that your grammar might be too loose and not sufficiently well specified.

When I wrote the CFG for the one language I've tried to create, I ended up using a sort of "meet-in-the-middle" approach: I wrote the top-level structure AND the immediate low-level combinations of tokens first, and then worked to make them meet in the mid-level (at about the level of conditionals and control flow, I guess).

You said this language is a bit like Java, so let me show you the lines I would write as a first draft of its grammar (in pseudocode, sorry; it's actually close to yacc/bison, and I'm using your brackets instead of Java's):

/* High-level stuff */

program: classes

classes: main-class inner-classes

inner-classes: inner-classes inner-class
             | /* empty */

main-class: class-modifier "class" identifier class-block

inner-class: "class" identifier class-block

class-block: "<&" class-decls "&>"

class-decls: class-decls class-decl
           | /* empty */

class-decl: field-decl
          | method

method: method-signature method-block

method-block: "<&" statements "&>"

statements: statements statement
          | /* empty */

class-modifier: "public"
              | "private"

identifier: /* well, you know */
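
To show how the method-block and statements rules above translate into a parser, here is a small hand-written recursive-descent sketch in Python. The statement rule is my own placeholder (either a nested block or any tokens up to a ";"), since the draft grammar does not define it, and the function names are hypothetical.

```python
from typing import List, Tuple


def parse_method_block(tokens: List[str], pos: int = 0) -> Tuple[list, int]:
    """method-block: "<&" statements "&>". Returns (statements, next position)."""
    if pos >= len(tokens) or tokens[pos] != "<&":
        raise SyntaxError(f'expected "<&" at token {pos}')
    pos += 1
    statements = []
    # statements: statements statement | /* empty */
    while pos < len(tokens) and tokens[pos] != "&>":
        stmt, pos = parse_statement(tokens, pos)
        statements.append(stmt)
    if pos >= len(tokens):
        raise SyntaxError('missing closing "&>"')
    return statements, pos + 1


def parse_statement(tokens: List[str], pos: int) -> Tuple[list, int]:
    """Placeholder statement rule: a nested block, or tokens terminated by ";"."""
    if tokens[pos] == "<&":
        return parse_method_block(tokens, pos)
    stmt = []
    while pos < len(tokens) and tokens[pos] != ";":
        if tokens[pos] in ("<&", "&>"):
            raise SyntaxError('statement not terminated by ";"')
        stmt.append(tokens[pos])
        pos += 1
    if pos >= len(tokens):
        raise SyntaxError('statement not terminated by ";"')
    return stmt, pos + 1
```

Because the block body is described by real rules instead of {ANY}, a '<' or '&' inside a statement is just another token and cannot be confused with the block delimiters.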

While you do all that, also figure out your immediate token combinations: for example, define "number" as a float or an int, and then create rules for adding, subtracting, and so on.
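
As an illustration of that low-level end, here is a tiny Python sketch of a "number" rule (float or int) plus one rule for adding and subtracting. The grammar in the comments and all names are assumptions for the example, not part of your language. Characters the token pattern does not recognize are silently skipped here, which a real scanner should not do.

```python
import re
from typing import List, Union

Number = Union[int, float]

# Hypothetical tokens: floats, ints, and the two operators.
TOKEN_RE = re.compile(r"\d+\.\d+|\d+|[+\-]")


def tokenize(text: str) -> List[str]:
    return TOKEN_RE.findall(text)


def parse_number(tok: str) -> Number:
    """number: float | int"""
    if tok in ("+", "-"):
        raise SyntaxError(f"expected a number, got {tok!r}")
    return float(tok) if "." in tok else int(tok)


def evaluate(text: str) -> Number:
    """expr: expr ("+" | "-") number | number  (left recursion written as a loop)."""
    tokens = tokenize(text)
    if not tokens:
        raise SyntaxError("empty expression")
    value = parse_number(tokens[0])
    pos = 1
    while pos < len(tokens):
        op = tokens[pos]
        if op not in ("+", "-"):
            raise SyntaxError(f"expected an operator, got {op!r}")
        if pos + 1 >= len(tokens):
            raise SyntaxError("operator without a right-hand operand")
        rhs = parse_number(tokens[pos + 1])
        value = value + rhs if op == "+" else value - rhs
        pos += 2
    return value
```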

I don't know what your approach is so far, but you definitely want to make sure you carefully specify everything and use new rules when you want a specific structure. Don't get ridiculous with creating one-to-one rules, but never be afraid to create a new rule if it helps you organize your thoughts better.