How to transform a production to LL(1) for a list separated by a semicolon?_问答_开发者

I'm reading this introductory book on parsing (which is pretty good btw) and one of the exercice is to "build a parser for your favorite language." Since I don't want to die today, I thought I could do a parser for something relatively simple, ie a simplified CSS.

Note: This book teach you how to write a LL(1) parser using the recursive-descent algorithm.

So, as a sub-exercice, I am building the grammar from what I know of CSS. But I'm stuck on a production that I can't transform in LL(1) :

//EBNF
block = "{", declaration, {";", declaration}, [";"], "}"

//BNF
<block> =:: "{" <declaration> "}"
<declaration> =:: <single-declaration> <opt-end> | <single-declaration> ";" <declaration>
<opt-end> =:: "" | ";"

This describe a CSS block. Valid block can have the form :

{ property : value }
{ property : value; }
{ property : value; property : value }
{ property : value; property : value; }
...

The problem is with the optional ";" at the end, because it overlap with the startin开发者_StackOverflowg character of {";", declaration}, so when my parser meet a semicolon in this context, it doesn't know what to do.

The book talk about this problem, but in its example, the semicolon is obligatory, so the rule can be modified like this :

block = "{", declaration, ";", {declaration, ";"}, "}"

So, Is it possible to achieve what I'm trying to do using a LL(1) parser?

I think I got it:

//EBNF 
block = "{", decl, "}"
decl = simple-decl, [";", [decl]]
simple-decl = ...

//BNF
<block> =:: "{" <decl> "}"
<decl> =:: <simple-decl> <decl-end>
<decl-end> =:: ";" <decl-tail> | ""
<decl-tail> =:: <decl> | e

This produce the following code :

private function parseBlock():void {
    accept(Token.LBRACE);
    parseDecl();
    accept(Token.RBRACE);
}


//Token.IDENTIFIER is the starting token of a declaration
private function parseDecl():void {
    accept(Token.IDENTIFIER);
    if(_currentToken.kind == Token.SEMICOLON){
        accept(Token.SEMICOLON);
        if(_currentToken.kind == Token.IDENTIFIER){
            parseDecl();
        }
    }
}

I'm I right?

If you allow null declarations, you can remove the ambiguity:

//EBNF
block = "{", declaration, {";", declaration}, "}"
declaration = "" | ...

//BNF
<block> =:: "{" <declaration-list> "}"
<declaration-list> =:: <declaration> | <declaration> ";" <declaration-list>
<declaration> =:: "" | ...

though this isn't a requirement:

// these don't allow empty declarations
block = "{", {declaration, ";"}, end-declaration, "}"
end-declaration = "" | declaration


<block> =:: "{" <declaration-list> "}"
<declaration-list> =:: "" | <declaration> | <declaration> ";" <declaration-list>

To handle nulls, figure out what terminals may come after the null, and have your parser recognize those. For recursive descent (in not-quite-java):

/*
<block> =:: "{" <declaration-list> "}"
<declaration-list> =:: <declaration> | <declaration> ";" <declaration-list>
<declaration> =:: "" | ...
*/
Node block(tokens) {
    terminal(BLOCKBEGIN, tokens);
    Node decls = declList(tokens);
    terminal(BLOCKEND, tokens);
}
Node declList(tokens) {
    Node dcl = decl(tokens);
    if (expect(LINESEP, tokens)) {
        terminal(LINESEP, tokens);
        return new DeclarationList(dcl, declList(tokens));
    } else {
        return new DeclarationList(dcl);
    }
}
Node decl(tokens) {
    if (expect(BLOCKEND, tokens)) {
        return new Declaration();
    }
    ...
}
/*
<block> =:: "{" <declaration-list> "}"
<declaration-list> =:: "" | <declaration> | <declaration> ";" <declaration-list>
*/
Node block(tokens) {
    terminal(BLOCKBEGIN, tokens);
    Node decls = declList(tokens);
    terminal(BLOCKEND, tokens);
}
Node declList(tokens) {
    if (expect(BLOCKEND, tokens)) {
        return new DeclarationList();
    }
    Node dcl = decl(tokens);
    if (expect(LINESEP, tokens)) {
        terminal(LINESEP, tokens);
        return new DeclarationList(dcl, declList(tokens));
    } else {
        return new DeclarationList(dcl);
    } 
}

For a top-down parser, the process is a little more explicit. When constructing the FOLLOWS relation, you recursively replace nulls with whatever follows the non-terminal that generated the null.

A → B C
B → b
C → D E
D → d | ""
E → e

FOLLOWS(B) ← FIRST(C) = FIRST(D) = {d, ""}
          += - {""} + FOLLOWS(D) = - {""} + FIRST(E) = - {""} + {e}
FOLLOWS(B) = {d, e}

You then fill out the parsing table as normal.