
How to replace macros with a grammar-based parser?

Developer https://www.devze.com 2023-04-04 17:06 Source: web

I need a parser for an exotic programming language. I wrote a grammar for it and used a parser generator (PEGjs) to generate the parser. That works perfectly... except for one thing: macros (that replace a placeholder with predefined text). I don't know how to integrate this into a grammar. Let me illustrate the problem:

An example program to be parsed typically looks like this:

instructionA parameter1, parameter2
instructionB parameter1
instructionC parameter1, parameter2, parameter3

No problem so far. But the language also supports macros:

Define MacroX { foo, bar }
instructionD parameter1, MacroX, parameter4

Define MacroY(macroParameter1, macroParameter2) {
  instructionE parameter1, macroParameter1
  instructionF macroParameter2, MacroX
}

instructionG parameter1, MacroX
MacroY

Of course I could define a grammar to identify Macros and references to Macros. But in that case I don't know how I would parse the contents of a Macro, because it's not clear what the macro contains. It could be just one parameter (that's easiest), but it could also be several parameters in one macro (like MacroX in my example, which represents two parameters) or a whole block of instructions (like MacroY). And Macros can even contain other Macros. How do I put this into a grammar if it's not clear what the macro is semantically?

The easiest approach seems to be to run a preprocessor first to replace all the macros and only then run the parser. But then the line numbers get messed up: I want the parser to report the line number when there is a parse error, and after preprocessing the line numbers no longer correspond to the original source.

Help very much appreciated.


Macro processors tend not to respect the boundaries of language elements; in essence, they can (and often do) make arbitrary changes to the apparent input string.

If this is the case, you have little choice: you'll need to build a macro processor that preserves the line numbers.
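
A minimal sketch of such a line-preserving macro processor, in Python rather than in the PEGjs setup from the question. It assumes only single-line definitions of the form `Define Name { ... }`; the names `expand_with_line_map` and `original_line` are made up for illustration. Alongside the expanded text it returns a map from each output line back to the source line that produced it, so a parse error on the expanded text can still be reported against the original file.

```python
import re

# Assumption: only single-line definitions such as "Define MacroX { foo, bar }".
DEFINE_RE = re.compile(r"^Define\s+(\w+)\s*\{(.*)\}\s*$")


def expand_with_line_map(source):
    """Expand single-line macros and return (expanded_lines, line_map).

    line_map[i] is the 1-based line number in the original source that
    produced expanded_lines[i].
    """
    macros = {}

    def substitute(match):
        # Replace a word by its macro body if it names a macro, else keep it.
        word = match.group(0)
        return macros.get(word, word)

    expanded, line_map = [], []
    for lineno, line in enumerate(source.splitlines(), start=1):
        definition = DEFINE_RE.match(line)
        if definition:
            # Expand the body right away so macros may use earlier macros.
            body = definition.group(2).strip()
            macros[definition.group(1)] = re.sub(r"\w+", substitute, body)
            continue  # a definition produces no output line
        expanded.append(re.sub(r"\w+", substitute, line))
        line_map.append(lineno)
    return expanded, line_map


def original_line(line_map, expanded_lineno):
    """Translate a 1-based line number in the expanded text back to the source."""
    return line_map[expanded_lineno - 1]
```

A parser run on `"\n".join(expanded)` can then pass the line number of any error through `original_line` before printing it. Block macros that expand to several lines would fit the same scheme by recording the call-site line once per produced line.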

If the macros always contain well-structured language elements, and they always occur in structured places in the code, then you can add the notions of macro definition and macro call to your grammar. This may make your parses ambiguous; foo(x) in C code might be a macro call, or it might be a function call. You'll have to resolve that ambiguity somehow. C parsers used to solve such ambiguity problems by collecting symbol table information as they parsed; if you record whether foo is a macro as you parse, then you can determine whether foo(x) is a macro call or not.
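
The symbol-table idea can be sketched roughly as follows, in Python for brevity. This is not a full parser: the function `classify_statements` and its tuple output are invented for illustration, and it assumes single-line definitions in the syntax from the question. The point is only that names are recorded when a definition is seen and consulted when a later statement starts with an identifier.

```python
import re


def classify_statements(source):
    """Classify each non-empty line, using macro names collected so far.

    Returns a list of (line_number, kind, name) tuples, where kind is
    "macro_definition", "macro_call" or "instruction".
    """
    macro_names = set()  # the "symbol table": names known to be macros
    result = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        stripped = line.strip()
        if not stripped:
            continue
        definition = re.match(r"Define\s+(\w+)", stripped)
        if definition:
            macro_names.add(definition.group(1))
            result.append((lineno, "macro_definition", definition.group(1)))
            continue
        head_match = re.match(r"\w+", stripped)
        head = head_match.group(0) if head_match else stripped
        # The same leading identifier is read differently depending on
        # whether a definition for it has already been seen.
        kind = "macro_call" if head in macro_names else "instruction"
        result.append((lineno, kind, head))
    return result
```

With this ordering rule a macro has to be defined before its first use; a name used earlier than its definition is treated as an ordinary instruction.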


With PEG you have to manually define the places where you check for macro expansions. You can add your macro to a hash and check for it in the PEG rule(s) that allow macros (infix expr, postfix expr, unop, binop, function call, ...). It's not as easy as in Lisp, but much easier than with YACC and its operator precedence hacks :)

Other known PEG frameworks that allow macros, like Parrot, Perl 6, Katahdin or PFront, use the trick of executing the parser at run time, trading away performance. Or you can do both and allow pre-compiled and interpreted PEG parsing. Several projects have thought about that, but you need a fast VM, like LuaJIT, Java, the CLR or friends.

I use special syntax block keywords to load external shared libraries with an external pre-compiled PEG parser, e.g. to parse SQL or FFI declarations into your AST. But you can also require a C compiler and compile the parser at run time for all macros.
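
As a rough illustration of checking a macro hash inside the rule that allows macros, here is a small hand-written parser in Python, structured like PEG rules. In PEGjs the same lookup would live in a rule's action or predicate. Everything here is an assumption made for the sketch: the rule names, the `ParseError` class, and the restriction to single-line `Define Name { ... }` macros used in parameter position. Because the original text is parsed directly, error messages carry the original line numbers.

```python
import re


class ParseError(Exception):
    pass


IDENT = re.compile(r"[A-Za-z_]\w*")


def parse_parameter_list(text, macros, lineno):
    """parameter_list <- parameter (',' parameter)*"""
    params = []
    for raw in text.split(","):
        params.extend(parse_parameter(raw.strip(), macros, lineno))
    return params


def parse_parameter(token, macros, lineno):
    """parameter <- macro_reference / identifier   (ordered choice)"""
    if token in macros:
        # Macro reference: parse the stored body with the same rule and
        # splice the result in. Note: no guard against self-referencing macros.
        return parse_parameter_list(macros[token], macros, lineno)
    if IDENT.fullmatch(token):
        return [token]
    raise ParseError(f"line {lineno}: invalid parameter {token!r}")


def parse_program(source):
    """Return a list of (line_number, instruction_name, parameters)."""
    macros = {}  # the hash that the parameter rule consults
    instructions = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        definition = re.fullmatch(r"Define\s+(\w+)\s*\{(.*)\}", line)
        if definition:
            macros[definition.group(1)] = definition.group(2).strip()
            continue
        name, _, rest = line.partition(" ")
        params = parse_parameter_list(rest, macros, lineno) if rest.strip() else []
        instructions.append((lineno, name, params))
    return instructions
```

Only `parse_parameter` knows about macros, so a rule that should not accept macros never looks at the hash.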


With PEG it is significantly easier than with anything else. First of all, Packrat-based parsers and the like are extensible. Your macro definition can modify the syntax, so the next time the macro is used it will be parsed naturally. See here and here for some extreme examples of this approach.

Another possibility is to chain parsers, which is also trivial with PEG-based approaches.
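
A rough sketch of what chaining could look like, written in Python with plain functions standing in for generated parsers. The first stage collects multi-line block macros and passes the remaining lines on together with their line numbers; the second stage parses instructions and expands a macro call into its body. The function names and the parameterless `Define Name {` ... `}` block form are assumptions for the sketch; macro parameters and macros nested inside a body are not handled.

```python
import re


def collect_block_macros(source):
    """Stage 1: split the source into a macro table and the remaining lines.

    Returns (macros, statements); statements is a list of
    (original_line_number, text) pairs.
    """
    macros, statements = {}, []
    current_name, current_body = None, []
    for lineno, raw in enumerate(source.splitlines(), start=1):
        line = raw.strip()
        if current_name is not None:
            if line == "}":  # end of the macro body
                macros[current_name] = current_body
                current_name, current_body = None, []
            elif line:
                current_body.append(line)
            continue
        header = re.fullmatch(r"Define\s+(\w+)\s*\{", line)
        if header:
            current_name = header.group(1)
        elif line:
            statements.append((lineno, line))
    return macros, statements


def parse_statements(macros, statements):
    """Stage 2: parse instructions, expanding block macros at the call site."""
    parsed = []
    for lineno, text in statements:
        # A macro call expands to its body lines; each expanded instruction
        # keeps the line number of the call site.
        for body_line in macros.get(text, [text]):
            name, _, rest = body_line.partition(" ")
            params = [p.strip() for p in rest.split(",")] if rest.strip() else []
            parsed.append((lineno, name, params))
    return parsed


def parse(source):
    """Chain the two stages."""
    return parse_statements(*collect_block_macros(source))
```

Since the second stage receives line numbers from the first, errors it reports can refer to the original source even though it never sees the definitions.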
