yacc/lex or hand-coding?_问答_开发者_运维开发者技术经验分享

I am working on new programming language, but I was always puzzled by the fact that everyone is using yaxx/lex to parse the code, but I am not.

My compiler (which is already working) is handcoded in C++/STL, and I cannot say it's complex or took too much time. It has both some kind of lexer and parser, but they are not autogenerated.

Earlier, I wrote a C compiler(not full spec) the same way - it was able to compile th开发者_开发技巧e program in 1 pass, with all these back references resolving & preprocessing - this is definitely impossible with yacc/lex.

I just cannot convince myself to scrap all this, and start diving into yaxx/lex - which might need quite an effort to implement and might possibly introduce some grammar limitations.

Is there something I miss when not using yacc/lex? Do I do an evil thing?

The main advantages of using any kind of lexer/parser generator is that it gives you a lot more flexibility if your language evolves. In a hand-coded lexer/parser (especially if you've mixed in a lot of functionality in a single pass!), changes to the language get nasty fairly quickly, whereas with a parser generator you make the change, re-run the generator, and move on with your life. There are certainly no inherent technical limitations to always just writing everything by hand, but I think the evolvability and maintainability of automating away the boring bits is worth it!

Yacc is inflexible in some ways:

good error handling is hard (basically, its algorithm is only defined to parse a correct string correctly, otherwise, all bets are off; this is one of the reasons that GCC moved to a hand-written parser)
context-dependency is hard to express, whereas with a hand-written recursive descent parser you can simply add a parameter to the functions

Furthermore, I have noticed that lex/yacc object code is often bigger than a hand-written recursive descent parser (source code tends to be the other way round).

I have not used ANTLR so I cannot say if that is better at these points.

The other huge advantage of using generators is that they are guaranteed to process exactly and only the language you specified in the grammar. You can't say that of any hand-written code. The LR/LALR variants are also guaranteed to be O(N), which again you can't assert about any hand coding, at least not without a lot of effort in constructing the proof.

I've written both and lived with both and I would never hand-code again. I only did that one because I didn't have yacc on the platform at the time.

Maybe you are missing out on ANTLR, which is good for languages that can be defined with a recursive-descent parsing strategy.

There are potentially some advantages to using Yacc/Lex, but it is not mandatory to use them. There are some downsides to using Yacc/Lex too, but the advantages usually outweigh the disadvantages. In particular, it is often easier to maintain a Yacc-driven grammar than a hand-coded one, and you benefit from the automation that Yacc provides.

However, writing your own parser from scratch is not an evil thing to do. It may make it harder to maintain in future, but it may make it easier, too.

It certainly depends on the complexity of your language grammar. An easy grammar means that there is an easy implementation and you can just do it yourself.

Take a look at maybe the worst possible example at all: C++ :) (Does anybody knows another language, besides natural languages, which are more difficult to parse correctly?) Even with tools like Antlr, is it quite difficult to get it right, though it is manageable. Whereby on the other side, even while being much harder, it seems that some of the best C++ parsers, e.g. GCC and LLVM, are also mostly handwritten.

If you don't need too much flexibility and your language is not too trivial, you will certainly safe some work/time by using Antlr.