Okay, in my quest to figure out the necessary stuff to write a compiler, I've reached a bit of a roadblock. It seems that every technology or tool that I find has some opposition somewhere.
I use Bison and Flex right now but I'm getting the feeling that this method is outdated. Is this true? Is this a good forward-compatible way to proceed with writing a full fledged programming language?
In a sea of different concepts and tools (ANTLR, LL(k), GLR, LALR, LLVM, Flex, Bison) What's the current trend and best practices for writing compilers? Is the dragon book out of date?
Unless you want to write a truly simple compiler, your focus is wrong.
Writing compilers is only a tiny bit about writing parsers. Having a parser is like climbing the foothills of the Himalayas when the problem is climbing Everest. You get to the top of the foothill and look up ... only 20,000 feet to go and you've only done the truly easy part. And you'll note the technology required to get to the top of the foothills is radically easier than the technology you need to go the rest of the way.
(FYI: the best present parsing technology is GLR, which easily accepts ambiguous grammars without hacking the grammar. GLR even easily parses C++, which violates the folk theorem that C++ is hard to parse. The folk theorem came from people trying to use YACC and ANTLR to parse it).
To build a compiler you need lots of machinery:
- AST building
- Symbol table construction
- Control flow analysis
- Data flow analysis
- Representation of program code essentially as a data flow computation (SSA or triples)
- A model of the target machine
- A means to map program code to machine instructions
- Register allocation
- Optimizations: constant propagation, loop unrolling, ...
We haven't even gotten near global flow analysis, global optimizations, or special handling for modern day instruction sets involving SIMD instructions or cache optimizations. ... The list goes on and on. The Dragon book gives a nice introduction to the basic topics, but doesn't address any of the advanced ones. You'll want Cooper's "Engineering a Compiler" and Muchnick's "Advanced Compiler Design" as references and it would be good if you had skimmed them well before you start.
Building a modern compiler is quite a feat of engineering.
Parsing, although heavily studied, is the least important part of compiling. (Exception: you're designing your own concrete syntax and you are continually refining and changing the language.)
Yacc, Bison, and friends were designed for an era of machines with 64K of memory. They're great for running fast on machines with limited memory. But the amount of human engineering required to force a grammar into LALR(1) form is ridiculous today. Ira Baxter is right that GLR is probably the best, most flexible parsing technology, but PEG (Parsing Expression Grammars) are also good. In both cases the human engineering is light-years ahead of the older tools.
Having dismissed parsing, I will now start another technology food fight :-) Compiling mostly consists of rewriting a program over and over from one form into another, until eventually you reach assembly code or machine code. For this kind of problem you don't really want to use C or C++:
Q: (Asked of Dave Hanson when he published his amazing book on lcc with Chris Fraser) "You and Chris have spent ten years building what may be one of the most carefully engineered compilers ever made. What did you learn from the experience?"
A: "Well, C is a lousy language to write a compiler in."
I urge you to try one of the popular functional languages, like Haskell or Standard ML. People who work in this field widely believe that compilers are the "killer app" for functional languages. Algebraic data types and pattern matching are tailor-made for writing abstract syntax into intermediate code into machine code. A good place to see the power of these techniques is Andrew Appel's book Compiling With Continuations. (Appel's compiler textbook is also a good read and a very elegant design, but he doesn't always explain why the design is the way it is.)
To build a compiler, I'd highly recommend standing on the shoulders of giants. There is a lot of good stuff out there that can be put together to make compilers. I've been working on a compiler part time for C/C++. It uses GLR for parsing, builds an AST, uses SSA as its intermediate form, does inter procedural optimizations, and generates code for X86, ARM, MIPS, PowerPC, Sparc, and others.
The secret? I borrowed code from several sources.
- The preprocessor and error reporting from clang
- The Elkhound and Elsa compiler generator and C/C++ compiler
- The LLVM system for optimization and code generation
Working part time I've been able to put together quite a useful system of tools. If I had tried to start from scratch, I'd barely have the parser finished by now. ;-)
http://ellcc.org
I'll assume you're in the same position as me: you want to write a compiler for fun, and to learn at least a little about each stage of it. So you don't want merely to write a plugin for an existing compiler. And you want to avoid using too many existing compiler modules, except where you can understand exactly what they're doing. In my case I am using bison
, which is a slight exception because it is doing at least a few things that I'm taking for granted (I did study grammars, etc. at university, but that was a long time ago). On the other hand, parser generators are common enough that it is a compiler stage worthy of interest: bison
may stop me writing much parsing code but it is giving me a change to write parser action code.
Contrary to some advice, I'd say you can get started without knowing everything about your input and target languages. With some exceptions, language features are not unfeasibly hard to add later. One exception I've discovered is control-flow: if you write most of the later manipulations to work on a tree form, it can be difficult to cater for statements like break
, continue
, and goto
(even the structured form). So I'd recommend translating from tree to CFG before doing too much of that.
- Write a parser for some reasonably stable subset of the input.
- Add actions that build a useful in-memory representation of it (typically a tree), and get it to print that.
- Get it to print it in a form that looks a bit like the target language. In my case I print the tree node for "x = y + z;" nodes as "ADD x, y, z"; "if (c) { ... }" turns into "bz c label1", then the translation of "..." then "label1:".
- Add optional stages in the middle. These can be optimisation and/or checking stages. You might need one that prepares the representation for easy code generation: I've got a stage that reduces overly complex expressions by adding temporary variables. (This is actually necessary for the output, because the "ADD" instruction can only work on simple inputs.)
- Go back and improve any part of it. E.g. put some checks in the parser actions so errors can be detected at that stage (use of undeclared variables, for instance).
It is surprisingly easy to get most of this done, if you take an iterative approach.
I can't give a comparison of the various approaches but the ANTLR group has covered a wide range of rich target languages:
which include most of the current common ones. ANTLR also supports a variety of output languages. We plan to tackle a CSS-like language
There's not really anything wrong with Flex and Bison, but if you're looking for something a bit more up to date (and object oriented) you might consider boost's Spirit library.
Did someone seriously ask if the dragon book could be out of date? It's the seminal work man. I can't tell you how much i learned just from the first two chapters(cause I've since forgotten it...ba-dum-bum).
Every technology (save maybe the goto statement) has both detractors and supporters. Don't get hung up on "making the right tooling choice" and go whole hog into learning the concepts and implementing them in a way that makes sense. I mean come on man even if you chose the perfect best tools in the world do you think you'd build something as loved, adored and respected as much as FORTRAN is these days...I mean we love it...right?
Of course not man...so much of learning comes from making the mistakes. That's where you learn the most.
YOU CAN DO IT!
Is this for 1) a big existing language like Java or C++ at one extreme, or 2) a little language without fancy datatypes at the other?
If 1, you better get up to speed on all the technologies that Ira mentioned.
If 2, you can do it in no time if you just write a recursive-descent parser, and either a) translate it into your-favorite-language (YFL) as it parses, or b) build a symbol table and parse tree, and then walk that to generate YFL. If you don't want to generate YFL, just write an interpreter that walks the parse tree.
If your goal is to learn all the tricky technologies, then do so. If not, quick-and-dirty is the way to go. If the latter, DO NOT worry about optimization!!
BTW, if you want to go really quick-and-dirty, and you have C or C++, and you're not too proud to write macros, a simple way to create a language is to just write a set of macros. That way you can create your own statements, while taking advantage of the datatypes, expression syntax, efficiency, and run time libraries of the underlying language.
精彩评论