I'm interested in writing an x86 assembler for a hobby project.
At first it seemed fairly straightforward to me, but the more I read into it, the more unanswered questions I find myself having. I'm not totally inexperienced: I've used MIPS assembly a fair amount and I've written a toy compiler for a subset of C in school.
My goal is to write a simple, but functional x86 assembler. I'm not looking to make a commercially viable assembler, but simply a hobby project to strengthen my knowledge in certain areas. So I don't mind if I don't implement every available feature and operation.
I have many questions such as: Should I use a one-pass or two-pass method? Should I use ad-hoc parsing or define formal grammars and use a parser-generator for my instructions? At what stage, and how do I resolve the addresses of my symbols?
Given my requirements, can anyone suggest some general guidelines for the methods I should be using in my pet-project assembler?
There is an excellent free e-book (pdf) on how to build assemblers and loaders by David Salomon. You can find it at:
http://www.e-booksdirectory.com/details.php?ebook=1311
You may find the dragon book to be helpful.
The actual title is Compilers: Principles, Techniques, and Tools (amazon.com).
Check out the Intel Architectures Software Developer's Manuals for the complete documentation of the IA-32 and Intel 64 (x86-64) instruction sets.
AMD's architecture technical documents are available on its website as well.
Linkers and Loaders (amazon.com) is a good introduction to object formats and linking issues. (The unedited original manuscript is also available online.)
While many people suggest ad hoc parsers, I think these days one should use a parser generator because it really simplifies the problem of building all the complex syntax one needs for an interesting modern assembler. See my BNF example/answer to StackOverflow: Z80 ASM BNF.
"One pass" vs. "Two pass" refers to whether you read the source code itself twice. You can always do a one pass assembler. Here's two ways:
1) Generate binary results on the fly (think of these as (address, value) pairs in the abstract, with addresses that tend to increase monotonically), and emit back-patches as fixups when you find information that lets you resolve forward references (think of these as (address, value) pairs where the addresses are used to overwrite previously emitted locations). For JMPs, commit to the type/size of the JMP opcode when you encounter it. The default can be either short or long, depending on taste or even an assembler option. A small bit of syntax entered by the coder saying "use the other kind" or "I insist on this kind" (e.g., "JMP long target") suffices to handle those cases where the assembler's default choice is wrong. (This is assembler; it's OK to have funky rules.)
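The fixup scheme above can be sketched in a few lines. This is an illustrative Python sketch, not real x86 assembly machinery: only a rel32 JMP (0xE9) and NOP (0x90) are handled, and the tiny line format is made up for the example.

```python
# One-pass assembly with back-patching: emit bytes as we go; when a JMP
# targets a label we have not seen yet, emit a placeholder and record a
# fixup (patch_offset, label) to overwrite later.

def assemble(lines):
    code = bytearray()
    labels = {}        # label name -> address in `code`
    fixups = []        # (offset of rel32 field, label name)

    for line in lines:
        line = line.strip()
        if line.endswith(":"):                    # label definition
            labels[line[:-1]] = len(code)
        elif line.startswith("JMP "):
            target = line.split()[1]
            code.append(0xE9)                     # JMP rel32 opcode
            if target in labels:                  # backward reference
                rel = labels[target] - (len(code) + 4)
                code += rel.to_bytes(4, "little", signed=True)
            else:                                 # forward reference
                fixups.append((len(code), target))
                code += b"\x00\x00\x00\x00"       # placeholder
        elif line == "NOP":
            code.append(0x90)

    for offset, label in fixups:                  # patch the placeholders
        rel = labels[label] - (offset + 4)
        code[offset:offset + 4] = rel.to_bytes(4, "little", signed=True)
    return bytes(code)
```

Note that the second loop walks the fixup list, not the source, so the source is still read only once.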
2) On the (first) pass, generate data to buffers in memory. Default JMPs (and other span-dependent instructions) to short offsets, and record the locations of all such instructions. At the end of this pass, go back to the JMPs and widen those that are "too short"; shuffle the code and adjust the other JMPs accordingly. A clever scheme for doing this that achieves nearly optimal sets of short JMPs is documented in T. G. Szymanski's 1978 paper, "Assembling Code for Machines with Span-Dependent Instructions".
You will need to write a lexer and parser to read in source code and output the abstract syntax tree (AST). The AST can then be traversed to generate the byte code output.
I recommend researching books on writing a compiler. This is usually a college level class, so there should be plenty of books. Sorry, I can't recommend one in particular.
You can also read up on the ANTLR tool. It can take in grammar rules and output code in various languages to do the lexer/parser work for you.
On the one-pass or two-pass: you would need a two-pass compiler to resolve forward references. If that's not important, then a one-pass will do. I recommend you keep it simple, since this is your first compiler.
To answer one of your questions: a single pass over the source is not viable unless you defer or patch code emission after the pass.
Imagine this:
JMP some_label
.. code here
some_label:
what do you emit as the displacement for the JMP instruction? And which JMP encoding do you emit, the short form or the long one? You don't yet know how far away the label is.
So two-pass should be a minimum.
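The classic two-pass answer to the JMP example above can be sketched as follows. This is an illustrative Python sketch with simplified, fixed-size encodings (5-byte JMP rel32, 1-byte NOP), so pass 1 can compute every label address before pass 2 emits a single byte:

```python
# Two-pass assembly: pass 1 measures instruction sizes and records label
# addresses; pass 2 emits code with all references already resolved.

def two_pass(lines):
    SIZES = {"JMP": 5, "NOP": 1}     # fixed encodings keep pass 1 trivial

    labels, addr = {}, 0
    for line in lines:                          # pass 1: addresses only
        line = line.strip()
        if line.endswith(":"):
            labels[line[:-1]] = addr
        else:
            addr += SIZES[line.split()[0]]

    code = bytearray()                          # pass 2: emit bytes
    for line in lines:
        line = line.strip()
        if line.endswith(":"):
            continue
        if line.startswith("JMP"):
            rel = labels[line.split()[1]] - (len(code) + 5)
            code += b"\xE9" + rel.to_bytes(4, "little", signed=True)
        else:
            code.append(0x90)
    return bytes(code)
```

With variable-size jumps (short vs. near), pass 1 gets harder, which is exactly the span-dependence problem discussed in the earlier answer.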
Given that this is a hobby project, a lot of your questions really come down to 'what aspects of the problem are you most interested in looking at and learning about?' If you're interested in seeing how parsing tools map to the problem of assemblers (particularly as it comes down to macro processing and the like), you should use them. On the other hand, if you're not too interested in those questions and just want to get into the questions of instruction packing and layout and are content to have a minimal assembler with no macros, then quick and dirty for the parsing is probably the way to go.
For single pass vs multipass -- are you interested in playing with making a very fast assembler with minimized memory footprint? If so, this question becomes relevant. If not, just slurp the entire program into memory, deal with it there, creating an object image in memory, and then write that out. No real need to worry about 'passes' as such. In this model, you can more easily play around with doing things in different orders to see what the tradeoffs are, which is much of the point of a hobby project.
I've often fantasized about trying to build (yet another) high level computer language. The object would be to try to push the envelope of rapidity of development, and the performance of the result. I would try to build libraries of sort of minimal, fairly highly optimized operations, and then try to develop the language rules in such a way that any statement or expression expressible in the language would result in optimal code.. unless what was being expressed was just inherently suboptimal.
It would compile to byte code, which would be distributed, and then to machine code when installed, or when the processor environment changed. So when an executable loaded, there would be a loader piece that would check the processor and a few bytes of control data in the object, and if the two matched, then the executable part of the object could be loaded straight away; if not, the byte code for that object would have to be recompiled, and the executable part updated. (So it's not Just In Time compilation - it's On Program Install or On CPU Changed compilation.) The loader part would be very short and sweet; it would be in '386 code so it wouldn't need to be compiled. It would only load the byte-code compiler if it needed to, and if so, it would load a compiler object that was small and tight, and optimized for the detected architecture. Ideally, the loader and the compiler would stay resident once loaded, and there would only be one instance of each.
Anyway, I wanted to respond to the idea that you have to have at least two passes - I don't think I quite agree. Yes, I would use a second pass through the compiled code, but not through the source code.
What you do is, when you come across a symbol, check your symbol hash table, and if there's no entry there, create one, and store a 'forward reference' marker in your compiled code with a pointer to the table entry. As you come across the definitions for labels and symbols, update (or put new data into) your symbol table.
Individual compiled objects should never be so large that they take up very much memory, so definitely all the compiled code should be held in memory until the whole thing is ready to be written out. The way you keep your memory footprint small is simply by only dealing with one object at a time, and by never keeping more than one small buffer full of source code in memory at a time. Maybe 64k or 128k or something. (Something large enough that the overhead involved in making the call to load the buffer from disk is small in comparison with the time it takes to read the data from disk, so that the streaming is optimized.)
So, one pass through the source stream for an object, then you string your pieces together, collecting the necessary forward reference info from the hash table as you go, and if the data is not there - that's a compile error. That's the process I would be tempted to try.
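The symbol-table scheme described above, where each entry carries the code locations still waiting for its address, can be sketched like this. This is an illustrative Python sketch; the names (`Symbol`, `reference`, `define`) and the 4-byte absolute-address encoding are made up for the example:

```python
# Symbol table with forward-reference markers: each entry holds either a
# resolved address or a list of code offsets to patch once it is defined.

class Symbol:
    def __init__(self):
        self.address = None
        self.pending = []   # code offsets waiting for this symbol

def reference(symtab, code, name):
    """Emit a 4-byte reference to `name`, as a placeholder if undefined."""
    sym = symtab.setdefault(name, Symbol())
    if sym.address is None:
        sym.pending.append(len(code))           # forward-reference marker
        code += b"\x00\x00\x00\x00"
    else:
        code += sym.address.to_bytes(4, "little")

def define(symtab, code, name, address):
    """Define `name` and back-patch every pending reference to it."""
    sym = symtab.setdefault(name, Symbol())
    sym.address = address
    for off in sym.pending:
        code[off:off + 4] = address.to_bytes(4, "little")
    sym.pending.clear()
```

At the end, any entry whose `address` is still `None` but whose `pending` list is non-empty is an unresolved symbol; per the answer above, that's a compile error.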
Take NASM's tables, and try to implement the more basic instructions, using the tables to drive the encoding.
I have written a couple of parsers, both hand-made and yacc-generated.
Hand-made parsers provide more flexibility; yacc gives you a framework that you must adapt to, or fail. A yacc parser is fast by default, but chasing down shift/reduce and reduce/reduce conflicts can take quite an effort if you are not familiar with them and your parser environment is not the best. The advantage of yacc is that it gives you a system if you need one. A hand-made parser gives you freedom, but can you fill it in? Assembly language seems simple enough to be handled by yacc or similar parsers.
My hand-made parser would contain a tokenizer/lexer, and I would go through the array of tokens with a for loop, doing a kind of event handling by placing ifs or a case statement in the loop and examining the current token and the next/previous one. I might use a separate parser for expressions. I would put the translated code into an array of strings and "note down" the uncalculated parts of the translation so the program can come back to them later and fill in the blanks; there may be blanks, and not everything is known in advance when you parse the code, e.g. the locations of jumps.
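The tokenizer-plus-dispatch-loop style described above can be sketched briefly. This is an illustrative Python sketch for a toy line-oriented assembly syntax; the token pattern and the `("label", ...)` / `("insn", ...)` result shapes are invented for the example:

```python
import re

# Tokenize one line of a toy assembly syntax: identifiers, brackets,
# commas, and signed integers. ';' starts a comment.
TOKEN = re.compile(r"[A-Za-z_]\w*|\[|\]|,|[+-]?\d+")

def tokenize(line):
    return TOKEN.findall(line.split(";")[0])

def parse_line(line):
    """Dispatch on the token stream: a label definition or an instruction."""
    toks = tokenize(line)
    if not toks:
        return None                               # blank or comment-only line
    if len(toks) == 1 and line.split(";")[0].rstrip().endswith(":"):
        return ("label", toks[0])
    mnemonic, operands = toks[0].upper(), []
    for tok in toks[1:]:
        if tok != ",":                            # commas just separate operands
            operands.append(tok)
    return ("insn", mnemonic, operands)
```

A real hand-made parser would add operand sub-parsing (memory expressions like `[eax+4]`), error reporting with line numbers, and the "note down the blanks" bookkeeping for forward jumps, but the shape stays the same: one loop, dispatching on what the current token is.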
On the other hand, whichever way you write your parser the first time, if you have time you can convert it from one type to the other. Depending on who you are, you may even enjoy that.
There are parser tools other than yacc, and they promise more flexibility and fewer "errors", but that does not mean you get no errors; they will just not be as visible, and the result may not be as fast, if that matters.
By the way, if the tokens were stored, one could even consider a mixed yacc and hand-made parser.