| Home > Technical articles | laurie@tratt.net |
|
August 9 2007 Recently I was discussing Converge with someone, and mentioned how little time the When I was younger, there were three things in computing that seemed so complex that I had no idea how anyone had ever created them; certainly, I didn't think I would ever be capable of working on such things. The imposing triumvirate were hardware, operating systems, and programming languages. Although electricity has always baffled me, the hardware people have done an amazing job on interoperability of components in recent years, such that I've never felt the need to gain a low-level understanding of how electrons whizz around my systems. Operating systems were partially demystified for me as I moved into the world of open source UNIX-like operating systems (for the record, I've been an OpenBSD user for 8 years or so), where one could poke and prod virtually any aspect - although kernels retain an aura of mystery for most of us. Despite the fact that I was familiar with several languages, and had even implemented a primitive assembler language (which, to my surprise, found its way into a real system - my first large-ish programming success), Programming languages are, for most of us, an integral part of the computing infrastructure. Only a small handful of languages have ever had a wide impact, and by the time a programming language becomes popular it already probably has a decade or more of history behind it. This means that when the average programmer first comes across a So let's start looking at things in more detail. If you break a programming language down, it only has a handful of things to it. What's more, the variance between different languages is - despite the unfortunate zealotry which often clouds debate - exceedingly small. In fact, at a high level there are really only two things which any language must have: a definition and an implementation. Sometimes a paper definition is produced up front, and only when this is complete is an implementation created. Sometimes the latter defines the former; but conceptually these are two separate things. Basically the definition says things like Most people don't have much trouble understanding the high level view, but it's when they start trying to understand the compiler - the thing that ultimately they interact with - that things start getting more complex. The problem is that looking at implementations like Python (quite complex), Java (very complex), or GCC (something beyond merely
magic. Let's break them down a little more. ParsingParsing is the process of reading in a big wodge of text, discovering its underlying structure, and then creating a parse tree which is easy to operate on. For English, this is a bit like taking in a sentence and discovering what its subject, its verb, and so on is. Parsing doesn't really uncovermeaningas such: that's for later stages. If we have an input file such as the following: import Sys func main(): Sys::println(2 + 3) What is its parse tree? Well, in order to uncover that, we use a parsing system, which given a description of a language structure - a grammar - can automatically create a parse tree. Parsing has a bad reputation, often deservedly so: commonly used parsing technologies are often horrible to use, although it is perfectly possible to make much more pleasant parsing technologies. Regardless of that, this is the only thing in a compiler where you will need to check an external source to understand a bit more about grammars (Wikipedia's page on formal grammars is a decent start). An indicative chunk of the Converge grammar is as follows:
top_level ::= definition ( "NEWLINE" definition )*
::=
definition ::= class_def
::= func_def
::= import
::= var_def ( "," var_def )* ":=" expr
::= splice
::= insert
import ::= "IMPORT" import_name import_as ( "," import_name import_as )*
import_name ::= "ID" ( "::" "ID" )*
import_as ::= "AS" "ID"
::=
In essence this says: a program consists of zero or more definitions; a definition is a class, a function etc.; an import is one or more module name imports, each of which can assign to a different variable name than the module name (using as). Given that and our input program, the following parse tree is produced (using only a couple of lines of code to call a library function or two):
top_level
-> definition
-> import
-> <IMPORT import>
-> import_name
-> <ID Sys>
-> import_as
-> <NEWLINE>
-> definition
-> func_def
-> func_type
-> <FUNC func>
-> func_name
-> <ID main>
-> <(>
-> func_params
-> <)>
-> <:>
-> <INDENT>
-> func_decls
-> expr_body
-> expr
-> application
-> expr
-> module_lookup
-> expr
-> var_lookup
-> <ID Sys>
-> <::>
-> <ID println>
-> <(>
-> expr
-> binary
-> expr
-> number
-> <INT 2>
-> binary_op
-> <+>
-> expr
-> number
-> <INT 3>
-> <)>
-> <DEDENT>
This might look a bit imposing at first, but it's trivial to convert this back into the original input (although information about blank lines and comments would disappear reversing the transformation). If you want to see what the parse tree is for other Converge inputs, Converge helpfully includes a little program called convergep which given a source file input automatically prints out its parse tree (the above output is cut 'n' paste straight from convergep).
If you've understood this part, congratulations - you've got over the main stumbling block to creating a compiler. From parse tree to abstract syntax treeAs stated earlier, a parse tree captures structure, but it doesn't capturemeaningas such, beyond that which is captured in the structure. This can be seen by the fact that the parse tree still contains things like the :token, despite the fact this is for the users' visual benefit, rather than effecting the meaning of the program. What the second stage of a compiler does is to understand the meaning of the program (in some sense), strip out all the stuff which doesn't effect that, and convert it into another, very similar, data structure - the Abstract Syntax Tree (AST) - which captures this. For example the text 2 + 3 is converted from its parse tree equivalent:
-> expr
-> binary
-> expr
-> number
-> <INT 2>
-> binary_op
-> <+>
-> expr
-> number
-> <INT 3>
into an AST along the lines of Add(Int(2), Int(3)). Notice that typically the parse tree and the AST have almost identical structure. Because of this, the conversion from parse tree to AST is generally very simple. For Converge, this conversion is captured in the compiler/Compiler/IModule_Generator.cv file. As the language grows, clearly this conversion gets larger, but because it is largely mechanical (and therefore often a simple cut 'n' paste job), it is not a difficult task. For example in Converge's early days, when it had approximately the same expressivity as Python, this conversion took around 2-3 days to code from scratch.
From abstract syntax tree to object codeObject codeis an intentionally vague term - it might be machine code, VM bytecode, or even an intermediary programming language. Basically here we take an AST like Add(Int(2), Int(3)) and turn it into instructions such as:
INT 2 INT 3 ADDIn a language such as Converge, where the VM is designed explicitly to be used with the language, then there is a strong relationship between the AST and the object code. In other words, this means that the conversion to object code is about the same size and complexity as the conversion from parse tree to AST. If you're converting to, say, assembly code then this will be a trickier task, but even then there will still be a largely mechanical element to the conversion. You can see this in Converge's compiler/Compiler/Bytecode_Generator.cv file.
ConclusionThat's it. You don't need to know, or do, any more to create a compiler. When it's broken down in this simple way, I hope that it partly demystifies what a compiler is. There's nothing particularly magical about a compiler, and with the exception of parsing (which regretfully is often more involved than it should be), nothing particularly complex. I hope you get some sense of how little work there is in creating a compiler for a simple language. Of course, if you want to do something a little more complex, then things become rapidly more work, but you'd be amazed how few languages require such complexity.In conclusion, the main difficulty for most of us in creating a compiler is overcoming the cultural absurdity which tells us that mere mortals can't create compilers. They can. I did. You can too. |
| Home > Technical Articles | Copyright © 1995-2008 Laurence Tratt laurie@tratt.net |