|Home > Technical articles||e-mail: email@example.com github: ltratt twitter: @laurencetratt|
August 9 2007
Recently I was discussing Converge with someone, and mentioned how little time the
When I was younger, there were three things in computing that seemed so complex that I had no idea how anyone had ever created them; certainly, I didn't think I would ever be capable of working on such things. The imposing triumvirate were hardware, operating systems, and programming languages. Although electricity has always baffled me, the hardware people have done an amazing job on interoperability of components in recent years, such that I've never felt the need to gain a low-level understanding of how electrons whizz around my systems. Operating systems were partially demystified for me as I moved into the world of open source UNIX-like operating systems (for the record, I've been an OpenBSD user for 8 years or so), where one could poke and prod virtually any aspect - although kernels retain an aura of mystery for most of us. Despite the fact that I was familiar with several languages, and had even implemented a primitive assembler language (which, to my surprise, found its way into a real system - my first large-ish programming success),
Programming languages are, for most of us, an integral part of the computing infrastructure. Only a small handful of languages have ever had a wide impact, and by the time a programming language becomes popular it already probably has a decade or more of history behind it. This means that when the average programmer first comes across a
So let's start looking at things in more detail. If you break a programming language down, it only has a handful of things to it. What's more, the variance between different languages is - despite the unfortunate zealotry which often clouds debate - exceedingly small. In fact, at a high level there are really only two things which any language must have: a definition and an implementation. Sometimes a paper definition is produced up front, and only when this is complete is an implementation created. Sometimes the latter defines the former; but conceptually these are two separate things. Basically the definition says things like
Most people don't have much trouble understanding the high level view, but it's when they start trying to understand the compiler - the thing that ultimately they interact with - that things start getting more complex. The problem is that looking at implementations like Python (quite complex), Java (very complex), or GCC (something beyond merely
magic. Let's break them down a little more.
ParsingParsing is the process of reading in a big wodge of text, discovering its underlying structure, and then creating a parse tree which is easy to operate on. For English, this is a bit like taking in a sentence and discovering what its subject, its verb, and so on is. Parsing doesn't really uncover
meaningas such: that's for later stages.
If we have an input file such as the following:
import Sys func main(): Sys::println(2 + 3)
What is its parse tree? Well, in order to uncover that, we use a parsing system, which given a description of a language structure - a grammar - can automatically create a parse tree. Parsing has a bad reputation, often deservedly so: commonly used parsing technologies are often horrible to use, although it is perfectly possible to make much more pleasant parsing technologies. Regardless of that, this is the only thing in a compiler where you will need to check an external source to understand a bit more about grammars (Wikipedia's page on formal grammars is a decent start). An indicative chunk of the Converge grammar is as follows:
top_level ::= definition ( "NEWLINE" definition )* ::= definition ::= class_def ::= func_def ::= import ::= var_def ( "," var_def )* ":=" expr ::= splice ::= insert import ::= "IMPORT" import_name import_as ( "," import_name import_as )* import_name ::= "ID" ( "::" "ID" )* import_as ::= "AS" "ID" ::=In essence this says: a program consists of zero or more definitions; a definition is a class, a function etc.; an import is one or more module name imports, each of which can assign to a different variable name than the module name (using
top_level -> definition -> import -> <IMPORT import> -> import_name -> <ID Sys> -> import_as -> <NEWLINE> -> definition -> func_def -> func_type -> <FUNC func> -> func_name -> <ID main> -> <(> -> func_params -> <)> -> <:> -> <INDENT> -> func_decls -> expr_body -> expr -> application -> expr -> module_lookup -> expr -> var_lookup -> <ID Sys> -> <::> -> <ID println> -> <(> -> expr -> binary -> expr -> number -> <INT 2> -> binary_op -> <+> -> expr -> number -> <INT 3> -> <)> -> <DEDENT>This might look a bit imposing at first, but it's trivial to convert this back into the original input (although information about blank lines and comments would disappear reversing the transformation). If you want to see what the parse tree is for other Converge inputs, Converge helpfully includes a little program called
If you've understood this part, congratulations - you've got over the main stumbling block to creating a compiler.
From parse tree to abstract syntax treeAs stated earlier, a parse tree captures structure, but it doesn't capture
meaningas such, beyond that which is captured in the structure. This can be seen by the fact that the parse tree still contains things like the
:token, despite the fact this is for the users' visual benefit, rather than effecting the meaning of the program. What the second stage of a compiler does is to understand the meaning of the program (in some sense), strip out all the stuff which doesn't effect that, and convert it into another, very similar, data structure - the Abstract Syntax Tree (AST) - which captures this. For example the text
-> expr -> binary -> expr -> number -> <INT 2> -> binary_op -> <+> -> expr -> number -> <INT 3>into an AST along the lines of
From abstract syntax tree to object code
Object codeis an intentionally vague term - it might be machine code, VM bytecode, or even an intermediary programming language. Basically here we take an AST like
INT 2 INT 3 ADDIn a language such as Converge, where the VM is designed explicitly to be used with the language, then there is a strong relationship between the AST and the object code. In other words, this means that the conversion to object code is about the same size and complexity as the conversion from parse tree to AST. If you're converting to, say, assembly code then this will be a trickier task, but even then there will still be a largely mechanical element to the conversion. You can see this in Converge's
ConclusionThat's it. You don't need to know, or do, any more to create a compiler. When it's broken down in this simple way, I hope that it partly demystifies what a compiler is. There's nothing particularly magical about a compiler, and with the exception of parsing (which regretfully is often more involved than it should be), nothing particularly complex. I hope you get some sense of how little work there is in creating a compiler for a simple language. Of course, if you want to do something a little more complex, then things become rapidly more work, but you'd be amazed how few languages require such complexity.
In conclusion, the main difficulty for most of us in creating a compiler is overcoming the cultural absurdity which tells us that mere mortals can't create compilers. They can. I did. You can too.
|Home > Technical Articles||e-mail: firstname.lastname@example.org github: ltratt twitter: @laurencetratt|
|Copyright © 1995-2012 Laurence Tratt|