Etaniel Strashimirov (2023-01-10 18:01:31)
I don't understand. First: the conclusion states that there is a difference between specification and implementation, but isn't that obvious? One is a document, which you can only read; the other is an implementation, which you can use to execute code.

Second: by "compilation" I typically understand turning code from one representation into another, i.e. when you start turning the input into opcodes is when you are doing compilation. I'd also expect "doing compilation" to imply turning code from one well-specified representation into another (x86, the JVM, etc. all have specs). Right? Back in the day people used to write asm directly; then, for various reasons, people started using code to produce the binary. But the binary is the goal, not the original code.

Third: from all of that and your second conclusion, isn't the proper conclusion that there is a difference between Potayto and Potahto? How am I wrong?

Tanvi Sharma (2023-01-11 06:35:09)
Thank you for clearly showing the difference between language specification, implementation, compiler and interpreter. Very good read. However, could you also explain why two interpreters/implementations can show different outputs (CPython and PyPy in the article) for the same language specification, and how they are still both considered correct?

Laurence Tratt (2023-01-11 08:19:57)
@Tanvi Specifications don't specify every aspect of how programs run, sometimes out of choice, sometimes because it would be impractical, and sometimes because no-one realised that such flexibility was allowed.

For example, out of choice, specifications such as Python's often want to leave open the possibility of an implementation caching certain objects, since that can make programs faster (even though users can then observe seemingly surprising behaviour, such as in the example in the post you're referencing).
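
To make that kind of deliberate looseness concrete in Rust (the language of the post's code), here is a small sketch of my own, not from the post: the standard library guarantees which key/value pairs a HashMap hands back, but explicitly not the order, so two conforming implementations -- or even two runs of one, since the hasher is randomly seeded -- may print these pairs in different orders, and every order is correct.

use std::collections::HashMap;

fn main() {
    let mut m = HashMap::new();
    m.insert("a", 1);
    m.insert("b", 2);
    m.insert("c", 3);
    // The documentation leaves the iteration order unspecified, so any
    // permutation of the three pairs is a correct output.
    for (k, v) in &m {
        println!("{k}={v}");
    }
}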

As an example of "impractical" - albeit at the far end of things! - one can often tell which implementation one is running on by timing parts of a program at run-time. A specification could remove that way of observing differences by mandating that every operation takes a precise amount of time to run -- but that would leave us disappointed when we buy a faster computer and our programs run at the same speed as before!

Finally, as we've looked more and more to programming languages to squeeze out extra performance, language implementers have increasingly started to examine language specifications with lawyer-like eyes, to find flexibility where none might have been intended. Simplifying grossly, since writing specifications (in natural or formal languages) is hard, there are nearly always unintended gaps: clever language implementers make use of those unintended gaps to make programs run faster while still (technically) adhering to the specification.


Tanvi Sharma (2023-01-11 13:50:48)
Thank you for the detailed explanation @Laurence (if only the language specification were so detailed!). It’s amusing to see that all implementations are technically correct.

Jason C. McDonald (2023-01-12 21:23:37)
I've started calling languages like C++, C, etc. "assembled languages", because the assembly to machine code is performed development-side, and the machine code is shipped to the user.

David Waroff (2023-01-12 21:30:24)
I've rarely heard of bytecoded interpreters being called anything else except in commercial settings.

I think BF is a great choice for illustrating interpretation in general, but it poorly illustrates the advantages of bytecoding, as its operators are already single characters.


Mario Wolczko (2023-01-14 03:41:48)
There are relatively few language implementations which are pure interpreters, by which I mean the source code is interpreted directly. Most implementations first transform the source code into some other representation before interpretation. This transformation has most of the elements of compilation: parsing of source, reporting of syntax errors, some amount of program analysis, and the construction of an alternate representation. This can happen multiple times even for the same source language: e.g., Java is compiled by javac to bytecode; the JVM may then transform the bytecode to some other IR before compiling this IR to machine code.
The point I'm leading to is that I find it helps, when talking about an interpretation or compilation element in a language implementation, to specify the source and target languages of that element. So, in your final example, BF is compiled to an opcode language, which is then interpreted. I think it's reasonably common practice to say that a language is compiled or interpreted by an implementation depending on what the last element of that implementation does, and so I would call your last example an opcode interpreter, and not a compiler, even though there is compilation to opcode. The last element tends to dominate performance.
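
To make the two elements explicit, here is a minimal sketch of such a pipeline in Rust -- my own simplification, not the post's code (the names Op, compile and interpret are mine, and BF's I/O operations '.' and ',' are omitted for brevity): a compilation element that turns BF source into opcodes, followed by an interpretation element that executes those opcodes.

enum Op {
    Add(u8),       // '+' (the count leaves room for run-length folding; this sketch always emits 1)
    Sub(u8),       // '-'
    Left(usize),   // '<'
    Right(usize),  // '>'
    LBrack(usize), // '[', carrying the index of its matching ']'
    RBrack(usize), // ']', carrying the index of its matching '['
}

// The compilation element: parse the source, report bracket errors, and
// build the alternate representation, patching jump targets along the way.
fn compile(src: &str) -> Result<Vec<Op>, String> {
    let mut prog = Vec::new();
    let mut stack = Vec::new();
    for c in src.chars() {
        match c {
            '+' => prog.push(Op::Add(1)),
            '-' => prog.push(Op::Sub(1)),
            '<' => prog.push(Op::Left(1)),
            '>' => prog.push(Op::Right(1)),
            '[' => {
                stack.push(prog.len());
                prog.push(Op::LBrack(0)); // target patched when ']' is seen
            }
            ']' => {
                let open = stack.pop().ok_or("unmatched ]".to_string())?;
                let close = prog.len();
                prog.push(Op::RBrack(open));
                prog[open] = Op::LBrack(close);
            }
            _ => (), // anything else is a comment in BF
        }
    }
    if stack.is_empty() { Ok(prog) } else { Err("unmatched [".into()) }
}

// The interpretation element: execute the opcodes (no tape bounds checks --
// it's a sketch).
fn interpret(prog: &[Op]) -> Vec<u8> {
    let mut cells = vec![0u8; 30_000];
    let (mut ptr, mut i) = (0, 0);
    while i < prog.len() {
        match prog[i] {
            Op::Add(n) => cells[ptr] = cells[ptr].wrapping_add(n),
            Op::Sub(n) => cells[ptr] = cells[ptr].wrapping_sub(n),
            Op::Left(n) => ptr -= n,
            Op::Right(n) => ptr += n,
            Op::LBrack(close) => if cells[ptr] == 0 { i = close },
            Op::RBrack(open) => if cells[ptr] != 0 { i = open },
        }
        i += 1;
    }
    cells
}

fn main() {
    // A tiny program: add four, then loop the cell back down to zero.
    let prog = compile("++++[-]").expect("valid BF");
    let cells = interpret(&prog);
    assert_eq!(cells[0], 0);
}

On the naming convention above, this whole implementation is an opcode interpreter, even though its first element plainly does compilation.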

How to distinguish between a compiler and an interpreter? One way is to look at the time it takes. A compiler should run in time roughly proportional to the size of the source, whereas the runtime of an interpreter is determined by how much work the source program has to do for a given input. For example, suppose the source program reads a number and then has a loop whose trip count is this number. The time it takes to compile this program will be independent of the number, in contrast to the time it takes to interpret the program. The CPU, in effect, is an interpreter of machine code (although many CPUs actually translate fragments of the program from the advertised instruction set to an internal one which is more efficiently interpreted in hardware).
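
A toy sketch of my own (not from the post) that makes this observable: the source below never changes, so the stand-in "compilation" pass over it costs roughly the same whatever the input, while the stand-in "interpretation" -- actually performing the program's work -- costs time proportional to the run-time input n.

use std::hint::black_box;
use std::time::Instant;

fn main() {
    let src = "while n > 0 { n = n - 1 }"; // a fixed-size source program

    for n in [1_000u64, 1_000_000, 1_000_000_000] {
        // Stand-in for compilation: a single pass over the source text.
        let t = Instant::now();
        black_box(src.split_whitespace().count());
        let compile_time = t.elapsed();

        // Stand-in for interpretation: performing the program's work,
        // with black_box stopping the optimiser deleting the loop.
        let t = Instant::now();
        let mut m = black_box(n);
        while m > 0 {
            m -= 1;
        }
        black_box(m);
        let run_time = t.elapsed();

        println!("n = {n}: compile {compile_time:?}, run {run_time:?}");
    }
}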

I spent some time agonizing over these topics for the class on VMs I taught some years ago (www.wolczko.com/CS294). Then I concluded that a key characteristic of interpretation is that, at some granularity (both in program space and in time), an interpreter is context-free -- it does not consider the context of the larger program or execution history when it is executing a given language construct. In contrast, a compiler usually "knows" about the context at least up to the level of the surrounding function/method, and maybe a lot more than that. Some people may disagree with that claim, but I suspect we can all agree that coming up with a crisp definition that distinguishes interpretation from compilation is surprisingly difficult!


Kai Zheng (2023-10-27 11:37:05)
I think "if i + 3 < prog.len()" should be "if i + 2 < prog.len()"

Laurence Tratt (2023-10-27 11:45:43)
@Kai I guess you're referring to:

if i + 3 < prog.len() && matches!(prog[i..i+3], [Ops::LBrack(_), Ops::Sub(1), Ops::RBrack(_)]) {

? in which case I think the +3 is needed to stop an indexing error in prog[i..i+3]. I haven't thought about this in a fair while, so perhaps there's something I'm missing.


Laurence Tratt (2023-10-27 15:45:31)
@Kai Ah, I now see what you mean -- I should have thought about the code rather than just copy and paste it! Yes, you're right: the +3 should be +2.
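
For reference, the corrected guard is:

if i + 2 < prog.len() && matches!(prog[i..i+3], [Ops::LBrack(_), Ops::Sub(1), Ops::RBrack(_)]) {

prog[i..i+3] only needs i + 3 <= prog.len(), which is exactly i + 2 < prog.len(), so the stricter +3 test wrongly skipped a pattern sitting at the very end of a program.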

Kai Zheng (2023-10-28 02:17:30)
@Laurence I was very excited reading this post because I learned a lot from it and got to look at things from a different point of view. I just found this little bug when writing Rust code on my computer. My BF program just contains “++++[-]”; I found the program couldn’t run correctly, so all of this was an accident.

Laurence Tratt (2023-10-28 08:16:29)
@Kai I've now fixed the code both here in the post and in the repository. Thanks for the bug report!