As someone who has from time to time written on parsing, and even published some parsing research, I get asked questions about parsing fairly regularly, some of which I’m even capable of answering.
One thing that I’ve increasingly been made aware of is a widening interest in “fancier” ways of editing text. In essence, most people who’ve been programming for a while come to realise that having an editor that only thinks of a program as a sequence of UTF-8 characters is rather limiting. I suspect the rapid proliferation of the LSP (Language Server Protocol) has opened even more people’s eyes to what the editing of programs could be.
One long-standing approach to better editing is structured editing (sometimes called “projectional editing”). The basic idea is to have an editor which fully understands the syntactic structure of the language you’re editing. This has various benefits: the editor can give instant feedback about what the next thing the user can type is; it makes semantic-based feedback (e.g. about the static typing of a program) much easier; it enables things like “select this function so I can copy it” 100% accurate, instead of using slightly dodgy heuristics; and so on.
To the best of my knowledge, the first structural editors were for Lisp dialects in the late 1960s. The best modern structured editor I’ve used is JetBrains’ MPS. If you haven’t tried using a structured editor, I’d encourage you to do so: it’s a much different experience than using even an LSP-powered editor and can change how you view the editing of programs.
However, there’s a reason why structured editors haven’t taken over the world: most programmers find them so annoying to use in some situations that they don’t view the pros as outweighing the cons. The basic problem is that most of us – even though we tend not to realise we’re doing so – sometimes edit programs in deliberate violation of their syntactic structure.
For example, I often use “square block” editing, where I edit characters at (say) columns 3–5 on all of lines 7–10 in a way that can suddenly turn a syntactically valid program into one with multiple, often bizarre, syntactic invalidities. Many structured editors simply don’t allow this sort of editing at all, because it doesn’t fit into the paradigm. Personally I do this sort of thing so often that if my editor were to forbid me from doing so, I would ditch it for a “more stupid” editor that does allow it — even if that meant foregoing other, useful, cleverness.
I thus must admit to some frustration when I’m pointed at yet another attempt to make structured editing work, that either doesn’t take this problem into account, or does so in an hoc manner. Typically, especially as a tool matures, some non-structural editing is allowed, but often only quite specific kinds, sometimes requiring you telling the editor in advance that you want to temporarily do things it rather you didn’t.
The reason I’m frustrated by this is that there is a published solution that enables us, in many realistic situations, to get all the advantages of structural editing while allowing non-structured editing: incremental parsing. The basic idea of incremental parsing is to allow people to edit programs as if they were sequences of UTF-8 characters but to maintain and update a parse tree in the background. Crucially, that parse tree can be arbitrarily “broken” when the user’s input is not syntactically correct, so users aren’t constrained by the tree’s existence.
Specifically I have pointed many, many people over the years to Tim Wagner’s thesis on incremental parsing. Tim’s thesis is the culmination of many people’s efforts over many years on this subject: it’s not quite the final word on the subject, but it’s close enough to serve as a real basis for practical tooling. My simplified explanation of it is that it contains three main algorithms: an incremental lexing algorithm; an incremental LR parsing algorithm; and an incremental GLR (i.e. all context-free grammars) parsing algorithm. Its algorithms are little short of beautiful: it allows programs to go from syntactically valid to invalid to valid (and so on) in a manner that is efficient, and which makes great sense to the programmer.
If you think that Tim’s thesis looks too academic to be plausible, you might be interested to know that you’re probably already using a real implementation of it: Max Brunsfeld, the dynamo behind Tree-sitter, has been clear about the influence of Tim’s work on Tree-sitter. That’s not to say that there is nothing more to be done in the area of incremental parsing. Lukas Diekmann, who works in the same research group as me, extended and fixed Tim’s algorithms. We later pushed that work even further, showing how it can serve as the basis for editing “composed” (or “polyglot”) programs in a satisfying manner. I’m sure much more can be done in this vein!
I know that some people who’ve worked on structured editing know and understand incremental parsing, and specifically Tim’s work. They might not agree with me on the practical utility of that work, which is totally fine – technically informed disagreement of this kind is part of how our society makes progress.
However, it is now clear to me that there is ongoing work on structured editing which either doesn’t know about incremental parsing in general, or Tim’s algorithms specifically. I hope this post serves as a useful advert to such folk that incremental parsing exists, and Tim’s and Lukas’s theses specifically, are absolutely worth the investment of time in reading.
I would love to see more progress in this area, and the more that those working on it know the breadth of our existing knowledge, the more likely we are to make meaningful progress. I hope to see more such progress soon!