The Evolution of a Research Paper

Podcast: mp3, Opus

As far as I can recall, the first time that I truly tried to read a research paper was when I started my PhD. I don’t remember which paper was the first recipient of my incompetent gaze, but I do remember how completely out of my depth I felt when trying to make sense of it. Over the many subsequent papers I tried during that period, I made very little progress in meaningfully digesting any of what I was reading. I quite often found a fog descending on my brain, a curious mix of tiredness and bewilderment, that left me feeling defeated.

In retrospect, I faced three main issues: I didn’t have a good technique for reading papers [1]; I was trying to read papers that far exceeded my rather narrow technical knowledge; and I wasn’t reliably selecting papers that were well written. To a reasonable extent, sheer repetition tends to fix all of these issues.

However, I eventually realised there was a much deeper issue: I did not understand how papers come into being and certainly had no idea how I should go about writing my own. From my naive perspective, papers seemed to come fully formed as 20-page PDFs, full of dense prose, forbidding formulae, and complex tables. I had little idea how they got to that state and, for whatever reason, nobody I bumped into ever seemed to talk in a meaningful way about the mechanics of writing a paper. Even now, I’m not sure how most people or groups go about writing their papers.

A few years back I wrote about what I’ve learnt about writing research papers, but I suspect that does a better job of explaining the detailed mechanics of writing parts of a paper than it does of giving a sense of the whole. When I recently saw a simple but effective animation of a paper evolving during its writing process, I realised that such an animation might help give a high-level view of a paper as a whole. What I’m going to do in this post is take two papers I’ve been involved with and use similar animations to give a sense of how each paper evolved, as well as some of the human story behind why it evolved the way it did. I’m not going to try and derive grand lessons from these two examples, but I hope that they might help slightly demystify how papers end up looking as they do.

The warmup paper

Let’s start with Virtual Machine Warmup Blows Hot and Cold, which is probably the most demanding paper I’ve ever been involved with. To some extent, one can recover the paper’s story from the repository’s commits, but even that won’t tell you everything. So I’m now going to briefly explain the story of the paper [2] from my perspective [3].

It all started innocently enough in May 2015 when Edd Barrett undertook what I thought would be a “quick” bit of work to see how long it takes programming language VMs (Virtual Machines) such as the JVM to “warmup” (i.e. how long does it take them to reach the “steady state of peak performance”?). A couple of weeks later, Edd showed us some rough data, which suggested that VMs often don’t warm up. This was, to put it mildly, surprising. My immediate worry was that we might have overlooked an important factor (e.g. servers overheating and a resultant throttling of performance) that, if fixed, would make all the odd results disappear. Soon joined by Carl Friedrich Bolz and Sarah Mount, we embarked on a lengthy iterative process, each time trying to think of something we might have missed, or got wrong, addressing that issue, and rerunning the experiment to see if anything meaningful changed.
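
For readers who haven’t seen this kind of experiment before, the core measurement is simple: run a benchmark repeatedly within a single process, time each in-process iteration, and see how the timings change over the course of the run. The following Python fragment is a deliberately minimal sketch of that idea, and nothing more: the real experiment used Krun and controlled for far more confounding factors than this does.

    # Simplified sketch of measuring in-process "warmup": time each
    # iteration of a benchmark and inspect the resulting series. The real
    # experiment (Krun) controls for many more confounding factors.
    import time

    def benchmark():
        # Stand-in workload; a real benchmark would be something substantial.
        return sum(i * i for i in range(100_000))

    timings = []
    for _ in range(2000):
        start = time.perf_counter()
        benchmark()
        timings.append(time.perf_counter() - start)

    # If the VM "warms up", later iterations should settle to a steady state
    # of peak performance; the surprise was how often real VMs don't.
    print("first 5 iterations:", [round(t, 6) for t in timings[:5]])
    print("last 5 iterations: ", [round(t, 6) for t in timings[-5:]])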

By September 2015, the data we were seeing wasn’t fundamentally very different from May, and we were starting to feel more confident that we hadn’t done anything stupid. However, “more confident” is relative: we were still worried that we’d missed something obvious. We thus decided to create a formal draft paper that we could pass around to others for comment.

As soon as we started creating that draft paper, it became clear that while we had lots of data, we had no idea how to process it or present it. Vince Knight, a statistician, started giving us advice [4], soon putting us in touch with Rebecca Killick, an expert in changepoint analysis. Without the use of changepoint analysis, the eventual paper wouldn’t have been worth reading.
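
To give a rough flavour of what changepoint analysis does: given a sequence of per-iteration timings, it identifies the points at which the statistical properties of the sequence shift, allowing a run to be described as a series of segments rather than as a single number. The sketch below uses the Python ruptures library purely as an illustration of that general idea; it is an assumption of mine for this post, not the tooling or the analysis used in the paper.

    # Illustration only: detect shifts in a series of per-iteration timings
    # with the ruptures changepoint library. This is NOT the paper's analysis.
    import numpy as np
    import ruptures as rpt

    # Synthetic timing data: a slow "warmup" phase, an apparent steady
    # state, and then an unexpected later shift in performance.
    rng = np.random.default_rng(42)
    timings = np.concatenate([
        rng.normal(1.5, 0.05, 200),
        rng.normal(1.0, 0.02, 500),
        rng.normal(1.2, 0.02, 300),
    ])

    # PELT searches for an unknown number of changepoints; the penalty
    # trades goodness of fit against the number of changepoints reported.
    algo = rpt.Pelt(model="rbf").fit(timings)
    print(algo.predict(pen=10))  # indices at which the series' behaviour shifts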

However, it soon became apparent that integrating changepoint analysis was not going to be a quick job. Since we still wanted comments on our experimental setup, the first version of the paper thus doesn’t have any statistical summaries at all. Despite that limitation, we got useful, detailed feedback from several people who I admire. We also discovered, quite unexpectedly, that many senior members of the VM community had seen peculiar benchmarking results over the years, and were surprised only by the extent of the oddities we were seeing. That helped convince me that the path we were on was a worthwhile one.

We kept beavering away on the research and the paper throughout 2016, most notably integrating a new way of producing benchmarking statistics with changepoints. By December 2016 we’d produced a second version of the paper, which we submitted to a programming languages research conference. Though the paper was still imperfect, most of its major parts were now in decent shape, and that version bears a clear resemblance to the final version. However, the paper was rejected, and the reviewer comments we received didn’t really give us useful pointers for improvement [5].

Fortunately for us — and despite having worked on the research for 2 years, and the paper for 18 months by this point — we were still thinking of improvements ourselves, though we were now firmly in the territory of diminishing returns. In April 2017 we submitted the third version of the paper to another conference, where it was conditionally accepted, meaning that we needed to address a handful of specific comments in order to convince the reviewers that the paper should be fully accepted. As is common in such cases, addressing those comments wasn’t too difficult (the expectation is that conditionally accepted papers don’t need too much work done on them in order to be fully accepted).

However, one small part of a comment from a reviewer raised a much bigger question in our heads than I think the reviewer could ever have intended. In essence, we wondered what would have happened if we’d not run benchmarks for as long as we did: would shorter runs still have highlighted as many odd things? We put together a partial answer to this question which you can see in Figure 9 of the fourth version of the paper. That version of the paper was enough to make the reviewers happy, and enable the paper to be fully accepted at the conference, but we still felt that we hadn’t addressed this other question adequately, so we kept plugging away.

Amongst the oddities of computing conferences is that they impose a fixed deadline on submission of a final version of a paper, generally around a month after initial notification. We now had such a deadline, but we were also uncovering new challenges in addressing the “what would have happened?” question. I then took my eye off the ball somewhat, and didn’t stop to think that as we improved our answer to the question, we were slowly but steadily increasing the amount of data we needed to process. Suddenly, 3 days before the deadline, a back-of-the-envelope calculation suggested that our server would take about 20 days to process the data. Frankly, we looked to be in trouble, and I remember thinking that there was now a small, but definite, possibility that we would have to retract the paper. Thankfully, we dug ourselves out of the hole we found ourselves in by crudely breaking our Python processing program up into multiple processes and renting a beefy, and expensive, AWS server [6]. It crunched the data in under 24 hours, allowing us to produce the data that ended up as Figure 11 in the fifth version of the paper [7].
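
For the curious, crudely breaking the processing program up into multiple processes amounted to nothing cleverer than farming independent chunks of data out to a pool of workers. The sketch below captures the shape of that approach; the chunks and the per-chunk function are hypothetical stand-ins, and the real program was considerably messier.

    # Minimal sketch of crudely parallelising an embarrassingly parallel
    # data-processing job across OS processes. The chunks and the work done
    # per chunk are hypothetical stand-ins for the real benchmark data.
    from multiprocessing import Pool

    def process_chunk(chunk):
        # Stand-in for the real per-benchmark statistical processing.
        return sum(x * x for x in chunk)

    if __name__ == "__main__":
        chunks = [range(i * 10_000, (i + 1) * 10_000) for i in range(128)]
        with Pool() as pool:  # one worker process per CPU core by default
            results = pool.map(process_chunk, chunks)
        print(sum(results))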

That’s the human side of the story, although a few quick numbers may help round things out. We started the underlying research in May 2015 and the paper in September 2015 — and we kept actively working on both until September 2017. As part of the research we created Krun and the experiment itself. The paper itself has 1001 commits, and we made six releases of the paper on arXiv.

Let’s now look at the animation [8] of this paper evolving (skip to 45s if you want to avoid my ugly mug), with each frame representing the state of the paper at one of its 1001 git commits (with the date and time of each commit shown in the bottom right):

You might find it useful to speed up, or slow down, the animation — in my experience, different speeds tend to highlight different aspects of the paper’s evolution. A few further details might help you make sense of the animation. Commits that have a git tag associated with them (representing “releases” of some sort or other) pause the animation for a couple of seconds, showing, in red, the tag name alongside the time (e.g. at 1:50 [9]). Within the paper itself, text in red represents a comment, or question, from one author to another. Not every version of the paper is perfectly rendered, because there is an element of bit rot. For example, while TeX is famously stable, LaTeX, and its distributions, aren’t — packages come and go, and backwards compatibility is hit and miss. We also didn’t write the paper in the expectation that anyone would try to build old versions. However, other than the bibliography sometimes disappearing in the first quarter of the animation, the imperfections seem minor.

A few things that the animation highlights are worth looking at in detail. The paper starts off as a literal skeleton [10]: a title, some authors (and not the final set of authors), a couple of section headings (but little text), and a single citation of another paper (just to make sure the bibliography builds). It then goes through a rapid period of expansion as draft text representing the state of underlying research at that point is added.

Fairly quickly, one sees the first big chunks of red text [11], which is generally one of us noticing that something is either incomplete or nonsensical. This is the first time that one can start to tease apart the relationship between a paper and the research underlying it: for me, a paper is the condensed version of the understanding generated from the underlying research. In my experience, the process of “turning” research into a paper regularly uncovers gaps in the underlying research that we had either not noticed or sometimes actively ignored and which need fixing. This, to my mind, is a virtuous cycle: writing the paper highlights weaknesses in the underlying research that, when fixed, improve the paper.

The periods of expansion sometimes involve a number of commits, but typically last only a few days. Such periods are followed by much lengthier periods of consolidation, which can sometimes last months. Many of these periods of consolidation are where gaps in the research are slowly filled in, though a few exist simply because everyone involved was busy elsewhere! The remainder are those periods leading up to a release of the paper. Some are explicitly “draft” releases, some are submissions to venues (in computing, currently, these mostly take the form of conferences), and there’s generally a “final” release [12]. You’ll notice occasional major changes in the “look” of the paper where a change in the venue we are targeting requires us to change to that venue’s formatting requirements (e.g. at 1:20: before [13] and after [14]).

The final thing the animation makes clear is that, as time goes on, the periods of expansion become shorter, and less frequent, and the periods of consolidation become not just longer, but involve less profound changes. The amount of red text in each commit becomes smaller as the gaps in the research that we notice become smaller. Nevertheless, even at 3 minutes into the animation – fairly near the end of the paper’s evolution – there are noticeable additions to it.

The error recovery paper

Of course, everything that I’ve written above relates to a single paper, which may or may not be representative of any other paper. It’s thus worth having a look at a second paper, both as a sanity check, and also to see if anything else interesting arises. In this case I’m going to look at Don’t Panic! Better, Fewer, Syntax Errors for LR Parsers, a paper which shows how one can add practical, automatic, syntax error recovery to LR grammars. It is not my intention to expatiate with the same minuteness on the whole series of its history, but there is one thing that I want to tease out. Let’s start with the animation (which starts at 40s):

This paper wasn’t a small effort — its writing still spanned two and a half years — but it wasn’t on the same scale as the warmup paper. It has one third as many commits and a smaller, though not exactly tiny, experiment. Unlike the warmup work, however, it led to software (grmtools) which non-researchers might be interested in using [15].

It ran through a similar, though slightly different, evolution to the warmup paper. I ran a private early draft past a couple of people whose work in the area I respected before the first “public” release on arXiv. It was rejected at a conference and a journal before being accepted at a conference.

I think the most interesting thing about this paper is something you can’t see in its final version, which contains a single new algorithm ‘CPCT+’. However, in the video you’ll see at around 1:55 a major reformatting (before [16] and after [17]) almost immediately followed by a significant reduction in the paper’s contents (before [18] and after [19]). If you look at the second version of the paper, just before this reduction happens, you’ll see that the paper contained a second algorithm ‘MF’.

MF wasn’t removed because of page constraints [20], nor because of reviewer comments: I took the decision to remove it because I thought it would make the paper better. MF is in most respects a superior algorithm to CPCT+: MF runs faster, and fails less often. However, CPCT+ already runs quite fast and doesn’t fail very often, so although MF is demonstrably better there isn’t much scope for it to be much better.

MF is, however, much more complex than CPCT+. I don’t have exact figures, but it wouldn’t surprise me if I spent ten times as long on MF as I did on CPCT+. Perhaps because of that, I was beguiled by MF and my part in it, and failed to stop and think “is this complexity worth it?” Eventually that question became difficult for me to ignore. If I had to guess, I reckon it might take someone else at least five times longer to integrate MF into their parsing system than CPCT+. Because MF’s performance gains over CPCT+ are rather modest, this seems a poor trade-off.

Thus, reluctantly, I had to conclude that by including MF, I was demanding more from every reader than I was giving back. Put another way, if I’d left MF in, I’d have been prioritising my ego over the reader’s time: that might be a good trade-off for me, but not for the community overall. Once I’d reached that decision, deleting the text (and the code) was inevitable. That doesn’t mean that deleting it was easy: I was effectively writing off 3–6 months’ work of my own time. Although that sounds like a lot of time, it’s not unique: in my experience it’s common for weeks of work on the underlying research either not to make it into the resulting paper at all, or at most to form part of a single sentence.

Summary

My aim in this entry hasn’t been to derive general lessons that you, or even I, can apply to other papers: each research project, and each paper, is a bespoke effort [21]. However, I think that a few common themes are worthy of note.

First, there’s a difference between the underlying research and a paper representing that research: by definition the latter can only contain a subset of the former (see e.g. the deletion of MF).

Second, the act of writing a paper frequently uncovers holes in the underlying research that need to be filled in.

Third, writing a paper is a slog: in my opinion, every sentence, every paragraph, and every figure is important, and getting them all into shape is no small task.

Fourth, writing up research and releasing papers is important: I’ve lost track of how many times I’ve heard of interesting systems that weren’t written up and which no longer run and thus are lost to us. To me, writing a paper isn’t about gaining another line on my CV: it’s about clearly and accurately articulating the underlying research, warts and all, to the best of my ability, so that other people can understand what’s been done and decide whether, and how, to build upon it.

I started this entry off by noting how difficult I used to find it to read papers. I don’t suppose anything I’ve written in this entry helps make the task of reading a paper easier, but perhaps it has at least given you a little insight into the process by which a paper is produced.

Acknowledgements: Thanks to Edd Barrett, Carl Friedrich Bolz-Tereick, Aleks Bonin, Martin Chapman, Lukas Diekmann, Jake Hughes, and Dan Luu for comments.


Footnotes

[1]

It took me many years to realise that there are different approaches to reading a paper, that some approaches are meaningfully better than others, and that different people respond better to some approaches than others. These days, my general approach is to try reading a paper as quickly as possible: I try not to linger on parts that I don’t understand, because I’ve learnt from experience that confusion is often resolved by a later part of the paper. Once I’ve finished, I take a short period to digest what I’ve read and then, if necessary, I’ll go back and do another read through.

[2]

I’ve talked a little bit about the story behind the paper before, but not in great detail.

[3]

As with pretty much any human activity, different co-authors on a paper will nearly always have a slightly different perspective on what happened, because none of us can fully understand everything that other people went through.

[4]

Vince later determined that his area of statistics wasn’t quite the tool we needed, and put us in touch with Rebecca. That was already a generous act, but he topped it a while later by asking that we remove him as an author because he felt he hadn’t contributed enough to the paper to be deserving of authorship. I tried to convince him that he had already done more than enough to deserve a credit, but he politely stuck to his guns. I can sometimes lazily descend into the mire of cynicism, but when I see someone acting upon such deeply held morals, it reminds me how many good people there are in the world.

[5]

Sometimes when a paper is rejected, one receives useful comments that help to improve the paper. I try not to complain about rejections, because it is almost never useful or edifying. However, I’m prepared to make an exception in this case, because, even 3 years later, I still can’t believe the following comments we received:

The unstated inference that extreme lengths are necessary in order to gain confidence in an empirical results is misplaced and counter-productive.

and:

Second, your paper overlooks the economics of research. The researcher’s job is _not_ to produce the most error-free result. Their job is _produce the most insightful research given the resources available_

I don’t think I’ve ever been scolded, before or since, for putting too much work into my research!

[6]

From memory, it had 32 cores and 1TiB RAM — still, I think, the beefiest computer I’ve ever used.

[7]

We then had to fix a single typo to produce the sixth, and final, version.

[8]

This took a mere six and three quarter hours to generate!

[12]

Personally I hope that papers are “maintained”, at least for a while, in the same way that software is maintained. If someone points out a major flaw in a paper I’ve written, I hope I’ll go back and make a new version of the paper. There are limits to this: at some point, I will have forgotten most of what I once knew, and bit rot can make fixing things impractical. However, currently, there isn’t really a culture of reporting problems on papers, so I haven’t put this good intention into practice: until then, I’m not really sure how long I’m willing or able to maintain a paper for.

[15]

It also caused us to produce a number of stand-alone Rust libraries: cactus, packedvec, sparsevec, and vob. I’d like to particularly thank Gabriela Moldovan for her work on packedvec. Embarrassingly, as I wrote this footnote, I noticed that we failed to credit her in the paper. Sorry Gabi!

[20]

Even though computing conferences abandoned printed proceedings several years back, they still impose stringent page limits on papers. This has two unfortunate effects: work that naturally needs more space than is allowed is chopped down until it does, compromising legibility; and work that needs less space is padded out to make it look more impressive, compromising legibility in a different way.

[21]

Or, at least, they should be a bespoke effort. Alas, there is a problem in academia with “salami slicing”, where what would best be presented in a single paper is instead split up into multiple papers. This is detrimental to the community as a whole but is done because the authors involved believe it will help their careers.
