A Week of Bug Reporting

If you use software regularly, and have your wits about you, you will often realise that you’ve encountered a bug — in other words, that the software has done (or not) something that it shouldn’t (or should have). When this happens, the vast majority of people moan to the nearest bystander about how incompetent the people behind the software are and then carry on with whatever they were doing — often stumbling into the same bug repeatedly. Only a minority of people think of reporting bugs and, if they do think of it, they often fail to do so either because they think it’s too time-consuming or that they might end up looking silly.

It’s unfortunate that so few people report bugs. First, surprisingly few bugs affect a single person: by the time someone does report a bug, it can have affected hundreds or thousands of other people. We can’t all be freeloaders — someone has to take the plunge and report the bug. Second, bugs frequently appear in contexts unfamiliar to the software’s developers [1]. Most obviously, we all have slightly different software installations, and those variations can cause odd effects. Less obviously, users often use software in ways unimagined by its creators, such that some previously untested combinations of operations work incorrectly [2]. Either way, we cannot assume that the software’s developers will encounter the same bugs that we do: in many cases, if we don’t report bugs, the developers simply won’t know they exist.

Reporting bugs in someone else’s software is different from finding and fixing bugs in one’s own code, because we understand those other systems much less well. The sense of confusion I have with bugs in my own code is magnified when using other systems. I’m less confident that I’m actually seeing unexpected behaviour; less sure what the cause might be; and have lower trust than normal in any fix I propose. That means that we must acknowledge an important trade-off: a full understanding of the cause of a bug in an unfamiliar system is often beyond any reasonable investment of time; but if we just say “it doesn’t work”, no-one can be expected to replicate, let alone fix, the bug. Somewhere in the middle is a sweet-spot where the bug reporter invests the minimum amount of time to generate a good enough bug report that a developer can use to fix the bug.

While I wouldn’t exactly say that reporting bugs gets me out of bed each morning, I do feel something of a duty to do so. I have benefited hugely from the existence of open-source software, and reporting bugs is one of my ways of contributing a little back. Now that I’ve been doing so for years, reporting bugs holds many fewer fears, in terms of time or reputation, than it once did.

In this post I’m going to look at the bugs I reported in the first week of August on various pieces of open-source software. This is hardly a normal week, because I started it with a new laptop, which pretty much guaranteed that I’d hit more bugs than normal — but, as you’ll soon see, that was not the only source of bugs I encountered. Some of my reports have led to bugs being fixed, some haven’t. While I obviously hope for the former outcome, open-source developers aren’t under any obligation to investigate, let alone fix, my bug reports.

I hope it’s useful to see what happens when someone like me, who sometimes, but definitely not always, knows some details about the software I’m using, runs into what I believe are bugs. I’ve tried to record as much of the confusion that I went through as I can remember, because that seems an inevitable part of the process. Unsurprisingly, a lot of what I’ve written relates to my attempts to understand the effects of a bug and investigate their cause. I hope you also get a sense of how much my interactions with developers played a role in my and their understanding of a bug. Not only did my bug reports vary substantially in quality, but I was quite often wrong in my initial guess of the cause of a bug or what a good fix might be.

I’m going to present these bugs in roughly the order that I encountered them, because as you’ll see, there are more overlaps than might otherwise be apparent.

A new laptop

After 5 years of use, my old Thinkpad had reached the end of its useful life. The battery life had declined by around half, and more than once I ran out of power at an inconvenient time, damaging the battery further. Furthermore, its 2 physical cores were no longer up to the demands modern software places on them, causing me to twiddle my thumbs increasingly often. I’d been using Thinkpads for 19 years [3] and until recently assumed I would continue doing so. However, Framework’s modular approach to laptops, which seems to me a more environmentally sustainable approach to hardware, appeared to be both equivalently good, and more worthy of my support, so I ordered one with my hard-earned cash.

I knew that the first generation Framework was supported under OpenBSD (my operating system of choice) and crossed my fingers that the newly released second generation (which I’d bought) would work too. However, I was aware from experience that newly released hardware tends not to fully work on OpenBSD for a little while.

Video and keyboard oddities

Once I’d turned off “secure boot” in the BIOS, the OpenBSD installer booted and functioned correctly. However, the installer is text only and places relatively few demands on hardware — I knew that the real test would happen when I rebooted into the “full” installation. Fortunately, that seemed to go fairly well, and I started installing all the packages I’m used to.

I soon realised that the keys I was pressing on the UK keyboard I’d ordered weren’t always corresponding with what I was seeing on screen. I don’t remember that ever happening with OpenBSD before, so it took me a little while to realise that wsconsctl tells me the keyboard encoding. On the Framework I saw:

$ doas wsconsctl keyboard.encoding
keyboard.encoding=unknown_0

On my desktop, in contrast, keyboard.encoding=uk. I then tried to force the keyboard encoding to UK and immediately hit a kernel panic:

$ doas wsconsctl keyboard.encoding=uk
uvm_fault(0xfffffd87b4e6a780, 0x8, 0, 1) -> e
kernel: page fault trap, code=0
Stopped at      wskbd_load_keymap+0x33: movl    0x8(%rax),%ecx
...

That’s not good news: it means the kernel has crashed. I was rather surprised, so rebooted and hit a kernel panic before I’d even typed anything in. I rebooted again, got to a terminal prompt and tried setting the keyboard encoding again — immediately causing another crash. I rebooted yet again, and then hit another kernel panic after a few seconds of usage (without so much as trying to set the keyboard encoding).

At this point I was a bit bamboozled, as I seemed to be hitting at least two separate bugs. Setting a keyboard encoding reliably caused an immediate kernel panic. But I was hitting kernel panics for some other reason that I couldn’t fathom. I then deliberately caused another panic after setting the keyboard encoding, took a photo of the kernel’s output, and filed a bug report to OpenBSD’s bugs mailing list, the main body of which is:

I’ve just got my sticky mits on a Framework 12th Gen Intel laptop. It somewhat works, but regularly hits kernel panics and other oddities on -current. I haven’t worked out why yet, but uvm_faults seem to be frequent. For example setting the keyboard encoding causes a uvm_fault.

$ doas wsconsctl keyboard.encoding=uk
uvm_fault(0xfffffd87b4e6a780, 0x8, 0, 1) -> e

Screenshot of this particular panic at https://postimg.cc/nM5rLGY6

I suspect that it’s simply that some hardware isn’t yet fully supported. There are a few more “not configured”s than normal in the dmesg and it even contains this, which is a new one on me (presumably PCI related?):

0:31:5: mem address conflict 0xfe010000/0x1000

It’s a deliberately short bug report. First, since I was quite possibly the first person to try OpenBSD on a second generation Framework, I knew that I shouldn’t expect things to work perfectly. Second, I noted down the one way that I could reliably cause a panic (setting a keyboard encoding) while noting vaguely that there seemed to be other causes too. I made a possible suggestion for the latter (which later turned out to be irrelevant). But I then waited to see what the OpenBSD developers said, since they might have context about the likely status of OpenBSD on the Framework that I lacked.

Within three hours, Jonathan Gray, a prolific OpenBSD developer, had replied, though he wasn’t able to immediately solve the panics I was seeing.

In the meantime, by continually rebooting, I’d started to spot an intermittent correlation between graphics corruption and panics, which I used to update my report:

I’m also starting to understand a bit more about some of the random panics: they seem to happen very soon after X starts. Sometimes the mouse appears as a two inch square of weird colours (!) – things only last for a few seconds after that. Only once have I got a visible panic out of this which said solely:

uvm_fault(0xffffffff822f0400, 0xfff80001660014, 0, 1) -> e
drm: Global state not read locked
drm: Global state not read locked

Jonathan again quickly replied with a suggested fix. I applied the fix (which requires some basic knowledge about how to build OpenBSD from source), and decided that if I couldn’t trigger panics after 10 boots, I’d consider the problem probably fixed. 10 successful reboots later I was able to report:

A kernel with this patch has now survived 10 reboots (previously I was getting a panic perhaps 1 time in 3), so I think it’s quite likely that the effect is real. Thanks!

The fix was duly committed to OpenBSD less than 24 hours after my initial bug report!
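
As a rough sanity check on the 10-reboot threshold, here is the arithmetic under one simplifying assumption that was not measured: that each boot of the unpatched kernel panicked independently with probability of about 1 in 3.

```python
# Assumption: each boot of the unpatched kernel panics independently
# with probability 1/3 (the rough rate quoted in the report above).
panic_rate = 1 / 3
boots = 10

# Chance that an unfixed kernel survives all 10 boots purely by luck.
p_all_clean = (1 - panic_rate) ** boots
print(f"{p_all_clean:.3f}")  # prints 0.017
```

Under that assumption, 10 clean boots would happen by chance less than 2% of the time, which is why "quite likely that the effect is real" seems a fair summary rather than a proof.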

Unfortunately the keyboard encoding problem has turned out to be rather harder to fix. Miod Vallat, another frequent OpenBSD developer, suggested a mechanism for preventing the keyboard encoding panic but noted that the problem should never have gotten that far. A separate developer sent me, via private email, a suggested debugging patch, but it didn’t show us anything useful. I then spent 2 or 3 hours looking into how the OpenBSD kernel deals with keyboard encodings. I was surprised at how complex this is! The kernel has to use heuristics dealing with a variety of ropey hardware, and has to support USB keyboards at a point earlier in kernel setup than is ideal. Even if I am capable of understanding what’s going on here (which isn’t a given!), it seems likely that it would take me a long time to make a meaningful contribution. To my surprise, another OpenBSD user with the same laptop, but a US keyboard, has encountered no problems with keyboard encodings. Could there be an issue with the Framework UK keyboard or BIOS? It’s possible though probably not hugely likely. Either way, at the time of writing, the keyboard encoding is not detected automatically, and trying to set it manually still causes a kernel panic.

What’s interesting about these two bugs is that I had assumed the keyboard encoding panic would be easy to fix, and the other panics would be hard to fix. Instead the opposite has turned out to be true! This is no reflection on any of the developers involved; it’s hard for those of us who only roughly understand a piece of software to correctly guess how difficult it will be to identify the cause and fix for any given bug.

Clickpad misidentification

As part of my report on video oddities, Jonathan extended the OpenBSD kernel to identify some of the Framework’s previously unencountered hardware (probably in this patch). That changed the way the kernel identifies the trackpad, from a clickpad to a touchpad (to my surprise, the relevant standards differentiate between “touchpads” and “clickpads”). Suddenly the right-hand mouse button on what I call a trackpad (but the relevant standard calls a “clickpad”) stopped working.

I emailed a developer who I thought might know something about this, but after not hearing anything, decided to investigate myself. I have made a few minor device contributions to the OpenBSD kernel over the years, so I thought I might be able to at least work out where the problem is. A quick grep showed that the following piece of code (in hidmt.c) was where clickpads and touchpads are differentiated:

/* find whether this is a clickpad or not */
if (hid_locate(desc, dlen, HID_USAGE2(HUP_DIGITIZERS, HUD_BUTTON_TYPE),
    mt->sc_rep_cap, hid_feature, &cap, NULL)) {
        d = hid_get_udata(rep, capsize, &cap);
        mt->sc_clickpad = (d == 0);
} else {
        /* if there's not a 2nd button, this is probably a clickpad */
        if (!hid_locate(desc, dlen, HID_USAGE2(HUP_BUTTON, 2),
            mt->sc_rep_input, hid_input, &cap, NULL))
                mt->sc_clickpad = 1;
}

Sure enough, if I forced mt->sc_clickpad=1 I got back the old behaviour. My first assumption was that the heuristic might be wrong, so I started turning DEBUG variables on in various kernel files. This caused some comedy when I turned USBHID_DEBUG on in hid.c — it caused so much output to the console that after 15 minutes the machine still hadn’t finished booting!

Since that route didn’t work, I added a few printfs to the if/else above and reran the kernel: that showed that the else branch was being executed. How many buttons does my clickpad have? By copying and pasting some code from hidms.c, I discovered that it’s reporting 4 buttons, which isn’t what I was expecting at all, and which defeats the “if there’s not a 2nd button” heuristic above.
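
To make the failure mode concrete, here is a simplified Python model of the heuristic quoted above. It is only a sketch of the decision logic, not the kernel code: the function name and its two parameters are made up for illustration, and it assumes the clickpad flag starts out as 0 (false) when neither branch sets it.

```python
from typing import Optional


def looks_like_clickpad(button_type: Optional[int], has_second_button: bool) -> bool:
    """Simplified model of the clickpad/touchpad heuristic (hypothetical helper).

    button_type: value of the HID "button type" feature, or None when the
                 device does not expose that feature at all.
    has_second_button: whether the device declares a 2nd button.
    """
    if button_type is not None:
        # Feature present: a value of 0 is treated as a clickpad.
        return button_type == 0
    # Feature absent: fall back to "no 2nd button means clickpad".
    return not has_second_button


# The Framework's device took the fallback branch and declares 4 buttons,
# so it certainly has a 2nd button and the model classifies it as a touchpad.
framework_result = looks_like_clickpad(button_type=None, has_second_button=True)
print(framework_result)  # False
```

A device that honestly declared a single button would come out as a clickpad on the same branch, which is why the over-reporting of buttons matters.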

I then spent an hour flailing around for possible solutions, until I found this pull request against libinput. This showed me that, on Linux, the touchpad on the Framework – which I now realise is a PixArt Imaging PIXA3854 – has what kernel drivers call a “quirk” (i.e. a special case for that device). By this point I was probably about 2 hours into the problem.

That suggested to me that I would need to add a similar quirk to OpenBSD. It then took me another 2 hours or so to work out if, and where, the vendor and product ID of an IMT device might be stored in the OpenBSD kernel. I kept going around in circles [4], before realising that though hidmt.c performed the clickpad/touchpad test, imt.c seemed the right place to check vendor and product IDs. This was what I ended up with:

struct i2c_hid_desc *hid_desc = &sc->sc_hdev.sc_parent->hid_desc;
if (hid_desc->wVendorID == 0x093A && hid_desc->wProductID == 0x274)
  // ...at this point we have a PIXA3854
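
The general shape of a device quirk can be sketched in a few lines of Python. This is an illustration of the idea only: the table and function names are hypothetical, and the only values taken from the text are the vendor and product IDs.

```python
# Hypothetical quirk table: (vendor ID, product ID) pairs that should be
# treated as clickpads regardless of what the generic heuristic decides.
CLICKPAD_QUIRKS = {
    (0x093A, 0x0274),  # PixArt PIXA3854, IDs as given in the snippet above
}


def is_clickpad(vendor_id: int, product_id: int, heuristic_says_clickpad: bool) -> bool:
    """Apply device-specific overrides before trusting the heuristic."""
    if (vendor_id, product_id) in CLICKPAD_QUIRKS:
        return True
    return heuristic_says_clickpad


# The heuristic says "touchpad" (False), but the quirk overrides it.
print(is_clickpad(0x093A, 0x274, heuristic_says_clickpad=False))  # True
```

The trade-off is the usual one with quirks: the override is precise and low-risk for other hardware, but every similarly misbehaving device needs its own entry.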

Once that was done, the rest fell into place quickly and, after a little bit of testing and cleaning up, I sent an email plus patch to OpenBSD’s tech mailing list:

The Framework clickpad (a PixArt PIXA3854) announces that it has 4 buttons which defeats the normal heuristic of “2 or more buttons means it’s a touchpad”. When it’s identified as a touchpad, right hand mouse clicks don’t work (apart from that, I can’t tell any difference between clickpad and touchpad in operation!). Linux/libinput also have a quirk for this device [1], although they simply disable the second button.

The patch at the end of this mail adds a quirk that detects the PIXA3854 and forces it to be identified as a clickpad. To say that I am unfamiliar with these parts of the kernel is an understatement: this patch works for me, but I don’t know whether it’s the best, or even an acceptable, way of dealing with “right hand mouse clicks don’t work” on this particular device!

The first paragraph clearly outlines the bug [5], and how other people have fixed it. The second paragraph makes clear that, although my fix works for me, I didn’t have total confidence that it was the best (or, possibly, even a correct) way to fix the problem.

And then, of course, I realised that I had jumped into this problem so quickly that while I had checked OpenBSD’s bugs mailing list to see if anyone had encountered this problem, I hadn’t checked the tech mailing list itself. Sure enough, Joshua Stein (who got OpenBSD working well on the first generation Framework), had encountered the same problem and proposed a fix too! Admittedly, that fix doesn’t seem to have been accepted, and indeed the last message seems to suggest that a fix along the lines of what I proposed was probably better. But, still, what an idiot I am — it was only by pure luck that the 4 or 5 hours it took to create my patch weren’t entirely wasted! I forgot the first law of bug fixing: check carefully to see if someone else has stumbled across, and maybe even fixed, the problem. I had checked — but too quickly.

After I posted my patch, I was contacted off-list by someone familiar with this part of OpenBSD’s kernel. After careful reading of the relevant specification he had come up with a more general patch which avoided the device-based quirk I had created. I soon confirmed that his patch worked for me and, to the extent I can tell, looked like it would not break existing devices. At the time of writing neither of our patches has been included in OpenBSD, though I hope that one will be soon.

A broken iCalendar file

While I was setting up the Framework, I installed Thunderbird, which I use solely as my calendar app (I use neomutt for email), and imported the two iCalendar (.ics) files I use. The small calendar from my council (telling me when to put out my recycling bins) imported fine. However, my main calendar file, whose 3600 entries span over a decade, didn’t work: importing did not raise an error but neither were any entries displayed. I tried a couple of times and then put it to one side, thinking I’d try again another day.

A day or two later, I happened to upgrade the packages on my desktop OpenBSD machine and quickly realised that my main calendar wasn’t displaying there either. A quick investigation showed that Thunderbird had been upgraded from version 91.11.0 to 102.1.0. I wondered if my calendar file had been corrupted in the upgrade, but a quick check of its git history showed nothing obvious.

I then guessed that Thunderbird had become pickier about ingesting iCalendar files. I tried looking for a “convert iCalendar to iCalendar” program, hoping that it might “upgrade” my file. I soon stumbled across the Python icalendar library. However that too failed to import my file:

Traceback (most recent call last):
  File "/tmp/conv.py", line 4, in <module>
    Calendar.from_ical(d)
  File "/usr/local/lib/python3.9/site-packages/icalendar/cal.py", line 383, in from_ical
    vals = factory(factory.from_ical(vals))
  File "/usr/local/lib/python3.9/site-packages/icalendar/prop.py", line 525, in __init__
    tzid = tzid_from_dt(start)
  File "/usr/local/lib/python3.9/site-packages/icalendar/parser.py", line 54, in tzid_from_dt
    if hasattr(dt.tzinfo, 'zone'):
AttributeError: 'datetime.date' object has no attribute 'tzinfo'

I then looked at the Rust icalendar library, which imported my file, but which (from memory) produced output identical to the input, so Thunderbird still couldn’t import the result.

I started to become a bit nervous: my calendar is rather important to how I go about my life! I then tried using Orage, which can do iCalendar conversions, but it objected to all sorts of things in my calendar. I then tried kOrganizer, a program I used for many years [6], until it started crashing on me: unfortunately it segfaulted when trying to import my calendar. I then tried GNOME Calendar which not only imported my calendar correctly but did so stunningly fast — I thought I might have a winner! I then spent a little bit of time exploring GNOME Calendar before realising that I didn’t know how to get it to raise notifications in XFCE (I need multiple warnings if I want to make appointments!). I also wasn’t sure how to make appointments in different timezones (I’m sure that it has this functionality, but I didn’t immediately stumble across it).

At this point I thought I should perhaps look back at Python’s icalendar library. I started poking around in its innards, putting print statements around the lines in the backtrace I saw. Soon I noticed the following comment in the code:

# does not support different timezones for start and end

That gave me a clue and I put a print statement of the date at that point. Just before the backtrace that printed a single line:

2022-06-06
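
That printed value has no time component, which fits the AttributeError in the traceback: in Python's standard library a date-only object simply has no tzinfo attribute, whereas a datetime does. The snippet below uses only the standard datetime module; the variable names are just for illustration.

```python
from datetime import date, datetime

all_day = date(2022, 6, 6)            # a date-only value, printed as 2022-06-06
timed = datetime(2022, 6, 6, 14, 30)  # a date with a time component

print(hasattr(timed, "tzinfo"))    # True (its value is None until a zone is set)
print(hasattr(all_day, "tzinfo"))  # False

try:
    all_day.tzinfo
    message = ""
except AttributeError as exc:
    message = str(exc)
print(message)
```

So any code that reads .tzinfo without first checking whether it was handed a date or a datetime will fail in this way on an all-day value.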

Since iCalendar files are plain text, it only took me a moment to work out which entry this corresponded to (with some minor details elided for privacy reasons):

BEGIN:VEVENT
UID:4ABCDEF
DTSTAMP:20220531T190736Z
DTSTART;VALUE=DATE:20220606
DTEND;VALUE=DATE:20220608
SUMMARY:Blah blah.
FREEBUSY;FBTYPE=FREE:20220606/20220608
TRANSP:TRANSPARENT
END:VEVENT

Removing that single entry allowed Thunderbird to import my entire calendar file! A quick search through my email showed me that this iCalendar entry had been sent to me when I’d booked a hotel room. An old version of Thunderbird had happily imported this and then copied it into my main calendar file, but newer Thunderbird could not import the entry (either on its own, or as part of my main calendar).

At this point, I had my culprit — but why was it causing problems? Since I didn’t know anything about the format of an iCalendar file, I looked at a neighbouring entry (again with minor details elided for privacy reasons):

BEGIN:VEVENT
CREATED:20220513T220740Z
LAST-MODIFIED:20220513T220751Z
DTSTAMP:20220513T220751Z
UID:5BCDEG
SUMMARY:Blah blah
DTSTART;TZID=Europe/London:20220606T143000
DTEND;TZID=Europe/London:20220606T180000
TRANSP:OPAQUE
END:VEVENT

Between the comment in the icalendar code and the neighbouring entry, the answer jumped out at me: the faulty entry did not include timezone information in its DTSTART entry!

It then took me a little while to work out where I might report bugs in Thunderbird since, it turns out, I was really using a separate component called “Calendar”. Eventually I filed the following report:

Thunderbird used to be able to import ics files whose DTSTART did not have a timezone. Having just upgraded to v102 it is unable to import them. Here’s an example I received from a booking site a couple of months ago:

BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Avvio//NONSGML Convert//EN
BEGIN:VEVENT
UID:4ABCDEF
DTSTAMP:20220531T190736Z
DTSTART;VALUE=DATE:20220606
DTEND;VALUE=DATE:20220608
SUMMARY:Blah blah.
FREEBUSY;FBTYPE=FREE:20220606/20220608
TRANSP:TRANSPARENT
END:VEVENT
END:VCALENDAR

Interestingly, not only did Thunderbird used to be able to import them, but it then copied them into my “main” calendar without adding a timezone. So I had a “merged” calendar which Thunderbird v102 refused to load. However, Thunderbird gave no indication as to why it wouldn’t load – there were just no calendar entries. I eventually managed to track down the offending error because py-icalendar couldn’t parse that DTSTART value either!

Expected results:

IMHO the original ics file is incorrect, but since Thunderbird previously accepted it, there will be other people like me with an ics file that they assume is created by Thunderbird, but which has imported invalid value.

Ideally Thunderbird would at least explicitly complain to the user if it encounters an ics file it can’t parse. In this (admittedly unusual) case of a missing timezone it could also either ask the user what timezone they want or (though this has obvious problems!) resort to the local timezone.

The problem with importing this file was soon confirmed by other people, and I felt happy knowing that I’d reported a bug with a simple test case that allows more knowledgeable people to work out if/how they might fix things.

However, a few days later I could not dispel a nagging thought: was I sure that the missing timezone information was the culprit? The evidence I’d used felt, in retrospect, rather circumstantial. I looked through my main iCalendar file again and, sure enough, there are many other entries whose DTSTART entry doesn’t have a timezone. Clearly I had been too hasty in my initial assessment!
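
A check along these lines can be automated with a rough scan over the raw text. The sketch below is not a real iCalendar parser: the function name is made up, it ignores folded lines, and it only looks inside VEVENT blocks so that DTSTART lines in timezone definitions are not counted.

```python
def count_dtstart_without_tzid(ics_text):
    """Return (without_tzid, total) for DTSTART lines inside VEVENT blocks."""
    without = total = 0
    in_event = False
    for line in ics_text.splitlines():
        upper = line.strip().upper()
        if upper == "BEGIN:VEVENT":
            in_event = True
        elif upper == "END:VEVENT":
            in_event = False
        elif in_event:
            # Everything before the first ':' is the name plus parameters.
            head = upper.split(":", 1)[0]
            if head.split(";")[0] == "DTSTART":
                total += 1
                if "TZID=" not in head:
                    without += 1
    return without, total


sample = """BEGIN:VEVENT
DTSTART;VALUE=DATE:20220606
END:VEVENT
BEGIN:VEVENT
DTSTART;TZID=Europe/London:20220606T143000
END:VEVENT
"""
print(count_dtstart_without_tzid(sample))  # (1, 2)
```

If many entries lack a TZID and the calendar used to load, a missing timezone on its own is unlikely to be the culprit.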

I then did what I should have done in the first place, and started chopping lines out of the input and seeing if Thunderbird could import the result or not. Very soon I realised that the problem was actually the FREEBUSY line [7]: removing that line allows Thunderbird to import the entry.
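
Chopping lines out by hand can also be expressed as a small greedy loop. In the sketch below the still_fails callback stands in for "Thunderbird refuses to import this"; the stand-in used here is invented purely so the example runs, and says nothing about how Thunderbird actually decides.

```python
def chop_lines(lines, still_fails):
    """Greedily drop lines for as long as the failure still reproduces."""
    kept = list(lines)
    i = 0
    while i < len(kept):
        candidate = kept[:i] + kept[i + 1:]
        if still_fails(candidate):
            kept = candidate  # line i was not needed to reproduce the bug
        else:
            i += 1            # line i matters, so keep it and move on
    return kept


def stand_in_import_fails(lines):
    # Invented stand-in: "fails" when a structurally complete event
    # still contains a FREEBUSY line.
    return (
        "BEGIN:VEVENT" in lines
        and "END:VEVENT" in lines
        and any(line.startswith("FREEBUSY") for line in lines)
    )


entry = [
    "BEGIN:VEVENT",
    "UID:4ABCDEF",
    "DTSTART;VALUE=DATE:20220606",
    "DTEND;VALUE=DATE:20220608",
    "FREEBUSY;FBTYPE=FREE:20220606/20220608",
    "END:VEVENT",
]
minimal = chop_lines(entry, stand_in_import_fails)
print(minimal)
```

With this stand-in, only the BEGIN, FREEBUSY and END lines survive, which is the kind of minimal test case that makes a bug report easy to act on.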

At this point I realised, with slight embarrassment, that my original bug report had been misleading. At such times my ego tempts me to slink off and hope no-one notices my mistake, but that is hardly fair to those who are looking into my report. I thus revised my report with my new finding and updated the bug title [8]. The Thunderbird developers then crunched away, gradually working out what caused the bug, before committing a fix. I updated my bug report to say thanks as I always try to do (though I sometimes forget) — I feel it’s important to acknowledge other people’s efforts in solving a problem that doesn’t affect them directly.

In retrospect, I feel I was rather lucky to stumble upon a good test case for this bug. If Python’s icalendar library hadn’t also choked on the same entry [9], I don’t know what I would have done. It’s unlikely that I would have tried diving into Thunderbird’s innards. Perhaps I might have tried bisecting my iCalendar file, though that would have been quite a bit more effort.
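
For a sense of what bisecting would involve, here is a minimal sketch. It assumes that exactly one entry is at fault and that the failure depends only on whether that entry is present; import_fails is a hypothetical callback standing in for "write these entries to a file and see whether the import fails".

```python
def bisect_events(events, import_fails):
    """Binary search for the single event that makes the import fail."""
    lo, hi = 0, len(events)  # invariant: the bad event is in events[lo:hi]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if import_fails(events[lo:mid]):
            hi = mid  # the bad event is in the first half
        else:
            lo = mid  # otherwise it must be in the second half
    return events[lo]


# Made-up data: 3600 entries, one of which is "bad".
events = [f"event-{i}" for i in range(3600)]
attempts = []


def fake_import_fails(subset):
    attempts.append(len(subset))
    return "event-2022" in subset


culprit = bisect_events(events, fake_import_fails)
print(culprit, len(attempts))
```

For a file of about 3600 entries this needs at most 12 import attempts, so the effort lies mostly in splitting the file into well-formed pieces and re-importing each one by hand.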

quodlibet and libsoup[23]

For several years I’ve used quodlibet as my music player, in no small part because it has a good “skip album” facility that gels well with my preferred listening style. After the same upgrade that caused my Thunderbird woes, something changed which caused quodlibet to abort on load, complaining that “Using libsoup2 and libsoup3 in the same process is not supported”. I don’t think I’d even heard of libsoup before and I certainly didn’t know I had two separate versions installed!

I quickly went rooting around in OpenBSD’s ports system [10], with which I’m fairly familiar. quodlibet had a direct run-time dependency on libsoup2 and so I did the obvious thing of changing this to libsoup3. Surprisingly, this seemed to make things work, even though I couldn’t see how that would have affected the binary package. My subsequent bug report was therefore explicit that I wasn’t convinced that my fix was correct:

Fortunately it seems that simply updating RUN_DEPENDS from libsoup to libsoup3 solves the problem and I end up with a quodlibet that runs (although I’m slightly unsure *why* this solves things, as I don’t know where quodlibet would pick up the impact of RUN_DEPENDS, but that’s probably my ignorance). If someone who understands this could check whether this change is sensible or not, I’d be grateful! Patch at the end of this email.

Sure enough it was soon confirmed by Stuart Henderson that my “fix” wasn’t directly making a difference. Stuart then realised that upgrading the version of quodlibet in ports from 4.4.0 to 4.5.0 sidestepped the libsoup problem. Unfortunately for me quodlibet 4.5.0 only half-worked: on most actions I got the following error, though quodlibet would otherwise work:

AttributeError: 'gi.repository.Soup' object has no attribute 'URI'
------
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/quodlibet/util/cover/manager.py", \
line 120, in failure  run()
  File "/usr/local/lib/python3.9/site-packages/quodlibet/util/cover/manager.py", \
line 136, in run  provider.fetch_cover()
  File "/usr/local/lib/python3.9/site-packages/quodlibet/util/cover/http.py", line \
86, in fetch_cover  if not self.url:
  File "/usr/local/lib/python3.9/site-packages/quodlibet/ext/covers/discogs.py", \
line 63, in url  artist = escape_query_value(artists)
  File "/usr/local/lib/python3.9/site-packages/quodlibet/util/cover/http.py", line \
111, in escape_query_value  return Soup.URI.encode(s, '&')
  File "/usr/local/lib/python3.9/site-packages/gi/module.py", line 123, in \
__getattr__  raise AttributeError("%r object has no attribute %r" % (
AttributeError: 'gi.repository.Soup' object has no attribute 'URI'

At first, I fired off a quick report in case anyone was seeing the same problem. I then forgot about the problem for a few days; when I came back to it, quodlibet was still suffering from the same problem.

When I looked a little more carefully, the Python backtrace gave me a clue as to the likely problem: and, sure enough, if I simply added import Soup.URI to cover/http.py I no longer saw the continual backtraces. A few minutes later I had made a “proper” patch for the OpenBSD port and submitted it to the mailing list.

At this point I was feeling quite smug: I’d solved the problem in such a simple way! Fortunately, the rather more vigilant Stuart noted that my fix simply masked the problem: I had actually introduced an error which caused a different sub-module to be used instead. Even though I’m fairly familiar with Python, I don’t think I’ve ever looked inside quodlibet or gobject and friends — it simply didn’t occur to me that my patch could fail in the way it did.
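
For context, the line that failed in the backtrace is percent-encoding a value for use in a URL query. The sketch below shows that operation using only Python's standard library. It is an illustration of what the call is for, not the fix that was adopted and not a byte-for-byte replacement for Soup.URI.encode; the function name is hypothetical.

```python
from urllib.parse import quote


def escape_query_value_stdlib(value: str) -> str:
    """Percent-encode a value so it is safe inside a URL query string.

    safe="" means characters such as '/' and '&' are escaped too.
    """
    return quote(value, safe="")


print(escape_query_value_stdlib("AC/DC & friends"))  # AC%2FDC%20%26%20friends
```

Seeing what the call is meant to do makes it easier to judge whether a patch has restored that behaviour or merely silenced the error.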

It soon became clear that we had stumbled across a deeper problem that would probably affect other software too. Fortunately for me, Stuart used his considerable skills to start tackling the problem more fundamentally. I tested some patches, and made additional suggestions, in private email, but at this point I was happy that someone more knowledgeable had taken charge! I was even more happy when the eventual fix was to add RTLD_NOLOAD support to OpenBSD’s dynamic linker, since that also meant that I could finally build mold on OpenBSD!

gimp crashing

I have the artistic skills of a flat-footed toad but, from time to time, I need to edit bitmap images. When I need to do so, I use GIMP. Mostly I use it for simple things like scaling, cropping, and (if I’m feeling adventurous) basic colour correction. In the past I’ve experienced segfaults when using GIMP, but they were infrequent. At some point in the past few weeks, they had become very frequent. At first I thought this might be because I often run GIMP from neomutt, but when I ran it directly from a terminal, I still saw regular crashes.

After this happened to me several times while changing the brightness of a photo, I decided that I should at least work out if anyone else was seeing the same problem as me. I suspected that these crashes were not directly GIMP’s fault, but were probably somehow OpenBSD related. Unfortunately I couldn’t work out a simple trigger. Sometimes, for example, I’d see a crash when I saved a file — but not always, even if I followed the same sequence of steps. I therefore did something quick and simple: I ran gdb on gimp, waited until it crashed and then printed a backtrace. From that I posted a brief bug report to OpenBSD’s ports mailing list:

I’m not a heavy user of gimp, but I’ve noticed it segfaulting regularly (e.g. when saving, adjusting colours) whenever I’ve tried using it in the last couple of weeks or more. Having updated my amd64 snapshot + packages this morning to (sic) I’m still seeing problems. Here’s an example backtrace from github [11] which suggests the problem might be in bable? I wonder if anyone else has seen this problem, or has any idea what it might be (or what might have caused it)?

Program received signal SIGSEGV, Segmentation fault.
[Switching to thread 523403]
0x00000445260f63ec in rgbaf_to_Labaf_sse2 ()
   from /usr/local/lib/babl-0.1/x86-64-v3-CIE.so
(gdb) bt
#0  0x00000445260f63ec in rgbaf_to_Labaf_sse2 ()
   from /usr/local/lib/babl-0.1/x86-64-v3-CIE.so
#1  0x00000444488ef45c in babl_fish_lut_process_maybe ()
   from /usr/local/lib/libbabl-0.1.so.1.8
#2  0x00000444488eddab in babl_fish_path_process ()
   from /usr/local/lib/libbabl-0.1.so.1.8
#3  0x00000444488ee04c in babl_process_rows ()
   from /usr/local/lib/libbabl-0.1.so.1.8
#4  0x0000044526c67852 in gegl_buffer_iterate_read_simple ()
   from /usr/local/lib/libgegl-0.4.so.0.5
...

It is not a great bug report, although it isn’t the worst I’ve ever filed. I hoped that other people might realise they were seeing similar problems and that might, perhaps, help us narrow down what’s going on. No-one else replied, perhaps because the problem only affects me, or perhaps because, like me, most people felt only mildly irritated by the problem. At this point I decided to cut my losses and move on. When I tried GIMP again a week or so later, the problems seemed to have disappeared, and then a week later, they seemed to be back. Why? I have no idea!
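A backtrace like the one in the report above can be summarised mechanically to see which libraries the crashing frames belong to. What follows is a minimal Python sketch and not part of the original report: the function name `frames_by_library` and both regular expressions are assumptions based only on the gdb output format shown above, and frames that have no `from` line are simply skipped.

```python
import re
from collections import Counter

# Frame header lines look like "#0  0x... in function_name ()".
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(\S+)\s*\(")
# gdb prints the owning shared object on a continuation line: "   from /path/lib.so".
FROM_RE = re.compile(r"^\s+from\s+(\S+)")


def frames_by_library(backtrace: str) -> list:
    """Return (frame number, function, library path) for each frame
    that is followed by a "from" line; other lines are ignored."""
    frames = []
    current = None
    for line in backtrace.splitlines():
        header = FRAME_RE.match(line)
        if header:
            current = (int(header.group(1)), header.group(2))
            continue
        origin = FROM_RE.match(line)
        if origin and current is not None:
            frames.append((current[0], current[1], origin.group(1)))
            current = None
    return frames


# The visible frames from the report, including the elided tail.
BACKTRACE = """\
#0  0x00000445260f63ec in rgbaf_to_Labaf_sse2 ()
   from /usr/local/lib/babl-0.1/x86-64-v3-CIE.so
#1  0x00000444488ef45c in babl_fish_lut_process_maybe ()
   from /usr/local/lib/libbabl-0.1.so.1.8
#2  0x00000444488eddab in babl_fish_path_process ()
   from /usr/local/lib/libbabl-0.1.so.1.8
#3  0x00000444488ee04c in babl_process_rows ()
   from /usr/local/lib/libbabl-0.1.so.1.8
#4  0x0000044526c67852 in gegl_buffer_iterate_read_simple ()
   from /usr/local/lib/libgegl-0.4.so.0.5
...
"""

frames = frames_by_library(BACKTRACE)
libraries = Counter(path for _, _, path in frames)
babl_frames = [frame for frame in frames if "babl" in frame[2]]

for path, count in libraries.most_common():
    print(count, path)
```

Four of the five visible frames come from babl's libraries, which is consistent with the guess in the report, though a tally like this says nothing about the root cause.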

neovim-qt

It’s not unusual for bug reports to take weeks, or longer, to resolve. In early July I filed a bug report against neovim stating that Shift-Backspace no longer worked. It soon transpired that it wasn’t neovim but neovim-qt that had changed behaviour (the bug report briefly shifted to the neovim-qt repository before shifting back to the neovim repository).

To cut a long story short, some users don’t like the fact that odd characters are inserted if they accidentally press Shift-Backspace in neovim’s terminal mode, so neovim-qt had started to ignore Shift-Backspace entirely. However, some odd people like me use Shift-Backspace in normal editing mode [12].

The neovim-qt authors have thus found themselves caught between two seemingly incompatible user desires. After suggesting a couple of probably impractical solutions, in the week relevant to this blog post I finally came up with a possible solution that might keep everyone happy — neovim-qt could bind Shift-Backspace to a no-op by default. It’s a compromise, but it’s not the worst compromise I’ve ever suggested!
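The shape of that compromise can be shown with a toy model. This Python sketch is purely illustrative and is not neovim-qt's code or configuration format: the names `DEFAULT_BINDINGS`, `make_keymap`, `no_op`, and `delete_previous_word` are hypothetical, and it assumes that users would be able to override the default binding.

```python
from typing import Callable, Dict

# An action takes the current buffer text and returns the new text.
Action = Callable[[str], str]


def no_op(text: str) -> str:
    """Swallow the key press: the buffer is left unchanged."""
    return text


def delete_previous_word(text: str) -> str:
    """Drop trailing whitespace and then the word before it."""
    stripped = text.rstrip()
    cut = stripped.rfind(" ")
    return stripped[: cut + 1] if cut != -1 else ""


# Hypothetical default: Shift-Backspace does nothing unless rebound,
# so an accidental press inserts no odd characters.
DEFAULT_BINDINGS: Dict[str, Action] = {"<S-BS>": no_op}


def make_keymap(user_bindings: Dict[str, Action]) -> Dict[str, Action]:
    """Start from the defaults and let user bindings take precedence."""
    keymap = dict(DEFAULT_BINDINGS)
    keymap.update(user_bindings)
    return keymap


default_keymap = make_keymap({})
custom_keymap = make_keymap({"<S-BS>": delete_previous_word})

print(repr(default_keymap["<S-BS>"]("hello brave world")))
print(repr(custom_keymap["<S-BS>"]("hello brave world")))
```

Under these assumptions, people who press the key by accident see nothing happen, while people who rely on it can still bind it to whatever they like.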

Note that my suggestion came over a month after I filed the initial report. This isn’t a reflection on the neovim-qt or neovim authors: it’s an example of a hard bug where no-one involved can think of an easy fix, and so things rumble on for quite some time.

Closing thoughts

I’ve developed, and maintain, quite a bit of open-source software. It is possible to view bug reports as ego-bruisers, each reminding me that I am incapable of creating correct software. Frankly, I’ve largely come to terms with my own limitations in that regard and now when a new bug report comes in I think only about its impact on my time and code quality. At the very least I have to understand whether I think the report is correct or not; if it is correct I have to decide if I want to try fixing it; and then if I decide to try fixing it, I have to spend however long that takes, and also do my best to make sure it doesn’t break anything else.

I therefore aim to be sympathetic to other developers when I file bug reports. Even if they dismiss my report in 30 seconds, I’ve still taken 30 seconds of their time — and many developers will cheerfully spend hours trying to understand and fix bugs. One might perhaps think that the only ethical thing I can do is to spend the time to fully fix a bug before reporting it — but, even if I was sufficiently skilled, I don’t think I have enough waking hours to do that!

There is thus an interesting tension when reporting bugs: the more effort I put in, the more likely the bug is to be resolved, but I can’t put maximum effort into every bug. Instead, I have to triage the bugs that I encounter before deciding which to report, and how far to go in my reporting. The major factors I consider are the severity of the bug, how often it affects me, my ability to sensibly describe it, and whether I think I can fix it.

My familiarity with a program is a crucial factor: if it’s a new program to me, I might struggle to describe what I’ve done to encounter a bug. If I think I should try making a fix, my familiarity with the program’s innards is key: though I’m far from an expert, I know parts of the OpenBSD kernel code well enough to fancy my chances of fixing certain minor hardware bugs; but I have never looked at Thunderbird’s source code, and I wouldn’t even know where to start!

Familiarity with, and within, a community also makes a difference. Though decidedly on the periphery, I’ve been semi-active on OpenBSD mailing lists for over 20 years. Not only do I know where to report different kinds of bugs, but I have a good understanding of that community’s expectations. I can also rely on some developers remembering my name which is one reason why, when someone suggests a fix to a bug I’ve filed, I always make sure that I test and report back on it: even if people think I’m an idiot, it can’t hurt if they at least think I’m a conscientious idiot. In contrast, it took me 20 minutes just to work out where [13] to report the Thunderbird bug, and I had no idea how it would be received. It was pleasing that the bug was taken seriously.

Although it’s often forgotten about, or resented because of the suggestion of an obligation, developer responsiveness is a significant factor in how often, and how well, people report bugs. Frankly, there are few things more demotivating than spending time filing a good quality bug report and hearing nothing back. Not only does it make it less likely that I’ll report further bugs, but it makes it more likely that I’ll move to an alternative that is more actively maintained. In case there’s any doubt, I want to emphasise again that open-source developers don’t owe me, or anyone else, anything. However, I’ve encountered a number of developers who are disappointed that their user base is smaller than they would like it to be, but who take months to respond to good quality bug reports, if they bother at all. We might not like the suggestion of a link between those two factors, but it seems to me inevitable.

You can see several of the factors above across the bug reports in this post [14]. For example, since I don’t use GIMP a great deal, I wasn’t willing to spend much time on finding the root cause of the bug. In contrast, since I really did want to be able to use the right-hand mouse button on my laptop I was willing to spend a lot of time finding a fix.

Over time I’ve tended to divide bug reports into the following categories, from “highest” to “lowest” quality:

  1. A test case and a complete fix.
  2. A test case and a partial fix.
  3. A test case alone.
  4. A partial description of how to trigger the bug.
  5. A description of the effects of the bug.
  6. No useful details (in essence “it doesn’t work” and nothing more).

I would consider bug reports in #1 to be “definitive”; #2 and #3 are often enough for developers to work out what’s happened and if/how a fix might be developed; #4 and #5 can go either way; and #6 is worse than useless. My sense is that reports in #6 become disproportionately more common in more popular projects, perhaps as the average user becomes less experienced?
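The categories above can be written down as a small data structure. This is a minimal Python sketch under stated assumptions: the names `ReportCategory`, `BugReport`, and `categorise` are hypothetical, and the rule that a report without a test case is judged only by its description is one reading of the list, not something the list itself states.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class ReportCategory(IntEnum):
    """Lower numbers mean higher-quality reports, as in the list above."""
    TEST_CASE_AND_COMPLETE_FIX = 1
    TEST_CASE_AND_PARTIAL_FIX = 2
    TEST_CASE_ONLY = 3
    PARTIAL_TRIGGER = 4
    EFFECTS_ONLY = 5
    NO_USEFUL_DETAILS = 6


@dataclass
class BugReport:
    has_test_case: bool = False
    fix: Optional[str] = None  # "complete", "partial", or None
    describes_trigger: bool = False
    describes_effects: bool = False


def categorise(report: BugReport) -> ReportCategory:
    """Map the contents of a report onto one of the six categories."""
    if report.has_test_case:
        if report.fix == "complete":
            return ReportCategory.TEST_CASE_AND_COMPLETE_FIX
        if report.fix == "partial":
            return ReportCategory.TEST_CASE_AND_PARTIAL_FIX
        return ReportCategory.TEST_CASE_ONLY
    # Assumption: without a test case, only the description counts.
    if report.describes_trigger:
        return ReportCategory.PARTIAL_TRIGGER
    if report.describes_effects:
        return ReportCategory.EFFECTS_ONLY
    return ReportCategory.NO_USEFUL_DETAILS


effects_only = categorise(BugReport(describes_effects=True))
with_test = categorise(BugReport(has_test_case=True, describes_effects=True))
print(effects_only.name, with_test.name)
```

Using an `IntEnum` keeps the ordering of the list, so a comparison such as `categorise(r) <= ReportCategory.TEST_CASE_ONLY` expresses "category #3 or above".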

In general, I would like to file bug reports in at least category #3 or above, but often all I can manage is #4 or #5 (e.g. the video display problems on the Framework), and sometimes all I’m willing to do is #5 (e.g. the gimp bug). Of course, the lower the quality of my report, the less likely it is that other developers will care, or that they will be able to do anything with my report. This is no reflection on them: I have paid nothing for my open-source software and have no expectation of anything in return.

One thing that I’ve realised from writing this post is that the lower the category of my initial report, the less likely it is to be correct. My initial report of the Thunderbird iCalendar bug was ostensibly in #3 but since I pinpointed the wrong cause, arguably it was in #4: only after I realised my blunder did it really move into #3. Did I file my initial reports too soon? Perhaps — but if I wait too long in reporting a bug, I tend to forget important details, and sometimes lose motivation to report anything at all. I’m not quite sure where the right balance is.

One thing you can clearly see from this post is that even good bug reports, whatever their initial category, rarely involve just the initial report. There is often a continuing dialogue between reporter and developer as the latter tries to understand what has happened and the former helps narrow down causes or tries possible fixes. Frequently, that dialogue will go on for days, and sometimes weeks. Knowing when to ping people to reanimate a bug report is a real skill, and not one that I always get right!

Finally, and despite what this post might suggest, I am firmly of the opinion that modern software has attained an astonishingly high level of quality. I suspect that if I were to try replicating Dan Luu’s 2014 One Week of Bugs experiment now, I would encounter fewer bugs, and fewer bad bugs. Better development processes (e.g. continuous integration, more extensive testing) are one factor behind this improvement. But another factor is the cumulative effect of people taking the time and effort to report bugs. I consider the time I’ve spent reporting bugs to be time very well spent!

Acknowledgements: thanks to Edd Barrett, Dan Luu, Davin McCall, and Hillel Wayne for comments.

2022-08-31 08:00
Footnotes

[1]

When writing this post I vacillated between using “developers” or “maintainers”. To some extent the two terms are synonyms in our context, but not quite — I actively “develop” some software but minimally “maintain” others. On balance, I thought that more people might feel themselves “developers” than “maintainers”, but if you prefer to read the post using the latter term, please feel free to do so!

[2]

Although I don’t have hard data to back this up, my intuition is that most long-lived bugs are the result of correctly implementing an ill-thought-through specification. In most cases, I suspect the specification is incomplete, though in some cases it is simply wrong. That’s not a criticism: most software is necessarily bespoke and we struggle to understand what it should really do until it’s actually in our hands.

[3]

In order: a T40 (decent), a T43 (good), a T410s (terrible), an X220 (excellent), an X1 Carbon 3rd gen (good), and finally an X1 Carbon 6th gen (very good). Apart from the T410s, which had a terrible screen, terrible battery life, and was far too heavy, I had a very good run with Thinkpads.

[4]

I was very lucky that a) neovim’s language server support is now very good and b) I had clangd installed. Navigating around this very unfamiliar part of OpenBSD’s kernel was thus much faster than it would otherwise have been. I’d probably have given up a couple of years ago.

[5]

Because I don’t use it very often, I didn’t notice that the middle mouse button also stopped working when the trackpad was not identified as a clickpad.

[6]

As a result of my time using korganizer, my calendar is still ~/.korganizer.ics!

[7]

Exactly why Thunderbird dislikes this line is unclear to me, but looking at the specification for this property makes me wonder if the lack of timezones here is actually the problem. But I’m not going to cry wolf about missing timezones a second time!

[8]

I soon received a very nice private email from a Thunderbird developer thanking me for updating my report!

[9]

Writing this, I realise that I should also file a bug against the Python library!

[10]

Binary “packages” are created from descriptions called “ports”.

[11]

I clearly meant to write “gdb” here, but I spend half of my life reviewing GitHub pull requests, so presumably that was on my brain as I typed!

[12]

For me it means “delete previous word”, a keyboard shortcut I suspect I picked up from StrongED 25 years ago, although I can no longer remember for sure.

[13]

The widespread use of GitHub, and its issue tracking, has made it much easier, in general, to work out where to file bug reports for arbitrary open-source software. This is not to say that GitHub’s issue tracking is perfect – indeed, it has a number of flaws and limitations – but at least I know what to expect.

[14]

In this post I’ve detailed 4 (or, depending on how you count things, 5) new bug reports I made in a single week. I also included an additional ongoing bug report I was involved in. I haven’t included bug reports that people made on my software in that week — I think this post is quite long enough already!
