Game Design, Programming and running a one-man games business…

Anatomy of a rare, and weird heisenbug in Democracy 4

Heisenbugs are bugs in code which seem to go away, or change when you look at them in a debugger. They are the worst possible bugs, because they actively resist being found, or even observed by the coder responsible. Any kind of bug that can be reproduced on a developer’s machine, in their development environment can easily be analyzed, stepped through and reasoned with. But heisenbugs are a nightmare. I just (I think) fixed one, and certainly fixed A bug, even if not this one. I thought you might enjoy the process…

First some background. Democracy 4 is the fourth and latest in my series of political strategy games. They are text and diagram heavy, with a properly unique user interface, in 2D, and all of the code is custom, with a custom engine originally coded by me, with a unicode text implementation and vector rendering system coded by jeff from stargazy studios. This means all of the UI elements you take for granted in middleware you use, were custom coded by us. In this case, the bug is my fault, in code written by me. Here is a screenshot:

Then key thing that matters there is the text block under ‘food stamps. This is a policy screen in the game, and that text is the description. Depending on the length of the description (partly influenced by the language you play in), and the screen resolution, it might be that all the text fits in the available space, or it does not. That means sometimes there is a scrollbar to the right of that text, and sometimes its not there. As you would imagine, the scrollbar works just like any windows scrollbar. That means it responds to mousewheels, and also clicks, on the top and bottom buttons (very small) for example.

…anyway…

VERY rarely. And ONLY when I was playing the game outside the debugger, and only in very arcane sets of circumstances that seemed totally utterly random… screens like this (but not just this one) would occasionally not let me click on some elements in the interface. They would not respond at all… BUT… that block of text would scroll upwards, like it was being scrolled through. It was like completely unrelated events (clicking somewhere else) activated a scroll function on a user element I was nowhere near.

This was of course, absolutely infuriating, because it was REALLY rare. And the times I tried crazily to reproduce it in the debugbuild, or even in the release build, but launched from Visual C++, it would not happen. And there were NO reliable steps to reproduce. In fact, even quitting a screen when it DID happen, and then returning to that screen would make it go away.

What the hell?

After a LOT of time, I realized two interesting things. Firstly, when it happened, the game DID make the click sound associated with a mouse click (so the game IS realizing I clicked somewhere) and the scrolling only happened with text inside a specific UI element called a GUI_TextContainer. Although I did not realize it at the time, it was also the case that it ONLY happened where that text container did NOT have a visible scroll bar.

The clue to the bug lies in a closeup of the scrollbars used:

Like all such bars, there is a clickable ‘scrollup’ button, a draggable bar, with above/below page up/down regions, and a clickable down scroll button.

When I create a new GUI_TextContainer, I created a new GUI_ScrollingWindow, which is a container with all of those elements in side. Like any child element, I added it to the Text Container object with AddChild() to ensure it responds correctly and is processed blah blah. Thjn, later when the textwindow is initialized with its position, and the relevant translated text, I do a lot of calculating to work out if that text *is* going to fit in the available space. If not, I set the scrollbar in the correct position on the right, and make a note that the text does not fit (sets BFits to false). Then later, when drawing this UI element, I could skip out a lot of nonsense I don’t need if the text fits, and render it simply as a word-wrapped string. If It does NOT fit, I then drew the scrollbar, and the text in a different way, cropping the text render to the space given the scroll position and so on, blah blah.

All of this sounds reasonable. But it was a big mistake.

I was correctly NOT drawing the scrollbar if it was not needed (why bother! the text fits!) but I had pre-emptively added it as a child of the window anyway. Worse still, I had never initialized the position of the scrollbar and its sub-components AT ALL. This means that in debug / release-from-the-development-environment, all of the relevant variables in that window were getting set to zero, But in a live ‘customer’ environment, the initial positions of those subcomponents could be ANYTHING.

So what could happen, really rarely, is that the ‘scroll up’ or ‘page up’ parts of the scrollbar, might have coordinates like -32054,-5128,27140,41543. In this case, those clickable elements filled the whole screen… but were INVISIBLE. Thus, I have a huge UI mess-up, but I cannot see it, because my ‘quick and easy bath’ that checked BFits, doesn’t draw it.

So very rarely, I create a screen with an invisible scroll up button that fills the entire screen, and happily responds to every click with a scroll-up command. It even nicely plays the button click sound. Luckily, these buttons do not handle keys, so the escape key still quits that whole screen, and the next time it gets created, I’m probably lucky and the invisible bar has moved somewhere harmless.

I’ve changed the code now, to be efficient so that I do not even THINK about creating this scrollbar until I need it, and otherwise its NULL. I also make sure to safely delete it before creating it, in case somehow I go mad and initialize the same text container twice with different blocks of text. I could also have fixed it by initializing the position of a scrollbar safely to 0,0,0,0, but I also like the neatness of ensuring I don’t have a pointless orphan UI element at any point.

This was a classic dumb error. I added an element when I shouldn’t have, and its member variables were not safe. My own dumb slackness. I document it clearly to illustrate how the weird unrelatedness of a bug ‘clicking down here, scrolls text up there. sometimes. only in German, on a Wednesday etc…’ eventually makes total sense.

Hundreds of thousands of people have played probably over a million hours of the game, with this bug in it. No code is bug free, but if you do things right, the ones left in are pretty arcane.


3 thoughts on Anatomy of a rare, and weird heisenbug in Democracy 4

  1. It always surprises me that it’s the debug build that initialises everything to 0, and the release build that leaves them as random junk.
    Surely it’d be safer the other way round, or consistent (one way or the other) or the debug build should force a garbage value.

    There may be a static analyser that’ll pull you up on uninitialised values, but sometimes they’re blind to unset OS objects.

    1. Sorry, not Intel! AIX was IBM. (AIX and Pains, as was commonly snarked. The occasional clever, esoteric compiler feature aside, it wasn’t what one would generally term a “good” OS.)

      1. (This was a followup to a previous comment, which I now see didn’t get posted even though its sequel did. I’m assuming the first was just held for moderation, so it’ll all make sense once it appears. If it never appears, then this is out-of-context and irrelevant. But I’ll try to remember to post the first part again, to resolve it.)

Comments are currently closed.