A String Compare Bug

This is a bit of a departure from the current series. I encountered a bug last week that, while ultimately trivial to fix, was quite difficult to track down.

To start with, the root cause and how to fix it:

I rewrote my shader handling to work indirectly: instead of holding OpenGL handles directly, the renderer has an index into an array of shaders. The ultimate goal is that the asset system can recompile shaders at runtime and swap them out at the top of a frame, with the renderer none the wiser.
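As a rough sketch of the shape (hypothetical names and sizes, not my exact code; assumes the GL typedefs are in scope):

struct ShaderTable {
    static const u32 max_shaders = 256;
    GLuint handles[max_shaders]; // slot i holds the current GL program for shader i
};

// The renderer stores a ShaderIndex, never a GLuint. It resolves the index at
// draw time, so when the asset system overwrites handles[i] with a freshly
// compiled program between frames, the next draw picks it up automatically.
typedef u32 ShaderIndex;

GLuint resolve_shader(ShaderTable * table, ShaderIndex index) {
    return table->handles[index];
}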

In order to get that index in the first place, the renderer does a lookup using a name string. Asset names are stored in a NameString, which has a character pointer and size. However, the lookup query uses a raw C string. Herein lies the bug:

struct NameString {
    u32 len;
    char * str;

    bool operator==(char * other) {
        return (0 == strncmp(other, str, len));
    }
};

At first glance, this returns whether the strings match. However, it *actually* returns whether the first len characters match. If str is a prefix of other, this will return true, even though the full strings differ.
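A contrived example (hypothetical values) makes the failure mode obvious:

NameString name = { 3, (char *)"cat" };

// strncmp("catalog", "cat", 3) only compares "cat" against "cat":
bool matches = (name == (char *)"catalog"); // true, despite the strings differing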

The fix is to actually compute the length of the C string, and compare the lengths as well. As a precaution against bad data, we'd prefer strnlen over strlen. To keep the check correct, we pass a maximum length of len + 1 to strnlen: any string longer than len still reports a length of len + 1, so the length comparison fails.

struct NameString {
    u32 len;
    char * str;

    bool operator==(char * other) {
        // strnlen(other, len + 1) returns min(strlen(other), len + 1), so any
        // string longer than len reports len + 1 and fails the length check.
        u32 other_len = strnlen(other, len + 1);
        return (other_len == len) && (0 == strncmp(other, str, len));
    }
};
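Re-running the contrived example from above against the fixed version:

NameString name = { 3, (char *)"cat" };

// strnlen("catalog", 4) returns 4, which != 3, so we never reach strncmp:
bool matches = (name == (char *)"catalog"); // false, as it should be

// An exact match still works: strnlen("cat", 4) == 3, and strncmp agrees.
bool exact = (name == (char *)"cat"); // true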

(The *proper* fix, of course, is *don’t use strings for this*. That’s the direction I want to move towards.)
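(Purely as an illustration of one possible direction, and not code from this project: hash each name once at load time, then compare fixed-size IDs from then on. A minimal sketch using 64-bit FNV-1a:)

struct AssetId {
    u64 hash;

    bool operator==(AssetId other) { return hash == other.hash; }
};

// FNV-1a, one common choice for this kind of name hashing.
AssetId make_asset_id(const char * name) {
    u64 h = 14695981039346656037ull; // FNV offset basis
    for (; *name; ++name) {
        h ^= (u64)(unsigned char)*name;
        h *= 1099511628211ull; // FNV prime
    }
    return AssetId { h };
}

(A lookup then compares a single integer, and a prefix can never masquerade as a full match.)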

So that's the error, but not quite the full story of the bug. I think it's worth going into detail here, because it really underscored, for me, the way a subtle error can have wide, unexpected effects.

The problem originally showed itself as serious noise in the screen-space ambient occlusion (SSAO) buffer:
[Screenshot 2017-12-15 01.52.21: heavy noise in the SSAO buffer]

My implementation of SSAO has two passes: the first takes the depth buffer, reconstructs normals, and then tests points in a hemisphere above the surface. The percentage of the points that fall behind the depth buffer is used as the occlusion value. (The details get a bit messier but aren't really relevant at the moment.) To get better coverage with fewer samples, the set of test points is rotated based on a noise texture tiled over the screen. The second pass takes this result and blurs it, using a blur exactly as big as our noise texture, which smooths out the result.

After determining I wasn’t getting NaNs or infinities anywhere, my initial thought was that the blur pass was simply ineffective. Noisy artifacts around depth discontinuities are expected when reconstructing normals from depth, and that’s similar to what we’re seeing here. The blurring *should* mitigate this.

In an attempt to fix it, I modified the blur shader to take the depth buffer as an input, and weight samples in the blur based on how close (depth-wise) they were to the current pixel. And it worked! I got a smooth(er) output:
[Screenshot 2017-12-15 01.33.36: smoother SSAO output after the depth-weighted blur]

Of course, this wasn’t the actual error. (And this isn’t the correct output, either!)

I wanted to get a screenshot of the bug for this blog, so I commented out the changes to my blur shader… and nothing changed: the bug did not reappear.

Well, if I can’t *UN*fix a bug, I probably haven’t actually fixed it.

The one change I had not backed out was binding the depth buffer. Backing that out reintroduced the bug. What’s more, if I bound *another* texture (say, the color framebuffer), I got a glitchy-but-recognizable copy of that texture’s red channel in the SSAO map.

Exasperated, I went to bed.

The next morning, I finally sat down and stepped through in the debugger. And noticed, finally, that the bound shader was the same for both SSAO passes. Ah.

The main SSAO pass shader was named “ssao”; the SSAO blur shader was named “ssao_blur”. Since “ssao” is a prefix of “ssao_blur”, the string comparison bug this post began with made the blur pass’s lookup match the main shader, so both passes ran the main shader. The original visual artifact was due to having the depth input unbound while running the pass. (Obviously it was reading *something*, but I’m not sure what.) Once I bound the depth input for (what I thought was) the updated blur shader, the smoother output was just the main SSAO output.

Fixing the bug for real gives us the blurred version of that, and the correct output:
[Screenshot 2017-12-19 21.28.16: the correct, blurred SSAO output]

There are a few things I could have done to catch this bug earlier and with less frustration.

– Warn or assert on invalid shader uniform names. I had expected the GL driver’s debug mode to catch this; obviously it didn’t. glGetUniformLocation returns -1 for uniform names not found in the shader, and I need to check for that myself, because the driver is perfectly happy binding to that location. (This would have caught the bug immediately, when “ssao_input” wasn’t found in the main SSAO shader.) See the sketch after this list.

– First priority is to check the code I just changed, not code that is “obviously” related to the bug. The *only* issue with the SSAO pass was using the incorrect shader (the depth-sensitive blur helps, but it wasn’t a big difference). Even if I’m sure the code I just wrote only uncovered a bug rather than causing it: verify that.

– I don’t understand a bug if I can’t force it to happen again.

– Use the debugger more. Use it!
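For the first point, the check is small. Here’s a sketch of the sort of wrapper I mean (name hypothetical; assumes the usual GL and assert headers). Note that glUniform* calls silently ignore location -1, which is exactly how this bug stayed invisible:

GLint get_uniform_checked(GLuint program, const char * name) {
    GLint location = glGetUniformLocation(program, name);
    // -1 means the linked program has no active uniform with this name.
    // (Uniforms optimized out by the compiler also report -1, so a warning
    // may be more practical than a hard assert in some cases.)
    assert(location != -1);
    return location;
}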