A String Compare Bug

This is a bit of a departure from the current series. I encountered a bug last week that, while ultimately trivial to fix, was quite difficult to track down.

To start with, the root cause and how to fix it:

I rewrote my shader handling to work indirectly — instead of holding OpenGL handles directly, the renderer has an index into an array of shaders. The ultimate goal being that the asset system can recompile shaders at runtime, swap them out at the top of a frame, and the renderer is none the wiser.
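
Roughly, the shape of that indirection (a simplified sketch; the names and layout here are illustrative, not the engine’s actual code):

typedef unsigned int u32;   // matching the engine's u32 convention

// Live GL program handles sit in a single array, owned by the asset system.
struct ShaderCatalog {
    u32 count;
    u32 programs[64];       // glCreateProgram() handles
};

// The renderer stores an index, not a handle.
struct DrawPass {
    u32 shader_index;       // stays valid across recompiles
};

// When the asset system recompiles a shader, it overwrites
// programs[shader_index] in place; the renderer keeps using the same
// index and never notices the swap.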

In order to get that index in the first place, the renderer does a lookup using a name string. Asset names are stored in a NameString, which has a character pointer and size. However, the lookup query uses a raw C string. Herein lies the bug:

struct NameString {
    u32 len;
    char * str;

    bool operator==(char * other) {
        return (0 == strncmp(other, str, len));
    }
};

At first glance, this returns whether the strings match. However, it *actually* returns whether the first len characters match. If str is a prefix of other, this will return true even if the full strings don’t match.
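
Concretely, with made-up names:

// Hypothetical example: a NameString holding "sun" (len == 3).
NameString name = { 3, (char *)"sun" };

// strncmp("sunset", "sun", 3) only looks at the first three characters and
// returns 0, so this comparison comes back true even though the strings differ.
bool matches = (name == (char *)"sunset");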

The fix is to actually compute the length of the C string and compare the lengths as well. As a precaution against bad data, strnlen is preferable to strlen. To keep the check correct, the strnlen cap is len + 1: a C string longer than len still reports a length greater than len, so the equality test fails:

struct NameString {
    u32 len;
    char * str;

    bool operator==(char * other) {
        // Cap the scan at len + 1: a C string longer than len reports a
        // length of len + 1, so the equality check below fails.
        u32 other_len = strnlen(other, len + 1);
        return (other_len == len) && (0 == strncmp(other, str, len));
    }
};

(The *proper* fix, of course, is *don’t use strings for this*. That’s the direction I want to move towards.)
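
One shape that could take (purely a sketch, nothing I’ve committed to): hash each name once when the asset is registered, and compare integer IDs from then on:

typedef unsigned long long u64;

// 64-bit FNV-1a, as one option: turn a name into an integer ID at
// registration time.
static u64 hash_name(const char * s) {
    u64 h = 14695981039346656037ULL;
    while (*s) {
        h ^= (unsigned char)(*s++);
        h *= 1099511628211ULL;
    }
    return h;
}

// Lookups then compare fixed-size integers; no strncmp, no length pitfalls.
struct AssetId {
    u64 value;
};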

So that’s the error, but not quite the full story of the bug. I feel like it’s worth going into detail here, because it really underscored, to me, the way a subtle error can have wide, unexpected effects.

The problem originally showed itself as serious noise in the screen-space ambient occlusion (SSAO) buffer:
[Screenshot 2017-12-15 01.52.21: heavy noise in the SSAO buffer]

My implementation of SSAO has two passes: the first takes the depth buffer, reconstructs normals, and then tests points in a hemisphere above the surface. The percentage of the points that fall behind the depth buffer is used as the occlusion value. (The details get a bit messier but aren’t really relevant at the moment.) To get better coverage with fewer samples, the set of test points is rotated based on a noise texture tiled over the screen. The second pass blurs this result with a kernel exactly as big as the noise texture’s tile, which smooths out the noise.

After determining I wasn’t getting NaNs or infinities anywhere, my initial thought was that the blur pass was simply ineffective. Noisy artifacts around depth discontinuities are expected when reconstructing normals from depth, and that’s similar to what we’re seeing here. The blurring *should* mitigate this.

In an attempt to fix it, I modified the blur shader to take the depth buffer as an input, and weight samples in the blur based on how close (depth-wise) they were to the current pixel. And it worked! I got a smooth(er) output:
[Screenshot 2017-12-15 01.33.36: smoother SSAO output after the depth-weighted blur]

Of course, this wasn’t the actual error. (And this isn’t the correct output, either!)

I wanted to get a screenshot of the bug for this blog, so I commented out the changes to my blur shader… and nothing changed; the bug did not reappear.

Well, if I can’t *UN*fix a bug, I probably haven’t actually fixed it.

The one change I had not backed out was binding the depth buffer. Backing that out reintroduced the bug. What’s more, if I bound *another* texture (say, the color framebuffer), I got a glitchy-but-recognizable copy of that texture’s red channel in the SSAO map.

Exasperated, I went to bed.

The next morning, I finally sat down and stepped through in the debugger. And noticed, finally, that the bound shader was the same for both SSAO passes. Ah.

The main SSAO pass shader was named “ssao”, and the SSAO blur shader was named “ssao_blur”. Due to the string comparison bug this post began with, both passes ran the main shader. The original visual artifact was due to having the depth input unbound while running the pass. (Obviously it was reading *something*, but I’m not sure what.) Once I bound the depth input for (what I thought was) the updated blur shader, the smoother output was just the main SSAO output.

Fixing the bug for real gives us the blurred version of that, and the correct output:
[Screenshot 2017-12-19 21.28.16: correct, blurred SSAO output]

There are a few things I could have done to catch this bug earlier and with less frustration.

– Warn or assert on invalid shader uniform names. I had expected the GL driver’s debug mode to catch this; obviously it didn’t. glGetUniformLocation returns -1 for uniform names not found in the shader, and I need to check for that myself, because the driver is perfectly happy binding to that location. (This would have caught the bug immediately, when “ssao_input” wasn’t found in the main SSAO shader; a sketch of that check follows this list.)

– First priority is to check the code I just changed, not the code that is “obviously” related to the bug. The *only* issue with the SSAO pass was that it used the wrong shader (the depth-aware blur helps, but it wasn’t the big difference). Even if I’m sure the code I just wrote only uncovered a bug rather than caused it: verify that.

– I don’t understand a bug if I can’t force it to happen again.

– Use the debugger more. Use it!
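
For the first point, here’s roughly the check I should have had in place (a sketch; the helper is illustrative and assumes the usual OpenGL headers/loader are included):

#include <cassert>
#include <cstdio>

// Look up a uniform and fail loudly if the name doesn't exist in the
// program, instead of silently passing -1 along to glUniform*.
static GLint get_uniform_checked(GLuint program, const char * name) {
    GLint loc = glGetUniformLocation(program, name);
    if (loc == -1) {
        // glUniform* calls with location -1 are silently ignored, which is
        // exactly how this bug stayed hidden.
        fprintf(stderr, "Uniform '%s' not found in program %u\n", name, program);
        assert(!"invalid uniform name");
    }
    return loc;
}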

Python Locale Heisenbugs

A “Heisenbug” is a bug that only occurs under very specific circumstances (classically, one that seems to vanish as soon as you try to observe it). As I’ve been discovering this week, localization creates quite a few of these.

Localization is a tough problem to crack, and no language really gets it right. With the nearly endless combinations of character encodings and regional formatting quirks, covering every case is nigh impossible, so every language has a few corners that fail to work correctly under various locales.

Consider the following code:

try:
    with open(filename) as f:
        return f.read()
except IOError, ioerr:
    logger.error("Failed to open file: %s" % str(ioerr))

This has a few issues on a non-ascii system:
– it will not cope properly with the encoding of the file opened.
– it will not cope with any encoded characters in the error message from the system.

Here’s a good attempt at fixing it:

try:
    with codecs.open(filename, encoding="utf-8") as f:
        return f.read()
except IOError, ioerr:
    logger.error(u"Failed to open file: %s" % unicode(ioerr))

In reality, if an IOError is ever raised with a message containing non-ascii characters, the logging call fails, the exception propagates up the stack, and your program spews a traceback on stderr. That traceback, assuming it’s visible at all (and not redirected to /dev/null or similar), mentions only the UnicodeDecodeError, not the original error opening the file.

What’s breaking? Calling unicode() on an IOError effectively calls str() on it and then decodes the resulting byte string using the ascii codec. So if the error message returned from the system is non-ascii, unicode() fails to convert it. Furthermore, unicode(), when called on an arbitrary object rather than a string, does not allow specifying a particular encoding.

So how do we fix this?

The intermediate fix, given that Linux systems generally use utf-8 for error messages:

try:
    with codecs.open(filename, encoding="utf-8") as f:
        return f.read()
except IOError, ioerr:
    logger.error(u"Failed to open file: %s" % unicode(str(ioerr), 'utf-8'))

This works! …at least, until you try to run this code on a Windows machine. The standard encoding on Windows machines is not utf-8, and therefore we yet again get a UnicodeDecodeError.

The final, fixed code:

try:
    with codecs.open(filename, encoding="utf-8") as f:
        return f.read()
except IOError, ioerr:
    logger.error(u"Failed to open file: %s" % unicode(str(ioerr), sys.stderr.encoding))

Our best guess for the encoding of error messages is the encoding of stderr. In theory that isn’t guaranteed to match, but it works across Linux and Windows, so we’ll call it good enough.

But wait, there’s more!

We’ve seen that messages from the system are susceptible to encoding issues, but there are other places where locale plays a critical but hidden role.

For example, take the function strftime(), a C function for formatting dates (taking locale into account). Python implements this via a method on date, time, and datetime objects.

Our initial, un-localized code:

>>> dt = datetime.datetime.now()
>>> print dt.strftime("%c")
Fri Apr 26 20:25:28 2013

The “%c” format specifier formats the datetime according to the active locale and returns a byte string in the locale’s encoding (utf-8 for the locales in question here). The following code will break, then, under any non-ascii locale:

logger.info(u"Timestamp: %s", dt.strftime("%c"))

When the datetime string is formatted into the unicode string, it is coerced to unicode by Python, using the ascii codec. As we’ve seen, this is a recipe for frustrating bugs.

Again, the solution is to specify the proper encoding:

logger.info(u"Timestamp: %s", unicode(dt.strftime("%c"), "utf-8"))

Complicating the issue, Python does not pick up the system locale at startup; it runs in the default ‘C’ locale until the program explicitly calls locale.setlocale(). The result is that code that works at the interactive prompt may not work in a program that sets the locale:

>>> import datetime
>>> dt = datetime.datetime.now()
>>> print unicode(dt.strftime("%c"))
Fri Apr 26 20:51:00 2013
>>>
>>> import locale
>>> locale.setlocale(locale.LC_ALL, locale.getdefaultlocale())
'ja_JP.UTF-8'
>>> print unicode(dt.strftime("%c"))
Traceback (most recent call last):
  File "", line 1, in
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 4: ordinal not in range(128)

The lesson to be learned: if you care about localization, learn where your code is affected by locale settings, and test against *non-ascii* locales (no, “en_US.utf-8” doesn’t count). Python’s locale module makes this relatively straightforward.