Python Locale Heisenbugs

“Heisenbug” refers to a bug that only occurs under very specific circumstances. As I’ve been discovering this week, localization creates quite a few of these.

Localization is a tough problem to crack, and no programming language really gets it right. With the nearly infinite combinations of character encodings and regional formatting quirks, a complete solution is all but impossible, and as a result every language has a few corners that misbehave under certain locales.

Consider the following code:

try:
    with open(filename) as f:
        return f.read()
except IOError, ioerr:
    logger.error("Failed to open file: %s" % str(ioerr))

This has a few issues on a non-ascii system:
– it will not cope properly with the encoding of the file being opened (see the sketch after this list).
– it will not cope with non-ascii bytes in the error message from the system.
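
A quick way to see the first issue: plain open() hands back undecoded bytes, so nothing fails at read time, and the damage only surfaces later, when those bytes meet a unicode string. A minimal sketch (the file name and contents are made up for illustration):

# Issue one in isolation (Python 2): open() returns a byte string, not unicode.
# Assume notes.txt contains the utf-8 bytes for "café".
data = open("notes.txt").read()    # 'caf\xc3\xa9\n' -- raw bytes, no error yet
print u"Contents: " + data         # implicit ascii decode -> UnicodeDecodeError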

Here’s a reasonable-looking attempt at fixing it:

try:
    with codecs.open(filename, encoding="utf-8") as f:
        return f.read()
except IOError, ioerr:
    logger.error(u"Failed to open file: %s" % unicode(ioerr))

In reality, if an IOError is ever raised with a message containing non-ascii bytes, the logging call fails, the exception propagates up the stack, and your program spews a traceback on stderr. That traceback, assuming it’s visible at all (and not redirected to /dev/null or similar), mentions only the UnicodeDecodeError, not the original error opening the file.

What’s breaking? IOError.__unicode__ is implemented in a way that simply calls IOError.__str__ and attempts to convert the resulting string to unicode, using the ascii codec. So if the error message returned from the system is non-ascii, unicode() fails to convert it. Furthermore, unicode(), when called on an object, does not allow specifying a particular encoding.
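
To see this in isolation, here is a small sketch; the byte string below is a stand-in for the kind of non-ascii error message the OS might hand back under, say, a French locale:

# unicode() on an exception decodes str(exc) with the ascii codec (Python 2).
err = IOError(2, "Aucun fichier ou dossier de ce type: '\xc3\xa9.txt'")

print str(err)          # fine: "[Errno 2] Aucun fichier ..." as raw bytes
try:
    print unicode(err)  # attempts an ascii decode of those bytes...
except UnicodeDecodeError, e:
    print "boom:", e    # ...and fails on the first non-ascii byte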

So how do we fix this?

The intermediate fix, relying on the fact that modern Linux systems typically use utf-8 locales and therefore hand back utf-8-encoded error messages:

try:
    with codecs.open(filename, encoding="utf-8") as f:
        return f.read()
except IOError, ioerr:
    logger.error(u"Failed to open file: %s" % unicode(str(ioerr), 'utf-8'))

This works! …at least, until you try to run the code on a Windows machine. The standard encoding on Windows is not utf-8 (it’s typically a legacy code page such as cp1252), and so we yet again get a UnicodeDecodeError.

The final, fixed code:

try:
    with codecs.open(filename, encoding="utf-8") as f:
        return f.read()
except IOError, ioerr:
    logger.error(u"Failed to open file: %s" % unicode(str(ioerr), sys.stderr.encoding))

Our best guess for the encoding of system error messages is the encoding of stderr. That isn’t guaranteed to be correct, but it works across Linux and Windows, so we’ll call it good enough.
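
One wrinkle worth noting: sys.stderr.encoding can be None when stderr is redirected to a file or a pipe. A small hypothetical helper (not part of the code above) that captures the same guess, with a fallback to locale.getpreferredencoding(), might look like this:

import locale
import sys

def exception_to_unicode(exc):
    # Guess the encoding of system error messages: stderr's encoding if it is
    # known, otherwise the locale's preferred encoding. Never raise on bad bytes.
    encoding = getattr(sys.stderr, "encoding", None) or locale.getpreferredencoding()
    return unicode(str(exc), encoding, "replace")

With a helper like that, the except block above could simply log u"Failed to open file: %s" % exception_to_unicode(ioerr) without risking a secondary UnicodeDecodeError.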

But wait, there’s more!

We’ve seen that messages from the system are susceptible to encoding issues, but there are other places where locale plays a critical but hidden role.

For example, take strftime(), the C function for formatting dates and times according to the current locale. Python exposes it as a method on date, time, and datetime objects.

Our initial, un-localized code:

>>> dt = datetime.datetime.now()
>>> print dt.strftime("%c")
Fri Apr 26 20:25:28 2013

The “%c” format specification formats the datetime according to the active locale, and returns a byte string in that locale’s encoding (utf-8, on a typical modern Linux system). The following code will break, then, under any non-ascii locale:

logger.info(u"Timestamp: %s", dt.strftime("%c"))

When the datetime string is formatted into the unicode string, it is coerced to unicode by Python, using the ascii codec. As we’ve seen, this is a recipe for frustrating bugs.
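
The same coercion can be seen in isolation at the prompt; the hard-coded bytes below are an assumption, standing in for the output strftime would produce under a Japanese utf-8 locale (they are the utf-8 encoding of “26日”):

>>> u"Timestamp: %s" % "26\xe6\x97\xa5"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 2: ordinal not in range(128)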

Again, the solution is to specify the proper encoding:

logger.info(u"Timestamp: %s", unicode(dt.strftime("%c"), "utf-8"))

Complicating the issue, the Python interpreter does not start up under the system locale; it runs with the default ‘C’ locale until something calls locale.setlocale(). The result is that code that works at the interactive prompt may not work in a program that sets the locale:

>>> import datetime
>>> dt = datetime.datetime.now()
>>> print unicode(dt.strftime("%c"))
Fri Apr 26 20:51:00 2013
>>>
>>> import locale
>>> locale.setlocale(locale.LC_ALL, locale.getdefaultlocale())
'ja_JP.UTF-8'
>>> print unicode(dt.strftime("%c"))
Traceback (most recent call last):
  File "", line 1, in
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 4: ordinal not in range(128)

The lesson to be learned: If you care about localization, learn where your code is impacted by locale settings, and test against *non-ascii* locales. (No, “en_US.utf-8” doesn’t count). Python’s locale module makes this relatively straightforward.
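
For example, a test can temporarily force a non-ascii locale around the code under test. This sketch reuses the strftime example from above and assumes the ja_JP.UTF-8 locale is installed on the machine running it:

import datetime
import locale

# Force a non-ascii locale for the duration of the test, then restore 'C'.
locale.setlocale(locale.LC_ALL, "ja_JP.UTF-8")  # raises locale.Error if missing
try:
    # Exercise the locale-sensitive code under test, e.g. the strftime handling above.
    stamp = unicode(datetime.datetime.now().strftime("%c"), "utf-8")
    print repr(stamp)  # repr() keeps the output ascii-safe regardless of terminal
finally:
    locale.setlocale(locale.LC_ALL, "C")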
