Why should we NOT use sys.setdefaultencoding("utf-8") in a py script?

Question

tl;dr

The answer is NEVER! (unless you really know what you’re doing)

9 times out of 10, the problem can be resolved with a proper understanding of encoding and decoding.

The remaining 1 time in 10, the user has an incorrectly defined locale or environment and needs to set:

PYTHONIOENCODING="UTF-8"

in their environment to fix console printing problems.

What does it do?

~~sys.setdefaultencoding("utf-8")~~ (struck through to avoid re-use) changes the default encoding used whenever Python 2.x needs to convert a unicode to a str (or vice versa) and no encoding is given. For example:

str(u"\u20AC")
unicode("€")
"{}".format(u"\u20AC")

In Python 2.x, the default encoding is set to ASCII and the above examples will fail with:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

(My console is configured as UTF-8, so "€" = '\xe2\x82\xac', hence exception on \xe2)

or

UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 0: ordinal not in range(128)

~~sys.setdefaultencoding("utf-8")~~ will allow these to work for me, but won’t necessarily work for people who don’t use UTF-8. The default of ASCII ensures that assumptions about encoding are not baked into code.
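One way to avoid relying on the default encoding at all is to name the encoding at every conversion. The following is a minimal sketch of that idea, not code from the original post; the helper names to_text and to_bytes are hypothetical, and the sketch is written so that it should behave the same on Python 2.7 and Python 3.

```python
# Hypothetical helpers (not from the original post): always name the encoding
# instead of letting Python fall back to the interpreter-wide default.

def to_text(data, encoding):
    """Decode bytes to text with an explicit encoding; pass text through."""
    if isinstance(data, bytes):  # on Python 2, bytes is an alias for str
        return data.decode(encoding)
    return data


def to_bytes(text, encoding):
    """Encode text to bytes with an explicit encoding; pass bytes through."""
    if isinstance(text, bytes):
        return text
    return text.encode(encoding)


# The euro sign as it arrives from a UTF-8 source (for example a UTF-8 console).
euro_bytes = b"\xe2\x82\xac"

# Decode once at the input boundary, work with text internally...
euro_text = to_text(euro_bytes, "utf-8")
message = u"{} is the euro sign".format(euro_text)

# ...and encode once at the output boundary.
message_bytes = to_bytes(message, "utf-8")
```

Because the encoding is stated at each boundary, the result does not depend on what the default encoding happens to be.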

Console

~~sys.setdefaultencoding("utf-8")~~ also has a side effect of appearing to fix sys.stdout.encoding, used when printing characters to the console. Python uses the user’s locale (Linux/OS X/Un*x) or codepage (Windows) to set this. Occasionally, a user’s locale is broken and just requires PYTHONIOENCODING to fix the console encoding.

Example:

$ export LANG=en_GB.gibberish
$ python
>>> import sys
>>> sys.stdout.encoding
'ANSI_X3.4-1968'
>>> print u"\u20AC"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 0: ordinal not in range(128)
>>> exit()

$ PYTHONIOENCODING=UTF-8 python
>>> import sys
>>> sys.stdout.encoding
'UTF-8'
>>> print u"\u20AC"
€
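To diagnose a console problem like the one above without touching the default encoding, it can help to check whether the console's reported encoding can represent the text at all. This is only an illustrative sketch: console_can_print is a hypothetical helper, and treating a missing encoding as ASCII is an assumption that mirrors Python 2's behaviour when sys.stdout.encoding is None.

```python
import sys


def console_can_print(text, encoding):
    """Return True if `text` can be encoded with the given console encoding.

    Hypothetical helper. A missing encoding is treated as ASCII (assumption
    modelled on Python 2's fallback when stdout has no encoding).
    """
    try:
        text.encode(encoding or "ascii")
    except (UnicodeEncodeError, LookupError):
        return False
    return True


euro = u"\u20ac"

# A UTF-8 console can represent the euro sign, an ASCII one cannot.
ok_with_utf8 = console_can_print(euro, "UTF-8")
ok_with_ascii = console_can_print(euro, "ascii")

# Check the console this process is actually attached to.
ok_here = console_can_print(euro, getattr(sys.stdout, "encoding", None))
```

If the check fails for the real console, that points to the locale or PYTHONIOENCODING as the thing to fix, not the default encoding.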

What’s so bad with sys.setdefaultencoding("utf-8")?

People have been developing against Python 2.x for 16 years on the understanding that the default encoding is ASCII. UnicodeError exception handling methods have been written to handle string to Unicode conversions on strings that are found to contain non-ASCII.

From https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/

def welcome_message(byte_string):
    try:
        return u"%s runs your business" % byte_string
    except UnicodeError:
        return u"%s runs your business" % unicode(byte_string,
            encoding=detect_encoding(byte_string))

print(welcome_message(u"Angstrom (Å®)".encode("latin-1")))

Before defaultencoding is changed, this code cannot decode the “Å” with the ASCII codec, so it enters the exception handler, guesses the encoding, and correctly converts the bytes to unicode, printing: Angstrom (Å®) runs your business. Once defaultencoding has been set to utf-8, the code finds that byte_string can be interpreted as UTF-8, never reaches the handler, and returns mangled data instead: Angstrom (Ů) runs your business.
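The effect can be reproduced without changing any interpreter setting by making the implicit decode explicit. In this sketch the default_encoding parameter stands in for the interpreter's default encoding, and the fixed real_encoding argument stands in for detect_encoding(); both are assumptions made for illustration, and the code should run on Python 2.7 and Python 3.

```python
def welcome_message(byte_string, default_encoding, real_encoding="latin-1"):
    """Mimic the quoted function with the implicit decode made explicit.

    default_encoding simulates the interpreter default; real_encoding is a
    stand-in for the result of detect_encoding() (assumption).
    """
    try:
        text = byte_string.decode(default_encoding)
    except UnicodeError:
        # Only reached when the "default" codec rejects the bytes.
        text = byte_string.decode(real_encoding)
    return u"%s runs your business" % text


# "Angstrom (Å®)" encoded as latin-1: the last two characters are \xc5 \xae.
data = u"Angstrom (\u00c5\u00ae)".encode("latin-1")

# ASCII default: decoding fails, the handler runs, the text is correct.
ascii_result = welcome_message(data, "ascii")

# UTF-8 default: \xc5\xae happens to be a valid UTF-8 sequence (U+016E),
# so no exception is raised and the data is silently mangled.
utf8_result = welcome_message(data, "utf-8")
```

The two results differ only because of the simulated default, which is exactly the kind of silent behaviour change the quoted post warns about.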

Changing what should be a constant will have dramatic effects on modules you depend upon. It’s better to just fix the data coming in and out of your code.

Example problem

While setting defaultencoding to UTF-8 isn’t the root cause in the following example, it shows how problems are masked and how the code breaks in a non-obvious way when the input encoding changes:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 3131: invalid start byte
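An error of this shape appears whenever bytes in some other encoding are decoded as UTF-8. The sketch below reproduces the same "invalid start byte" message using cp1252 input, where the euro sign is the single byte 0x80. That the input is cp1252 is purely an assumption for illustration; the original example does not say which encoding its data was in.

```python
# Assumption for illustration: the input is cp1252, where the euro sign is
# the single byte 0x80. The real data's encoding is not known.
payload = u"\u20ac".encode("cp1252")

error_reason = None
try:
    # 0x80 is a continuation byte in UTF-8, so it cannot start a character.
    payload.decode("utf-8")
except UnicodeDecodeError as exc:
    error_reason = exc.reason

# Decoding with the encoding the data was actually written in succeeds.
decoded = payload.decode("cp1252")
```

The fix in a case like this is to decode the input with its real encoding at the point where it enters the program.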
