Text files use an “encoding,” i.e., a system that prescribes how characters in the files are supposed to be interpreted. This encoding may differ between operating systems. You can see the preferred encoding that your system uses with a call to sys.getfilesystemencoding().

from sys import getfilesystemencoding

print( getfilesystemencoding() )

If you read a text file which uses a different encoding than your file system prefers, you may get a UnicodeDecodeError. Whether or not you get this error for a particular file, is related to your operating system. An annoying consequence of that is that when you port Python code that reads a file to another system, a file that could be read by your code previously may cause your code to crash after the port.11

An easy way to get around this problem is by adding an extra parameter when opening a file, which indicates the encoding mechanism that you want to use when reading the file. You do this by adding a parameter encoding=<encodingname>, where <encodingname> is a string that can have a variety of values, for which some typical ones are:

Typically, text files are created with ascii or latin-1 encoding. Since ascii is incorporated in latin-1, you can safely open any text file by specifying latin-1 as encoding. It is possible that for the characters beyond the ascii range, you get different characters than the person who created the file wanted you to see – that depends on the encoding mechanism that your file system uses. But at least the UnicodeDecodeError is avoided. So, when you try to read the contents of a file and get a UnicodeDecodeError, you may try to open it using open( <file>, encoding="latin-1" ). Usually that will solve the problem.

Note that while utf-8 supports a much wider range of characters than latin-1, you may still get the UnicodeDecodeError when you read a text file that uses latin-1 encoding on a system that uses utf-8 encoding, as utf-8 has no corresponding characters with values in the (hexadecimal) range 80-FF.

If you want to see which special characters are supported with values in the range 80-FF on your system, run the code below. The numerical value of a character in the table can be derived by calculating \(16*row+col\), whereby row and col are the hexadecimal row and column number, respectively. I do not display the characters in the range 80-9F, as these are normally not filled in.

for i in range(16):
    if i < 10:
        print( ' '+chr( ord( '0' )+i ), end='' )
    else:
        print( ' '+chr( ord( 'A' )+i-10 ), end='' )
print()
for i in range( 10, 16 ):
    print( chr( ord( 'A' )+i-10 ), end='' )
    for j in range( 16 ):
        c = i*16+j
        print( ' '+chr( c ), end='' )
    print()

More details on UTF-8 encoding will be given in Chapter 20, but for dealing with text files, the information above suffices.

  1. I have to make note of some Python behavior that seems bizarre when you first encounter it: you may get this error when your file contains characters in an encoding that is not supported by your system in lines that you are not even trying to read! E.g., suppose that there is such an erroneous character on line 10 of your file, but you are only trying to read the first 5 lines before closing the file again – your program may still crash! I suspect that this is related to the buffering of data: rather than reading exactly what you ask Python to read, Python reads data in bigger chunks, so that the program is faster when you actually want to go through the whole file. So, by trying to be smart, Python may saddle you up with problems that you did not expect could arise. It is good to be aware of such issues. 2