The pandas and csv libraries are popular choices for handling CSV files. The csv module ships with Python, whereas pandas must be installed separately before you can use it.
When it comes to data manipulation and analysis, pandas reigns supreme because it offers many functions and attributes for such tasks. This article focuses on how these two libraries handle encoding when reading CSV files.
When reading CSV files (using pandas or csv), the following processes are conducted: decoding, parsing, data conversions (optional), and data fetching.
Decoding: To read a file, the library must first convert a sequence of bytes into characters from a particular charset. This step can be challenging because the library might not know the file’s encoding. If it does not recognize the encoding name, or if it runs into byte sequences that are invalid in that encoding, it raises an exception.
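The decoding step can be illustrated without any CSV file at all. The following is a minimal sketch using only built-in string and bytes methods; the sample word is taken from this article's data, and the variable names are made up for the illustration.

```python
# Encoding turns text into bytes; this is what ends up on disk.
raw = "Straße".encode("utf-8")
print(raw)  # b'Stra\xc3\x9fe'

# Decoding with the matching charset recovers the original text.
text = raw.decode("utf-8")
print(text == "Straße")  # True

# Decoding with a charset that cannot represent these bytes fails.
failed_at = None
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    # exc.start is the offset of the first byte that could not be decoded.
    failed_at = exc.start
    print(f"{exc.encoding} failed at byte offset {exc.start}: {exc.reason}")
```

The exception carries the codec name, the offending offset, and the reason, which is the same information that appears in the error messages raised by pandas and open().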
With Python 3 and modern operating systems getting better at handling encodings, decoding mostly happens seamlessly without us having to specify the encoding explicitly when loading CSV files. However, encoding still matters when we want to filter out unwanted characters from a CSV file or, in some cases, get the data in the form we need.
We will save the following simple data into a UTF-8 encoded CSV file named “streets10.csv” and use it for our examples, considering two encodings: ASCII and UTF-8. ASCII is a 7-bit encoding that covers only 128 characters, which is enough for basic English text, whereas UTF-8 can encode every Unicode character.
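One way to create the sample file is sketched below. The rows are reconstructed from the reader output shown later in this article, and the use of csv.writer is an assumption about how the file might be produced, not necessarily how the original file was made.

```python
import csv

# Rows reconstructed from the reader output shown later in this article.
rows = [
    ["Name", "Streets"],
    ["Bob", "NazarethkirtchStraße"],
    ["Alex", "St Äbràhâm"],
]

# Write the file explicitly as UTF-8; newline="" lets the csv module
# control line endings itself.
with open("streets10.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)
```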
In the above data, the following characters are non-ASCII characters: ß, Ä, à, and â. Any attempt to read the CSV file with ASCII encoding will result in decoding errors because of these characters.
pandas.read_csv(filepath_or_buffer, encoding=None, encoding_errors='strict', …)
Note: the encoding_errors argument was added in pandas v1.3.0, so you may need to upgrade pandas to use it. I am running pandas v1.4.3 in these examples.
pandas supports encodings out of the box. By default, it decodes the input file as UTF-8, which is sufficient in most cases. The argument encoding_errors defaults to “strict”, so Python raises UnicodeDecodeError if malformed data is encountered. Other options for encoding_errors include “ignore” and “replace”.
import pandas as pd

df = pd.read_csv("streets10.csv")
That reads our CSV file with no problem, but if we specify ASCII encoding, a UnicodeDecodeError is raised because of the non-ASCII characters in the file.
import pandas as pd

df = pd.read_csv("streets10.csv", encoding="ascii")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
If we want to replace or ignore non-ASCII characters, we can issue the encoding_errors argument accordingly. For instance, in the following example, malformed characters are replaced.
import pandas as pd

df = pd.read_csv("streets10.csv", encoding="ascii", encoding_errors="replace")
print(df)
Output (formatted for better viewing):
pandas dropped Python 2 support in January 2019. The last release series to support Python 2, v0.24.x (ending with v0.24.2), does not accept the encoding_errors argument.
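The values accepted by encoding_errors are the names of Python's standard codec error handlers, so their effect can be previewed on raw bytes without pandas. The snippet below is a small sketch along those lines; it uses bytes.decode rather than read_csv, and the sample string is one field from this article's data.

```python
# UTF-8 bytes for one field of the sample file.
raw = "St Äbràhâm".encode("utf-8")

# "replace" swaps every undecodable byte for U+FFFD (the replacement character).
replaced = raw.decode("ascii", errors="replace")

# "ignore" silently drops every undecodable byte.
ignored = raw.decode("ascii", errors="ignore")

print(ascii(replaced))  # 'St \ufffd\ufffdbr\ufffd\ufffdh\ufffd\ufffdm'
print(ignored)          # St brhm
```

Each accented letter occupies two bytes in UTF-8, which is why "replace" produces two replacement characters per letter here.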
The Python built-in library csv has a reader() function with the following general notation (see https://docs.python.org/3/library/csv.html).
csv.reader(source, dialect='excel', **fmtparams)
- source is any iterable that yields one string per iteration, that is, an object whose iterator returns a string each time __next__() is called. This matters because a list or tuple of strings can serve as the CSV source just as well as a file object.
- dialect is an optional parameter that can be either an instance of a Dialect subclass or the name of a registered dialect (one of the strings returned by csv.list_dialects()).
- fmtparams is a set of optional keyword arguments of the form <param>=<value>, where <param> stands for one of the formatting parameters and <value> for its value. When a dialect and some formatting parameters are supplied together, the formatting parameters take precedence over the corresponding dialect attributes.
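The three parameters described above can be seen together in a short sketch. The input lines here are hypothetical and exist only to show a non-file source and formatting parameters overriding the "excel" dialect.

```python
import csv

# A plain list of strings works as the source; no file is involved.
# (Hypothetical data, semicolon-delimited with single-quote quoting.)
lines = [
    "Name;Streets",
    "Bob;'Main St; Apt 4'",
]

# The "excel" dialect uses "," and '"', but the explicit formatting
# parameters below take precedence over those dialect attributes.
reader = csv.reader(lines, dialect="excel", delimiter=";", quotechar="'")
rows = list(reader)
print(rows)  # [['Name', 'Streets'], ['Bob', 'Main St; Apt 4']]
```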
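Note that csv.reader() itself has no encoding argument; decoding is done by open(), which creates the file object passed to the reader. If encoding is not passed to open() in text mode, the encoding that is used depends on the platform: it is the current locale encoding, obtained by calling locale.getpreferredencoding(False).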
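To check which encoding your own platform would fall back to, the call mentioned above can be run directly. This is a minimal sketch; the printed value varies from system to system, so no expected output is shown.

```python
import locale

# The encoding that text-mode open() falls back to when none is given.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)
```

Because this value differs between machines, passing encoding explicitly to open() is the safer choice when a file's encoding is known.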
In Python 3, the encoding for the csv reader is set through the encoding argument of the open() function. Here is an example:
import csv

with open("streets10.csv", newline="", encoding="utf8") as f:
    reader = csv.reader(f, dialect="excel")
    for row in reader:
        print(row)

['Name', 'Streets']
['Bob', 'NazarethkirtchStraße']
['Alex', 'St Äbràhâm']
As in pandas, open() provides ways of handling encoding errors through the “errors” argument. It gives the same options as pandas: strict, replace, and ignore, among others. By default (strict), Python raises UnicodeDecodeError when it meets bytes that are invalid in the given encoding. In the following example, we ignore bytes that cannot be decoded with the encoding provided.
import csv

with open("streets10.csv", newline="", encoding="ascii", errors="ignore") as f:
    reader = csv.reader(f, dialect="excel")
    for row in reader:
        print(row)

['Name', 'Streets']
['Bob', 'NazarethkirtchStrae']
['Alex', 'St brhm']
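Among the other handlers that open() accepts, "backslashreplace" may be useful when the undecodable bytes should stay visible instead of being dropped. The sketch below writes its own small temporary file (a hypothetical two-line sample based on this article's data) so that it runs on its own.

```python
import csv
import os
import tempfile

# Write a small UTF-8 file to a temporary location.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline="", encoding="utf-8"
) as tmp:
    tmp.write("Name,Streets\r\nBob,NazarethkirtchStraße\r\n")
    path = tmp.name

# Read it back as ASCII; undecodable bytes become \xNN escape text.
with open(path, newline="", encoding="ascii", errors="backslashreplace") as f:
    rows = list(csv.reader(f))

os.remove(path)  # clean up the temporary file
print(rows[1])   # ['Bob', 'NazarethkirtchStra\\xc3\\x9fe']
```

Unlike "ignore", this keeps a record of exactly which bytes failed to decode, which can help when tracking down a file's real encoding.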
Python 2 does not provide an encoding argument in the open() function. For this reason, we have to read the data first, as bytes, then decode the characters. Here is an example of what the above code will look like in Python 2.
import csv

# Python 2: open the file in binary mode, then decode each field manually.
with open("streets10.csv", "rb") as csvfile:
    csvreader = csv.reader(csvfile, delimiter=",")
    for row in csvreader:
        row = [entry.decode("utf-8") for entry in row]
        print(": ".join(row))

Name: Streets
Bob: NazarethkirtchStraße
Alex: St Äbràhâm
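Alternatively, in Python 2, you can use the codecs module to open the file with a given encoding, as shown below. Note that this example prints the decoded lines as they are, without splitting them into fields.
import codecs

# codecs.open decodes the file as UTF-8 while reading.
reader = codecs.open("streets10.csv", "r", encoding="utf8")
for line in reader:
    print(line)

Name,Streets
Bob,NazarethkirtchStraße
Alex,St Äbràhâm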