Convert CSV to UTF-8 in Python

This article shows how to convert a CSV file to UTF-8 encoding. UTF-8 can represent all 1,112,064 valid Unicode code points, using one to four bytes per character. This means UTF-8 can encode every character defined in Unicode, which covers virtually all writing systems in use today.
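
As a quick illustration of the variable-width scheme, the sketch below encodes four sample characters and records how many bytes each one takes. The characters are chosen here purely for illustration; they do not come from any file used in this article.

```python
# Four sample characters, one from each UTF-8 length class.
samples = [
    "A",           # U+0041, basic Latin letter
    "\u00e9",      # U+00E9, Latin small letter e with acute
    "\u20ac",      # U+20AC, euro sign
    "\U0001F600",  # U+1F600, an emoji outside the Basic Multilingual Plane
]

# Map each character to the number of bytes it occupies in UTF-8.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in byte_lengths.items():
    print(f"U+{ord(ch):04X} takes {n} byte(s) in UTF-8")
```

The four characters take 1, 2, 3, and 4 bytes respectively.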

UTF-8 is the default encoding on most Linux distributions and on macOS. Windows has traditionally defaulted to legacy code pages such as cp1252, but all three operating systems can read and store data in UTF-8.

This article is for you if you need to convert a CSV file encoded in a different format into UTF-8.

Checking the Encoding Used in a CSV File

You can detect the encoding of a file with the chardet package, which can be installed with pip by running the following command:

pip install chardet

The following code shows how to check the encoding used when writing a CSV.
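
A minimal sketch of such a check is given here. It is an illustration under stated assumptions rather than the article's original listing: the helper name detect_encoding, the sample_size parameter, and the sample rows are hypothetical, and the demo writes its own small UTF-16 file to a temporary directory as a stand-in for employees2.csv.

```python
import os
import tempfile

import chardet  # third-party package: pip install chardet


def detect_encoding(path, sample_size=100_000):
    """Return chardet's best guess for the encoding of the file at `path`."""
    # chardet works on raw bytes, so the file is opened in binary mode.
    with open(path, "rb") as f:
        raw = f.read(sample_size)
    # The result is a dict that includes 'encoding' and 'confidence' keys.
    return chardet.detect(raw)


# Demo: create a small UTF-16 CSV (hypothetical stand-in for employees2.csv).
sample_path = os.path.join(tempfile.mkdtemp(), "employees2.csv")
with open(sample_path, "w", encoding="utf-16", newline="") as f:
    f.write("name,city\nZo\u00eb,Krak\u00f3w\n")

guess = detect_encoding(sample_path)
print(guess["encoding"], guess["confidence"])
```

Note that chardet returns a guess together with a confidence score, not a guarantee. For short files, or files without a byte order mark, it is worth checking the result by actually decoding the file with the reported encoding.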

The input file used in this example, employees2.csv, is UTF-16 encoded.

Converting CSV into UTF-8

This section discusses three methods to convert a CSV file to UTF-8 encoding.

Method 1: Using csv and codecs packages

This method works in two steps: read the input CSV with its actual encoding, then use codecs to write the rows out as UTF-8.
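
The two steps can be sketched as follows. This is one possible implementation, not necessarily the original listing: the function name convert_csv_to_utf8, its parameters, the output file name, and the sample rows are all assumptions made for illustration. The errors parameter defaults to "strict" so that problems surface as exceptions; pass "ignore" to skip them instead.

```python
import codecs
import csv
import os
import tempfile


def convert_csv_to_utf8(src_path, dst_path, src_encoding, errors="strict"):
    """Copy the CSV at src_path to dst_path row by row, writing UTF-8."""
    # codecs.open decodes the input and encodes the output transparently.
    with codecs.open(src_path, "r", encoding=src_encoding, errors=errors) as src, \
         codecs.open(dst_path, "w", encoding="utf-8", errors=errors) as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            writer.writerow(row)


# Demo: build a small UTF-16 input file (hypothetical sample data).
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "employees2.csv")
dst_path = os.path.join(workdir, "employees2_utf8.csv")
with open(src_path, "w", encoding="utf-16", newline="") as f:
    f.write("name,city\nZo\u00eb,Krak\u00f3w\n")

convert_csv_to_utf8(src_path, dst_path, src_encoding="utf-16")

# Read the converted file back as UTF-8 to confirm the rows survived.
with open(dst_path, "r", encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```

On recent Python versions the built-in open() accepts the same encoding and errors arguments, so it can be used in place of codecs.open() with the same effect.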

Note: the example for this method passes the errors="ignore" argument to the codecs.open() function. With this setting, any data that cannot be decoded or encoded during the conversion is skipped instead of raising an error. This guarantees that the conversion completes, but the skipped data is silently lost.

Method 2: Using pandas

Like Method 1, this method works in two steps: read the input CSV, then write the data to a second CSV file encoded as UTF-8.
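
A minimal sketch of the pandas approach is shown here. The file names and sample rows are hypothetical, and the input encoding ("utf-16") is assumed to be known in advance, for example from a chardet check.

```python
import os
import tempfile

import pandas as pd  # third-party package: pip install pandas

# Demo input: a small UTF-16 CSV (hypothetical sample data).
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "employees2.csv")
dst_path = os.path.join(workdir, "employees2_utf8.csv")
with open(src_path, "w", encoding="utf-16", newline="") as f:
    f.write("name,city\nZo\u00eb,Krak\u00f3w\n")

# Step 1: read the input, stating its actual encoding explicitly.
df = pd.read_csv(src_path, encoding="utf-16")

# Step 2: write the data back out as UTF-8.
# index=False keeps the DataFrame's row index out of the output file.
df.to_csv(dst_path, encoding="utf-8", index=False)

# Read the result back as UTF-8 to confirm the round trip.
converted = pd.read_csv(dst_path, encoding="utf-8")
print(converted)
```

Keep in mind that pandas parses the file into typed columns, so values such as numbers or dates may be reformatted on the way out. When a byte-for-byte faithful copy of the text matters, the other two methods are the safer choice.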

As in Method 1, we passed the errors="ignore" argument, this time to pd.DataFrame.to_csv(), to skip encoding errors.

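Note that pandas.read_csv() assumes UTF-8 by default, so it fails (typically with a UnicodeDecodeError) when the input file uses a different encoding and the correct one is not passed through the encoding argument. You can find the encoding with the chardet check described at the start of the article.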

Method 3: Using codecs and shutil modules

In this method, we use codecs to open the input and output files with the required encodings and shutil to copy the contents of the input into the output.
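
The idea can be sketched as follows. The file names and sample data are hypothetical, and the input is assumed to be UTF-16.

```python
import codecs
import os
import shutil
import tempfile

# Demo input: a small UTF-16 CSV (hypothetical sample data).
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "employees2.csv")
dst_path = os.path.join(workdir, "employees2_utf8.csv")
text = "name,city\nZo\u00eb,Krak\u00f3w\n"
with open(src_path, "w", encoding="utf-16", newline="") as f:
    f.write(text)

# Open the source with its actual encoding and the target as UTF-8.
# shutil.copyfileobj then streams decoded text from one to the other in chunks,
# so the whole file never has to fit in memory at once.
with codecs.open(src_path, "r", encoding="utf-16") as src, \
     codecs.open(dst_path, "w", encoding="utf-8") as dst:
    shutil.copyfileobj(src, dst)

# The output should now hold the same text, encoded as UTF-8.
with open(dst_path, "rb") as f:
    converted_bytes = f.read()
print(converted_bytes.decode("utf-8"))
```

Because this approach never parses the CSV structure, it works for any text file and leaves delimiters, quoting, and line endings exactly as they were.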

Conclusion

This post discussed three methods for converting a CSV file to UTF-8: the first uses the csv and codecs packages, the second uses pandas, and the last uses codecs and shutil.

All three methods accept the errors="ignore" argument if you want to skip encoding errors, that is, drop any data that cannot be converted. Use it with care, because the dropped data is lost without any warning.