You can convert a Unicode string to ASCII using the string's encode method, which returns a bytes object.
mytext = "Klüft électoral große"
myresult = mytext.encode('ascii', 'ignore')
print(myresult)
All characters outside the ASCII range are dropped:
b'Klft lectoral groe'
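The b prefix in the output shows that encode returns bytes, not a string. If you need a regular string again, you can decode the bytes. The following is a minimal sketch; the variable names ascii_bytes and ascii_text are only illustrative.

```python
mytext = "Klüft électoral große"

# encode returns a bytes object
ascii_bytes = mytext.encode('ascii', 'ignore')

# decode turns the bytes back into a str
ascii_text = ascii_bytes.decode('ascii')

print(type(ascii_bytes).__name__)
print(ascii_text)
```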
The second argument to encode, errors, controls what happens to characters that cannot be encoded. Here, 'ignore' drops them silently.
Other handlers are available, for example 'replace'. With it, Python substitutes a question mark for each character it cannot encode instead of removing it, so the result has the same number of characters as the input string.
The new code looks like this:
mytext = "Klüft électoral große"
myresult = mytext.encode('ascii', 'replace')
print(myresult)
And this is the result.
b'Kl?ft ?lectoral gro?e'
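Besides 'ignore' and 'replace', Python ships a few more error handlers. The sketch below tries three of them on the same string: the default 'strict', which raises an exception, and 'backslashreplace' and 'xmlcharrefreplace', which keep the information about the original character in an escaped form. The variable names are only illustrative.

```python
mytext = "Klüft électoral große"

# With no second argument, the handler is 'strict':
# the first non-ASCII character raises UnicodeEncodeError.
try:
    mytext.encode('ascii')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Replace each non-ASCII character with a Python-style escape sequence.
backslashed = mytext.encode('ascii', 'backslashreplace')

# Replace each non-ASCII character with an XML numeric character reference.
xml_refs = mytext.encode('ascii', 'xmlcharrefreplace')

print(strict_failed)
print(backslashed)
print(xml_refs)
```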
Normalization forms
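There is also an option to convert characters to their closest ASCII equivalent.
For this purpose, we are going to use the normalize function from the unicodedata module. It accepts four normalization forms (NFC, NFD, NFKC, and NFKD), but for this demonstration, I’m going to use only one: NFKD. It splits each accented letter into a base letter and a separate combining mark, so encoding with 'ignore' drops only the mark and keeps the letter.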
This is what the code looks like:
import unicodedata

mytext = "Klüft électoral große"
myresult = unicodedata.normalize('NFKD', mytext).encode('ascii', 'ignore')
print(myresult)
Here’s the result:
b'Kluft electoral groe'
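To see what normalize does to a single character, the short sketch below inspects the NFKD form of ü. The character is written as an escape (\u00fc) so the example does not depend on how your editor stores the letter; the variable name decomposed is only illustrative.

```python
import unicodedata

# U+00FC is the precomposed letter ü
decomposed = unicodedata.normalize('NFKD', '\u00fc')

# Print each resulting code point with its official Unicode name
for ch in decomposed:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# The first code point is plain ASCII, the second is a combining mark
print(decomposed[0].isascii(), unicodedata.combining(decomposed[1]) != 0)
```

The single character becomes two code points: an ASCII u followed by a combining diaeresis. Only the second one is removed by the 'ignore' handler.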
Convert ß to ss
In this case, the sharp S (ß) was not converted to “ss” but dropped, because it has no decomposition into ASCII letters. We can quickly fix that by calling replace on the string before passing it to normalize.
mytext = "Klüft électoral große".replace('ß', 'ss')
Now, when you run the code, the sharp S is not lost.
b'Kluft electoral grosse'
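If you need this conversion in more than one place, the steps can be wrapped in a small helper. This is only a sketch: the function name to_ascii is made up for this example, and it handles only the lowercase ß explicitly.

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Return an ASCII-only approximation of text (hypothetical helper)."""
    # ß has no decomposition, so replace it by hand first
    text = text.replace('ß', 'ss')
    # Split accented letters into base letter + combining mark
    decomposed = unicodedata.normalize('NFKD', text)
    # Drop everything that is still not ASCII, then return a str
    return decomposed.encode('ascii', 'ignore').decode('ascii')


print(to_ascii("Klüft électoral große"))
```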
ASCII and UTF-8
Instead of ASCII, you can also use UTF-8 encoding. UTF-8 can represent every character in the string, so nothing is dropped or replaced.
mytext = "Klüft électoral große"
myresult = mytext.encode('utf-8')
print(myresult)
This is what the result looks like:
b'Kl\xc3\xbcft \xc3\xa9lectoral gro\xc3\x9fe'
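Each non-ASCII character in the output above takes two bytes (for example, \xc3\xbc for ü), so the byte string is longer than the original text. The sketch below compares the lengths and decodes the bytes back to check that nothing was lost; the variable names are only illustrative.

```python
mytext = "Klüft électoral große"

utf8_bytes = mytext.encode('utf-8')

# Decoding with the same encoding restores the original string
restored = utf8_bytes.decode('utf-8')

# Number of characters vs. number of bytes
print(len(mytext), len(utf8_bytes))
print(restored == mytext)
```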