This article discusses removing special characters from a Python string using the re module. We will cover common cases: removing punctuation, whitespace, numbers, and the first letter of every word in the string, among others.
For all examples, we will use the re.sub(pattern, repl, string) function, which replaces every match of pattern in string with repl. In nearly every case, we will set repl="" (empty string) so that the replacement effectively removes the matches.
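As a quick illustration of the call shape, the sketch below removes digits from a made-up sample string. The sample text and variable names are assumptions for illustration; count is an optional re.sub parameter that limits how many matches are replaced.

```python
import re

sample = "a1b2c3"  # hypothetical sample text

# Replace every digit with the empty string, i.e. remove it.
all_removed = re.sub(r"\d", "", sample)

# The optional count argument limits the number of replacements.
first_removed = re.sub(r"\d", "", sample, count=1)

print(all_removed)    # abc
print(first_removed)  # ab2c3
```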
Let’s work on some examples.
Example 1: Removing the Last Character in the String
    import re

    string = "Chicago"
    # Remove the last character from a string
    result = re.sub(".$", "", string)
    print(result)
Output:
Chicag
The pattern: ".$"
Explanation: "$" matches at the end of the string (or just before a trailing newline), and "." matches any single character except a newline. Together, ".$" matches the last character of the string.
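A few edge cases are worth checking before relying on this pattern. The sketch below uses made-up inputs to show how ".$" behaves on an empty string and on a string ending in a newline, and compares it with plain slicing, an alternative that the article itself does not use. The helper name is hypothetical.

```python
import re

def remove_last_char(text):
    # Hypothetical helper: drop the final character using the ".$" pattern.
    return re.sub(r".$", "", text)

print(repr(remove_last_char("Chicago")))    # 'Chicag'
print(repr(remove_last_char("")))           # '' (nothing to match)
# "$" also matches just before a trailing newline, and "." never matches
# the newline itself, so the "o" is removed and the newline is kept.
print(repr(remove_last_char("Chicago\n")))  # 'Chicag\n'
# Slicing always drops the very last character, here the newline.
print(repr("Chicago\n"[:-1]))               # 'Chicago'
```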
Example 2: Remove the First or the Last Word in the String
    import re

    string = "This is a test string"
    result = re.sub(r"\b\w+$", "", string)
    print(result)
Output:
This is a test
The pattern: r"\b\w+$"
Explanation: As noted in Example 1, "$" matches the end of a string. \w+ matches one or more word characters (alphanumeric characters plus the underscore, _), and \b matches a word boundary. That means r"\b\w+$" matches a word boundary followed by a word of any length at the end of the string.
Note: If the string ends with a non-word character such as a period, the pattern above will not match and nothing is removed. For such cases, use r"\s+\S+$".
    import re

    string = "This is a test string."
    new_string = re.sub(r"\s+\S+$", "", string)
    print(new_string)
Output:
This is a test
The pattern r"\s+\S+$" matches one or more whitespace characters (\s+) followed by one or more non-whitespace characters (\S+) up to the end of the string ($).
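To see how the two patterns differ, the following sketch runs both on inputs with and without a trailing period and prints the results with repr() so that leftover spaces are visible. The constant names and sample strings are illustrative assumptions.

```python
import re

WORD_AT_END = r"\b\w+$"      # last run of word characters
CHUNK_AT_END = r"\s+\S+$"    # whitespace plus the last non-space chunk

for text in ("This is a test string", "This is a test string."):
    by_word = re.sub(WORD_AT_END, "", text)
    by_chunk = re.sub(CHUNK_AT_END, "", text)
    print(repr(by_word), repr(by_chunk))

# 'This is a test ' 'This is a test'
# 'This is a test string.' 'This is a test'
```

Note that r"\b\w+$" leaves the space before the removed word in place, while r"\s+\S+$" leaves a one-word string untouched because it requires whitespace before the last chunk.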
You can remove the first word in a string using the r"^\w+\s*" pattern, as shown below.
    # Remove the first word
    import re

    string = "This is a test string"
    result = re.sub(r"^\w+\s*", "", string)
    print(result)
Output:
is a test string
The "^" matches the beginning of the string, \w+ matches one or more word characters, and \s* matches zero or more whitespace characters. That means r"^\w+\s*" matches the first word and all whitespace that follows it. If you want to remove only the word and keep the whitespace, drop "\s*" from the pattern.
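The pattern assumes the string starts with a word character. As a hedged sketch, the variant below (an assumption, not part of the original examples) also tolerates leading whitespace and treats any run of non-whitespace characters as the first word. The helper name is hypothetical.

```python
import re

def remove_first_word(text):
    # Hypothetical helper: skip leading whitespace, then drop the first
    # run of non-whitespace characters and the whitespace after it.
    return re.sub(r"^\s*\S+\s*", "", text)

print(remove_first_word("This is a test string"))     # is a test string
print(remove_first_word("   This is a test string"))  # is a test string
print(remove_first_word("Hello, world"))              # world
# For comparison, the \w-based pattern leaves the comma behind:
print(re.sub(r"^\w+\s*", "", "Hello, world"))         # , world
```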
Example 3: Remove All or Specific Punctuation Marks
    # Remove all punctuation marks
    import re

    string = "T@h#is $is a t%est %st^rin&g*"
    result = re.sub(r"[^\w\s]", "", string)
    print(result)
Output:
This is a test string
The pattern: r"[^\w\s]"
Explanation: Square brackets [ ] define a set of characters, e.g., [abc] matches any one of the characters "a", "b", or "c". When "^" comes first inside the set, it negates the set, e.g., [^abc] matches any character except "a", "b", and "c".
That means r"[^\w\s]" matches any character that is neither a word character nor a whitespace character.
If you want to remove specific punctuation marks, list them inside the set. For example,
    # Remove specific punctuation marks
    import re

    string = "T@h#is $is a t%est %st^rin&g*"
    result = re.sub(r"[@$^%]", "", string)
    print(result)
Output:
Th#is is a test strin&g*
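One caveat: \w also matches the underscore, so r"[^\w\s]" keeps "_" in place. If underscores should go too, one option is to build the set from string.punctuation, which covers ASCII punctuation only. This is a sketch under that assumption, and the helper name is made up.

```python
import re
import string

# Character set containing every ASCII punctuation mark. re.escape()
# guards characters such as "]", "^", "-", and "\" inside the brackets.
PUNCTUATION = re.compile(f"[{re.escape(string.punctuation)}]")

def strip_punctuation(text):
    # Hypothetical helper name, used only for this illustration.
    return PUNCTUATION.sub("", text)

print(strip_punctuation("T@h#is $is a t%est %st^rin&g*"))  # This is a test string
print(strip_punctuation("snake_case!"))                    # snakecase
print(re.sub(r"[^\w\s]", "", "snake_case!"))               # snake_case
```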
Example 4: Remove a Character or Series of Characters
    # Remove a character
    import re

    string = "This is a test string"
    result = re.sub(r"i", "", string)
    print(result)

    # You can also remove a series of characters (a substring).
    string = "This is a test string"
    result = re.sub(r"is", "", string)
    print(result)

    # Remove multiple characters - "i" and "t", in this case.
    string = "This is a test string"
    result = re.sub(r"[it]", "", string)
    print(result)

    # Remove "is" or "in"
    string = "This is a test string"
    result = re.sub(r"is|in", "", string)
    print(result)
Output:
Ths s a test strng
Th  a test string
Ths s a es srng
Th  a test strg
Removing the whole word "is" leaves the spaces on both sides of it, hence the double space in the second and fourth lines. The set [it] is case-sensitive, so the capital "T" stays.
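The patterns in this example contain only letters, which have no special meaning in a regular expression. When the text to remove includes characters such as ".", "+", or "(", it needs escaping first. The sketch below, with a hypothetical helper name and made-up inputs, uses re.escape for that.

```python
import re

def remove_literal(text, literal):
    # Hypothetical helper: treat `literal` as plain text, not as a pattern.
    return re.sub(re.escape(literal), "", text)

print(remove_literal("1+1=2", "+"))          # 11=2
print(remove_literal("v1.0 and v1x0", "."))  # v10 and v1x0
# Without escaping, "." matches every character except a newline:
print(repr(re.sub(r".", "", "v1.0")))        # ''
```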
Example 5: Remove Numbers
This example discusses how to remove signed and unsigned numbers plus decimal numbers.
    # Removing unsigned numbers
    import re

    string = "Thi44s is a 56 tes5t stri99ng"
    string2 = "Thi44s is a 56 t+8es5t stri99n-7g"
    result = re.sub(r"\d+", "", string)
    # leaves the signs of signed numbers behind: +8 and -7
    result2 = re.sub(r"\d+", "", string2)
    print(result)
    print(result2)
Output:
This is a  test string
This is a  t+est strin-g
The "\d+" matches one or more consecutive digits in the string. As the output shows, this pattern only works for unsigned numbers, that is, numbers without a + or - sign. For signed numbers such as +8 and -7, the digits are removed but the sign is left behind. Let's fix that.
    import re

    string = "Thi+44s is a 56 tes-5t stri-99ng"
    string2 = "Thi+44s i-3.4s a 56 tes-5t stri-99n+5.8g"
    # matches signed and unsigned integers
    result = re.sub(r"\+?-?\d+", "", string)
    # leaves the decimal point of decimal numbers behind: -3.4 and +5.8
    result2 = re.sub(r"\+?-?\d+", "", string2)
    print(result)
    print(result2)
Output:
This is a  test string
This i.s a  test strin.g
As the example above shows, the pattern r"\+?-?\d+" now works for signed and unsigned integers but does not treat a decimal number as a single match, so the decimal point remains. Let us fix that as well.
    import re

    string = "Thi+44s i-3.6s a 56 tes-5t s50.2tri-99ng"
    result = re.sub(r"\+?-?\d+(\.\d+)?", "", string)
    print(result)
Output:
This is a  test string
The conclusion for this example: The pattern r"\+?-?\d+(\.\d+)?" handles unsigned, signed, and decimal numbers. It does not cover other notations, such as numbers without a leading digit (.5) or with an exponent (1e3).
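One detail of "\+?-?" is that it accepts both signs at once, so "+-5" is treated as a single number. A character set, [+-]?, allows at most one sign. The following sketch compares the two. The alternative pattern, the constant names, and the "a+-5b" sample are assumptions for illustration, and the group is written as non-capturing so that findall returns whole matches.

```python
import re

ORIGINAL = re.compile(r"\+?-?\d+(?:\.\d+)?")   # article's pattern, non-capturing group
ONE_SIGN = re.compile(r"[+-]?\d+(?:\.\d+)?")   # at most one leading sign

text = "Thi+44s i-3.6s a 56 tes-5t s50.2tri-99ng"
print(ONE_SIGN.sub("", text))      # This is a  test string

print(ORIGINAL.findall("a+-5b"))   # ['+-5']
print(ONE_SIGN.findall("a+-5b"))   # ['-5']
```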
Example 6: Remove the First or the Last x Characters
    # Remove the first six characters
    import re

    string = "This is a test string"
    result = re.sub(r"^.{6}", "", string)
    print(result)

    # Remove the last eight characters
    string = "This is a test string"
    result = re.sub(r".{8}$", "", string)
    print(result)
Output:
s a test string
This is a tes
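The count inside the braces can be filled in from a variable. The following is a minimal sketch, assuming hypothetical helper names, with slicing shown as a non-regex point of comparison.

```python
import re

def remove_first(text, n):
    # Doubled braces produce literal "{" and "}" in an f-string,
    # so n=6 yields the pattern "^.{6}".
    return re.sub(rf"^.{{{n}}}", "", text)

def remove_last(text, n):
    # Same idea, anchored at the end of the string.
    return re.sub(rf".{{{n}}}$", "", text)

text = "This is a test string"
print(remove_first(text, 6))  # s a test string
print(remove_last(text, 8))   # This is a tes

# If the string is shorter than n, the regex does not match and the text
# is returned unchanged, whereas slicing returns an empty string.
print(repr(remove_first("abc", 5)))  # 'abc'
print(repr("abc"[5:]))               # ''
```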
Example 7: Remove All Whitespaces or Extra Spaces
    # Remove whitespaces
    import re

    string = "This is a test string"
    result = re.sub(r"\s+", "", string)
    print(result)
Output:
Thisisateststring
To keep a single space between words, use " " as the replacement instead of the empty string:
    import re

    # remove extra spaces
    string = "This   is  a    test   string"
    cleaned_string = re.sub(r"\s+", " ", string)
    print(cleaned_string)
Output:
This is a test string
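Collapsing runs of whitespace still leaves a single space at either end if the input starts or ends with whitespace. The sketch below adds strip() to handle that, using a made-up input and a hypothetical helper name. " ".join(text.split()) is a common non-regex equivalent.

```python
import re

def normalize_spaces(text):
    # Hypothetical helper: collapse whitespace runs, then trim both ends.
    return re.sub(r"\s+", " ", text).strip()

messy = "  This   is \t a  test\n"
print(repr(re.sub(r"\s+", " ", messy)))  # ' This is a test '
print(repr(normalize_spaces(messy)))     # 'This is a test'
print(repr(" ".join(messy.split())))     # 'This is a test'
```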
Example 8: Remove Capital Letters
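    # Remove caps
    import re

    string = "THIS is A Test StrinG"
    no_caps = re.sub(r"[A-Z]", "", string)
    print(no_caps)
Output:
 is  est trin
The spaces around the removed letters are kept, so the output starts with a space and contains a double space where "A" used to be.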
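The set [A-Z] covers only the 26 ASCII capitals, so accented or non-Latin uppercase letters survive. If those should be removed as well, one non-regex option is str.isupper(), sketched here with a made-up sample string.

```python
import re

# Hypothetical sample; "\u00c9" is the single character É (E with acute accent).
text = "\u00c9COLE is Open"

# The regex removes only the ASCII capitals C, O, L, E, and O.
ascii_only = re.sub(r"[A-Z]", "", text)

# str.isupper() is Unicode-aware, so the accented capital is removed too.
any_upper = "".join(ch for ch in text if not ch.isupper())

print(repr(ascii_only))  # 'É is pen'
print(repr(any_upper))   # ' is pen'
```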
Example 9: Remove the First or Last Character in Every Word
    import re

    string = "This is a test string"
    # remove the first letter of every word
    result = re.sub(r"\b\w", "", string)
    print(result)
Output:
his s  est tring
    import re

    s = "This is a test string"
    # remove the last character in every word
    result = re.sub(r"\w(?=\s|$)", "", s)
    print(result)
Output:
Thi i  tes strin
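The lookahead (?=\s|$) only recognizes words that are followed by whitespace or the end of the string. When words may be followed by punctuation, a word boundary \b after the character is one alternative. Here is a sketch with a made-up sentence.

```python
import re

text = "This is a test, string."

# Lookahead version: "test," and "string." keep their last letter.
with_lookahead = re.sub(r"\w(?=\s|$)", "", text)
print(repr(with_lookahead))  # 'Thi i  test, string.'

# Word-boundary version: the last word character of every word is removed.
with_boundary = re.sub(r"\w\b", "", text)
print(repr(with_boundary))   # 'Thi i  tes, strin.'
```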
Example 10: Remove Punctuation Marks and Numbers
    import re

    string = "Thi+44s is %a 56 t%es-5t s*tri-99NG"
    result = re.sub(r"[^a-zA-Z\s]+", "", string)
    print(result)
Output:
This is a  test striNG
The pattern: r"[^a-zA-Z\s]+"
Explanation: The set [^a-zA-Z\s] matches any character that is not a lowercase or uppercase ASCII letter or a whitespace character.
Conclusion
This article discussed removing characters from a string using regular expressions in Python. The ten examples covered here present different cases of the problem. You can read more about Python regular expressions in the re module documentation or practice writing them at https://regexr.com/ and https://regex101.com/.