The re module is Python’s standard-library module for working with regular expressions. In most cases, we are interested in matching patterns. In other cases, however, we want to exclude parts of a string based on a given pattern.
This article covers different methods used to exclude substrings, giving examples in each case.
Using re Special Characters
Two pieces of special syntax come in handy in this case: square brackets [ ] and the caret ^. Square brackets are used to specify a set of characters, any one of which we wish to match.
For example, “[abc]” is used to match characters “a”, “b” and “c”, and “[0-5][0-9]” matches number strings between 00 and 59.
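As a quick check of the two sets above, the sketch below uses re.fullmatch so that the whole string has to fit the pattern. The helper name is_minute and the sample strings are made up for illustration.

```python
import re

# "[0-5][0-9]" accepts exactly two characters: a digit 0-5 followed by a digit 0-9.
MINUTE = re.compile(r"[0-5][0-9]")


def is_minute(text):
    """Return True if text is a two-digit string between 00 and 59 (hypothetical helper)."""
    return MINUTE.fullmatch(text) is not None


print(is_minute("07"))  # True
print(is_minute("59"))  # True
print(is_minute("60"))  # False: "6" is not in [0-5]

# "[abc]" matches one character at a time, so findall returns every a, b or c.
print(re.findall(r"[abc]", "a black cab"))
```

re.fullmatch is used here instead of re.search because re.search would also accept longer strings such as "1234", which merely contain a matching pair of digits.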
If the first character in the set, [ ], is “^”, the set is negated: it matches any single character that is not listed. For example, “[^xyz]” matches any character except “x”, “y” and “z”. Here are more examples (Table 1):
Regular expression | Meaning |
“[^a-z]” | Matches all characters except any lowercase ASCII alphabetical letter |
“[^.]” | Matches any character except a period |
“[^\w]” | Matches any character except word characters |
“[a^z]” | Matches “a”, “^”, and “z”. |
Note: “[^^]” will match any character except “^”. ^ has no special meaning if it’s not the first character in the set, [ ] (see the last row of Table 1).
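The rows of Table 1 can be tried out directly with re.findall. This is only a small sketch, and the sample string is invented for illustration.

```python
import re

sample = "a^z. B_2!"

# Everything except lowercase ASCII letters.
print(re.findall(r"[^a-z]", sample))
# Everything except a period.
print(re.findall(r"[^.]", sample))
# Everything except word characters (letters, digits, underscore).
print(re.findall(r"[^\w]", sample))
# "^" is not first here, so it is an ordinary member of the set.
print(re.findall(r"[a^z]", sample))
# The leading "^" negates; the second "^" is the character being excluded.
print(re.findall(r"[^^]", sample))
```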
Let’s now look at some examples in code.
Example 1: Excluding all integers
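The pattern “\d+” matches one or more consecutive digits in the string passed. If we want to exclude numbers, we need to match runs of non-digit characters instead, using the pattern “[^\d]+”. Note that the “+” goes outside the set; “[^\d+]” would also exclude literal plus signs.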
import re

# Exclude integers in the string.
# Find all non-number substrings.
result0 = re.findall(r"[^\d]+", "Look like 27readable E6nglish. Many de0sktop")
print(result0)
# Optionally, you can join the strings at the points where numbers were stripped off
# using the join() method
print("".join(result0))
Output:
['Look like ', 'readable E', 'nglish. Many de', 'sktop']
Look like readable English. Many desktop
All the numbers have now been excluded.
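Python also provides the shorthand \D, which matches any non-digit character just like “[^\d]”, and the same joined result can be reached by deleting the digits with re.sub. The sketch below compares these approaches on the string from Example 1.

```python
import re

text = "Look like 27readable E6nglish. Many de0sktop"

negated_set = re.findall(r"[^\d]+", text)  # pattern used in Example 1
shorthand = re.findall(r"\D+", text)       # \D is the built-in non-digit class
print(negated_set == shorthand)            # True

# Deleting the digit runs gives the joined string in one step.
print(re.sub(r"\d+", "", text))
```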
Example 2: Exclude non-integer numbers (numbers with decimals)
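What if we have real numbers with decimals in our string? The code in Example 1 will fail because “\d” matches digit characters only, so the decimal point is left behind. To capture non-integer numbers as well, we can use the pattern “\d+(?:\.\d+)?”, which matches a run of digits optionally followed by a period and more digits. This pattern cannot simply be wrapped in a negated set, as in “[^(\d+(?:\.\d+)?)]+”: inside [ ], the characters “(”, “)”, “+”, “?”, “:” and “.” are ordinary characters, so that set would also drop every period in the text. Instead, we split the string at each number with re.split().
# exclude numbers with decimals.
import re

result1 = re.split(r"\d+(?:\.\d+)?", "maki2.9ng it over 2000 years old. Richa6.1rd McCl8intock")
print(result1)
print("".join(result1))
Output:
['maki', 'ng it over ', ' years old. Richa', 'rd McCl', 'intock']
making it over  years old. Richard McClintock
The code snippet above excluded all numbers in the string (2.9, 2000, 6.1, and 8) while keeping the full stop that ends the sentence.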
Exclude by substitution using re.sub()
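The re.sub(pattern, repl, string) function finds every non-overlapping substring of string that matches pattern and replaces it with repl. Passing an empty string as repl removes the matches, which excludes them from the result.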
import re

# string to search and substitute.
string1 = "All the Lorem Ipsum as necessary. It uses a dictionary of over 200 words, generate Lorem Ipsum."
# find "Lorem Ipsum" and replace it with an empty string
result3 = re.sub("Lorem Ipsum", "", string1)
print(result3)
# find any number(s) in string1 and replace them with "999"
result4 = re.sub(r"\d+", "999", string1)
print(result4)
Output:
All the  as necessary. It uses a dictionary of over 200 words, generate .
All the Lorem Ipsum as necessary. It uses a dictionary of over 999 words, generate Lorem Ipsum.
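re.sub also accepts an optional count argument that limits the number of replacements, and removing a substring often leaves doubled spaces behind. The sketch below shows both. The tidy-up step with a second re.sub is an assumption about what a reader may want, not part of the original example.

```python
import re

string1 = "All the Lorem Ipsum as necessary. It uses a dictionary of over 200 words, generate Lorem Ipsum."

# count=1 removes only the first occurrence.
first_only = re.sub("Lorem Ipsum", "", string1, count=1)
print(first_only)

# Remove every occurrence, then collapse the runs of spaces left behind.
removed = re.sub("Lorem Ipsum", "", string1)
tidy = re.sub(r" {2,}", " ", removed)
print(tidy)
```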
Exclude by Searching and Skipping
In this section, we want to search for a substring based on a pattern and do one thing if it is found and another if it is not.
In the example below, we want to search for and skip/exclude strings that contain the substring “test”, whether it appears as a word of its own or inside a longer word.
import re

lst = [
    "it is a long established fact that testa reader will be distracted",
    "by the readable content of a page when looking at its layout.",
    "The point of using Lorem Ipsum is that it has a more-or-less ",
    "normal distribution of test letters, as opposed to using",
    "'Content here, content here', maktesting it look like readable English.",
    "Many desktop publishing packages and web page editors now use Lorem Ipsum",
]
for s in lst:
    result4 = re.search("test", s)
    # re.search returns None if "test" is not found; otherwise it returns a
    # Match object describing the first occurrence of "test" in "s"
    if result4 is not None:
        # do something
        print(f"test found, {result4}")
    else:
        # do something else
        print(f"test not found, returned {result4}")
Output:
test found, <re.Match object; span=(35, 39), match='test'>
test not found, returned None
test not found, returned None
test found, <re.Match object; span=(23, 27), match='test'>
test found, <re.Match object; span=(33, 37), match='test'>
test not found, returned None
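To actually drop the matching strings rather than just report them, the same re.search test can drive a list comprehension. The sketch below also shows \b word boundaries, which restrict the match to “test” as a whole word. The shortened list and the variable names are assumptions for illustration.

```python
import re

lines = [
    "it is a long established fact that testa reader will be distracted",
    "by the readable content of a page when looking at its layout.",
    "normal distribution of test letters, as opposed to using",
    "'Content here, content here', maktesting it look like readable English.",
]

# Keep only the strings in which "test" does not occur anywhere.
without_test = [s for s in lines if re.search("test", s) is None]
print(without_test)

# \b marks a word boundary, so only the standalone word "test" is excluded;
# "testa" and "maktesting" no longer count as matches.
without_word = [s for s in lines if re.search(r"\btest\b", s) is None]
print(len(without_word))  # 3
```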
Conclusion
This article discussed methods of excluding some substrings based on patterns given to re. You can study more about re in the documentation and practice writing regular expressions at https://regexr.com/ or https://regex101.com/.
By doing so, you will better understand how to write patterns that fit your use case.