The re module is part of Python's standard library and provides support for regular expressions. This article covers ways of searching for a word in a string using this module.
In particular, we will discuss how to search for a whole word in a string, a word with a given substring, words starting with or ending with x, etc. However, before we do that, let’s get some basics out of the way.
Some Special Characters in re
Some characters in re are special: they change how a pattern is interpreted. The table below lists the ones used most often in this article.
Character | Meaning | Example |
* | Matches 0 or more repetitions of the preceding character or group. | "ab*" matches "a", "ab", "abb", and so on: "a" followed by any number of "b"s. |
+ | Matches 1 or more repetitions of the preceding character or group. | "ab+" matches "ab", "abb", and so on: "a" followed by at least one "b". It does not match "a" alone. |
\w (the opposite is \W) | Matches any word character: a letter, a digit, or the underscore. | "\w+" matches "accor56ding" and "a_b". |
\b | Matches the empty string at a word boundary, that is, the position between a \w and a \W character (or the start or end of the string). It does not consume whitespace or any other character. | "\bis\b" matches the word "is" in "This is", but not the "is" inside "This". |
\d (the opposite is \D) | Matches a decimal digit such as 0-9. | "\d+" matches "56" in "accor56ding". |
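As a quick check of the table, the sketch below runs each special character against a small sample string. The sample text and variable names are made up for illustration and are not part of the original examples.

```python
import re

text = "a ab abb x_1 42"  # hypothetical sample text

# "*" allows zero repetitions, so a lone "a" matches "ab*".
star = re.findall(r"\bab*\b", text)
print(star)        # ['a', 'ab', 'abb']

# "+" needs at least one "b", so the lone "a" is skipped.
plus = re.findall(r"\bab+\b", text)
print(plus)        # ['ab', 'abb']

# \w+ collects runs of letters, digits, and underscores.
words = re.findall(r"\w+", text)
print(words)       # ['a', 'ab', 'abb', 'x_1', '42']

# \d+ collects runs of digits only.
digits = re.findall(r"\d+", text)
print(digits)      # ['1', '42']

# \b is zero-width: it matches a position, not a character.
boundary = re.search(r"\b", text).span()
print(boundary)    # (0, 0)
```

The empty span (0, 0) in the last line shows that \b marks a position between characters rather than matching a space.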
It’s time to see some examples now.
Example 1: Finding a Whole Word in a Python String
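We will use the r"\b<word>\b" pattern to find a whole word in a string. As mentioned earlier, \b marks a word boundary, so the pattern matches the word only when it stands on its own and not when it is part of a longer word.
The "r" prefix makes the pattern a raw string, which means Python leaves backslashes untouched instead of interpreting them as escape sequences. Without the prefix, Python would turn "\b" into a backspace character before re ever sees it. With the prefix, the backslash and the "b" reach the re module intact, and re interprets them as a word boundary.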
    import re

    # The string we want to search and the pattern to look for
    str1 = "This is a bigger group. There is not a small group"
    pattern = r"\bgroup\b"

    # Use re.search() to find the pattern r"\bgroup\b",
    # which matches the whole word "group"
    result = re.search(pattern, str1)

    # Print the result
    print(result)

    # Start and end indices
    print("Starts at:", result.start())
    print("Ends at:", result.end())

    # Span of the word - a tuple of start and end.
    print("Spans:", result.span())

Output:

    <re.Match object; span=(17, 22), match='group'>
    Starts at: 17
    Ends at: 22
    Spans: (17, 22)
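The example above calls result.start() directly, which assumes the word was found. re.search() returns None when there is no match, so calling a method on the result would then raise an AttributeError. A minimal sketch of a guarded lookup follows; find_word is a hypothetical helper name, not part of the re module.

```python
import re

str1 = "This is a bigger group. There is not a small group"

def find_word(word, text):
    """Return the (start, end) span of the first whole-word match, or None.

    Hypothetical helper for illustration only.
    """
    # re.escape keeps any regex metacharacters in the word literal.
    match = re.search(r"\b" + re.escape(word) + r"\b", text)
    return match.span() if match else None

found = find_word("group", str1)
missing = find_word("big", str1)

print(found)    # (17, 22)
print(missing)  # None, because "big" only occurs inside "bigger"
```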
Notice that re.search() returns only the first occurrence of the word. To match every occurrence, use re.finditer(), which returns an iterator that yields a match object for each non-overlapping match of the pattern; for example,
    import re

    # The string we want to search
    str1 = "This is a bigger group. There is not a small group"
    pattern = r"\bgroup\b"

    result_iter = re.finditer(pattern, str1)
    for result in result_iter:
        print(result)

Output:

    <re.Match object; span=(17, 22), match='group'>
    <re.Match object; span=(45, 50), match='group'>
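The r prefix on these patterns is easy to overlook. The following sketch compares the same pattern written with and without it; the sample text and variable names are illustrative.

```python
import re

text = "a small group"  # hypothetical sample text

raw_pattern = r"\bgroup\b"    # backslash + "b" reaches re as a word boundary
plain_pattern = "\bgroup\b"   # Python turns "\b" into a backspace character

# The raw pattern keeps both characters of each "\b"; the plain one does not.
print(len(raw_pattern), len(plain_pattern))   # 9 7

raw_match = re.search(raw_pattern, text)
plain_match = re.search(plain_pattern, text)

print(raw_match)     # a match object for "group"
print(plain_match)   # None: the text contains no backspace characters
```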
Example 2: Search for Multiple Words
We can search for multiple words in a string using the OR operator (“|”) in re. For example,
    import re

    # Match more than one word - matching small, is, and group
    results = re.finditer(r"\bsmall\b|\bis\b|\bgroup\b",
                          "This is a bigger group. There is not a small group")

    # Collect all match objects from the iterator into a list
    results = list(results)
    print(results)

Output:

    [<re.Match object; span=(5, 7), match='is'>, <re.Match object; span=(17, 22), match='group'>, <re.Match object; span=(30, 32), match='is'>, <re.Match object; span=(39, 44), match='small'>, <re.Match object; span=(45, 50), match='group'>]
We can also use the re.findall() function to find all matches as a list of strings.
    import re

    results = re.findall(r"\bsmall\b|\bis\b|\bgroup\b",
                         "This is a bigger group. There is not a small group")
    print(results)

Output:

    ['is', 'group', 'is', 'small', 'group']
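When the words come from a list rather than being typed by hand, the alternation can be assembled programmatically. The sketch below is one way to do it, assuming a hypothetical words list; it wraps the alternatives in a non-capturing group so that a single pair of \b anchors applies to all of them.

```python
import re

text = "This is a bigger group. There is not a small group"
words = ["small", "is", "group"]   # hypothetical word list

# Builds r"\b(?:small|is|group)\b"; re.escape keeps any regex
# metacharacters in the words literal.
pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"

results = re.findall(pattern, text)
print(results)   # ['is', 'group', 'is', 'small', 'group']
```

re.escape() makes no difference for plain words like these, but it keeps characters such as "." or "+" literal if they ever appear in the list.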
Example 3: Finding Words That Start or End with a Given Substring
Here are some examples. The code contains comments to help you understand.
    import re

    str1 = "It always bothers me that, accor56ding to the a laws as we understand them today"

    # Find words that start with "a"
    results1 = re.findall(r"\ba\w*\b", str1)
    print(results1)

    # Match words that start with "a" followed by at least one
    # word character, so the single letter "a" is not a match.
    results2 = re.findall(r"\ba\w+\b", str1)
    print(results2)

    # Match any word that ends with "ws"
    results3 = re.findall(r"\w*ws\b", str1)
    print(results3)

    # Words starting with "a", ignoring words with digits, e.g. "accor56ding".
    # [^\d\W] matches any character that is neither a digit (\d) nor a
    # non-word character (\W), i.e. a letter or an underscore.
    results4 = re.findall(r"\ba[^\d\W]+\b", str1)
    print(results4)

Output:

    ['always', 'accor56ding', 'a', 'as']
    ['always', 'accor56ding', 'as']
    ['laws']
    ['always', 'as']
The first three patterns in the above snippet use \w, so they accept digits inside a word. The last one uses [^\d\W] instead, so it returns only words that contain no digits.
Note also the difference between \w* and \w+: the former matches zero or more word characters, while the latter requires at least one.
We can also combine these conditions to find words that start with one substring and/or end with another. An example is given below.
    import re

    str1 = "It always bothers me that, accor56ding to the a laws as we understand them today"

    # Starts with "bo" or ends with "e". Note the use of the or ("|") operator.
    results3 = re.findall(r"\bbo\w*|\w*e\b", str1)
    print(results3)

    # Starts with "un" and ends with "d".
    results5 = re.findall(r"\bun\w*d\b", str1)
    print(results5)

Output:

    ['bothers', 'me', 'the', 'we']
    ['understand']
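The prefix and suffix patterns above can be wrapped in a small helper. This is only a sketch: words_with_affix and its parameters are hypothetical names chosen for illustration, and it assumes the prefix and suffix do not overlap inside the word.

```python
import re

def words_with_affix(text, prefix="", suffix=""):
    """Return whole words that start with prefix and end with suffix.

    Hypothetical helper for illustration; an empty prefix or suffix
    places no constraint on that end of the word.
    """
    pattern = r"\b" + re.escape(prefix) + r"\w*" + re.escape(suffix) + r"\b"
    return re.findall(pattern, text)

str1 = "It always bothers me that, accor56ding to the a laws as we understand them today"

both = words_with_affix(str1, prefix="un", suffix="d")
starts = words_with_affix(str1, prefix="bo")
ends = words_with_affix(str1, suffix="ws")

print(both)    # ['understand']
print(starts)  # ['bothers']
print(ends)    # ['laws']
```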
Example 4: Finding Words That Contain a Given Substring or Character
    import re

    str1 = "It always bothers me that, acco56rding to the a laws as we understand them today"

    # \w* captures 0 or more word characters before and after "a", so
    # \b\w*a\w*\b captures any word containing "a", including "a" itself.
    results = re.findall(r"\b\w*a\w*\b", str1)
    print(results)

    # \w+ means at least one word character, so this pattern matches a word
    # that contains "s" followed by at least one more word character before
    # the word boundary. Words whose only "s" is the last letter, such as
    # "laws", do not match.
    results = re.findall(r"\b\w*s\w+\b", str1)
    print(results)

    # \d captures a digit, so this pattern captures a word that contains a digit
    # followed by at least one more word character (\w+) before the boundary (\b).
    results = re.findall(r"\b\w*\d\w+\b", str1)
    print(results)

Output:

    ['always', 'that', 'acco56rding', 'a', 'laws', 'as', 'understand', 'today']
    ['understand']
    ['acco56rding']
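One way to sanity-check the first pattern is to compare it with a two-step approach: extract every word with \w+, then filter with Python's in operator. The variable names below are illustrative.

```python
import re

str1 = "It always bothers me that, acco56rding to the a laws as we understand them today"

# One-pattern version, as in the example above.
with_regex = re.findall(r"\b\w*a\w*\b", str1)

# Two-step version: extract every word, then keep those containing "a".
with_filter = [w for w in re.findall(r"\w+", str1) if "a" in w]

print(with_regex == with_filter)   # True
```

The two-step form can be easier to read when the condition on each word is simple, while the single pattern keeps everything in one call.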
Example 5: Dealing with Capitalization in Python re
The re module is case-sensitive by default. You can change that by passing the re.IGNORECASE flag. Here is an example.
    import re

    str1 = "Group members are committed to the group"

    # Default, case-sensitive search:
    # finds "group" only and not "Group"
    result1 = re.findall(r"\bgroup\b", str1)
    print(result1)

    # Case-insensitive search:
    # finds both "Group" and "group"
    result2 = re.findall(r"\bgroup\b", str1, re.IGNORECASE)
    print(result2)

Output:

    ['group']
    ['Group', 'group']
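If the same case-insensitive search runs many times, the pattern can be compiled once with the flag and reused. The sketch below also shows the inline (?i) form, which is an alternative way to set the same flag; the name group_re is illustrative.

```python
import re

str1 = "Group members are committed to the group"

# Compile once with the flag, then reuse the pattern object.
group_re = re.compile(r"\bgroup\b", re.IGNORECASE)
compiled_result = group_re.findall(str1)
print(compiled_result)   # ['Group', 'group']

# The inline flag (?i) at the start of the pattern has the same effect.
inline_result = re.findall(r"(?i)\bgroup\b", str1)
print(inline_result)     # ['Group', 'group']

# re.I is a short alias for re.IGNORECASE.
print(re.I == re.IGNORECASE)   # True
```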
Conclusion
The special character \b is mainly used to define the boundaries of words. Therefore, you may find it helpful in most cases when you are searching for a word in a Python string using regex. In this article, we have discussed five examples of looking for words using re.
You can do more practice on how to write regular expressions using these sites: https://regexr.com/ or https://regex101.com/.