Regex quantifiers in Python match in one of two ways: greedy or non-greedy (also called lazy).
The Difference Between Greedy and Non-greedy Matching
Greedy matching means the regex engine matches as much as possible while still allowing the overall pattern to match. Non-greedy matching, by contrast, matches as little as possible under the same constraint. Note that regex matching is greedy by default in Python.
In Python, you make a quantifier non-greedy by placing the “?” character immediately after it; without the “?”, the quantifier stays greedy. A quantifier (covered later in this article) determines how many times the preceding character or group of characters should be matched.
An Example of Greedy and Non-greedy Regex Matching
Suppose we have a string “fabcdaxyzapq”; we can match a substring starting and ending with “a” in a greedy manner using the pattern “a.*a” – where “.*” matches zero or more occurrences of any character. Here is the Python code.
    import re

    str1 = "fabcdaxyzapq"

    # Greedy matching
    greedy_result = re.search("a.*a", str1)
    print(greedy_result.group())
Output:
abcdaxyza
We can get the shortest substring starting with “a” and ending with “a” using non-greedy matching, as shown below.
    import re

    str1 = "fabcdaxyzapq"

    # Matching made non-greedy by adding the "?" character after the "*" quantifier.
    non_greedy_result = re.search("a.*?a", str1)
    print(non_greedy_result.group())
Output:
abcda
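The same difference shows up whenever a pattern has an opening and a closing delimiter. The sketch below is only an illustration: the sample string and variable names are made up for this example and do not come from the code above.

```python
import re

# Hypothetical sample text with two pairs of tags.
html = "<b>bold</b> and <i>italic</i>"

# Greedy: ".*" runs on to the last ">" in the string.
greedy_tag = re.search(r"<.*>", html).group()

# Non-greedy: ".*?" stops at the first ">" that lets the pattern match.
lazy_tag = re.search(r"<.*?>", html).group()

print(greedy_tag)  # <b>bold</b> and <i>italic</i>
print(lazy_tag)    # <b>
```

The greedy pattern swallows the whole string because it ends with “>”, while the non-greedy pattern returns only the first tag.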
Common Regex Quantifiers in Python
The following table contains some common regex quantifiers. As noted earlier, these quantifiers are greedy by default. You can add “?” after a quantifier to make it non-greedy (for a{m}, which matches a fixed count, this makes no difference).
Quantifier | Description |
a* | Matches zero or more occurrences of “a”. |
a+ | Matches one or more occurrences of “a”. |
a? | Matches zero or one occurrence of “a”. |
a{m} | Matches exactly m occurrences of “a”. |
a{m,n} | Matches m to n (inclusive) occurrences of “a”. |
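To see the quantifiers from the table side by side, the following sketch applies each one and its “?” variant to the start of a sample string using re.match. The string “aaaa” and the values m=2, n=3 are assumptions chosen only for illustration.

```python
import re

text = "aaaa"  # hypothetical sample string

# (greedy, non-greedy) pattern pairs for the quantifiers in the table,
# using m=2 and n=3 as example values.
pairs = [
    ("a*", "a*?"),
    ("a+", "a+?"),
    ("a?", "a??"),
    ("a{2}", "a{2}?"),
    ("a{2,3}", "a{2,3}?"),
]

results = {}
for greedy, lazy in pairs:
    # re.match anchors at the start of the string, so only the
    # quantifier's own behaviour decides how much is consumed.
    results[greedy] = re.match(greedy, text).group()
    results[lazy] = re.match(lazy, text).group()
    print(f"{greedy!r} -> {results[greedy]!r}   {lazy!r} -> {results[lazy]!r}")
```

Here “a*?” and “a??” match the empty string, because zero occurrences already satisfy them, and “a{2}?” matches the same two characters as “a{2}”.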
More Examples of Greedy and Non-greedy Matching
Example 1
    import re

    str2 = "aaaabbccd"

    greedy_search = re.findall("b+", str2)
    print(greedy_search)

    non_greedy_search = re.findall("b+?", str2)
    print(non_greedy_search)
Output:
['bb']
['b', 'b']
The pattern “b+” matches one or more occurrences of “b”. In the example above, “b+” matches the two consecutive “b” letters in “aaaabbccd” as a single match.
The expression “b+?”, on the other hand, is the non-greedy version of “b+”, which means it will match one or more occurrences of “b”, but it will try to get the smallest possible sequence of “b” characters. Therefore, the pattern will match individual “b” characters.
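One way to check this is to look at where each match starts and ends. The sketch below uses re.finditer on the same string; the variable names for the span lists are introduced here for illustration only.

```python
import re

str2 = "aaaabbccd"

# Each span is a (start, end) pair of indices into str2.
greedy_spans = [m.span() for m in re.finditer("b+", str2)]
lazy_spans = [m.span() for m in re.finditer("b+?", str2)]

print(greedy_spans)  # [(4, 6)]
print(lazy_spans)    # [(4, 5), (5, 6)]
```

The greedy pattern produces one match covering both letters, while the non-greedy pattern produces two one-character matches over the same range.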
Example 2
    import re

    str3 = 'This is a test string'

    # Greedy matching (raw string so that "\w" reaches the regex engine unchanged)
    pattern = r'\w+'
    greedy = re.findall(pattern, str3)
    print(greedy)

    # Non-greedy matching
    pattern = r'\w+?'
    non_greedy = re.findall(pattern, str3)
    print(non_greedy)
Output:
['This', 'is', 'a', 'test', 'string']
['T', 'h', 'i', 's', 'i', 's', 'a', 't', 'e', 's', 't', 's', 't', 'r', 'i', 'n', 'g']
In the example above, “\w+” matches one or more occurrences of any alphanumeric character or underscore.
Greedy matching matches as many characters as possible to form a list of complete words (matching stops when it hits white space, which is not alphanumeric).
On the other hand, non-greedy matching takes the fewest characters the pattern allows. For that reason, “\w+?” matches one alphanumeric character at a time.
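A non-greedy quantifier still has to let the rest of the pattern match, so it takes more characters when what follows requires it. As a small sketch of this, the code below appends a word boundary (“\b”) to the non-greedy pattern; the extra pattern and the variable names are assumptions added for illustration.

```python
import re

str3 = "This is a test string"

# On its own, the non-greedy quantifier stops after a single character.
one_at_a_time = re.findall(r"\w+?", str3)

# With a word boundary required afterwards, it must keep extending
# until it reaches the end of each word.
whole_words = re.findall(r"\w+?\b", str3)

print(one_at_a_time[:4])  # ['T', 'h', 'i', 's']
print(whole_words)        # ['This', 'is', 'a', 'test', 'string']
```

In the second case the non-greedy pattern returns the same list as the greedy “\w+”, because the shortest match that ends at a word boundary is the whole word.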
Conclusion
This article discussed two forms of regex matching in Python: greedy and non-greedy. The former is the default, while the latter must be requested explicitly by adding a “?” character after the regex quantifier. After going through the examples in this guide, you should be able to use both greedy and non-greedy matching with ease.