This article goes through two things: how to capture groups in Python regex and how to replace the groups within a string.
Capturing Groups in Regex
Parentheses () are used to group sub-patterns in regex. For example, the pattern (a|b)xz(c|d) matches any substring that starts with a or b, followed by xz, and ends with c or d.
The first sub-pattern, (a|b), is group 1, and the second, (c|d), is group 2. Groups are numbered from left to right by their opening parenthesis. Let's see an example in code (some explanations about the code are found after the snippet).
import re

str1 = "a black dog and black cat tore a black cloth"
pattern = r"(a|and) black (dog|cat)"

# re.finditer() returns an iterator that yields a match object for each match
matches = re.finditer(pattern, str1)
for match in matches:
    print("The match:", match)
    # group(0) returns the entire match
    print("Group 0: ", match.group(0))
    # first captured group (a|and)
    print("Group 1: ", match.group(1))
    # second captured group (dog|cat)
    print("Group 2: ", match.group(2))
    # returns a tuple of all groups found.
    print("All groups:", match.groups())
Output:
The match: <re.Match object; span=(0, 11), match='a black dog'>
Group 0:  a black dog
Group 1:  a
Group 2:  dog
All groups: ('a', 'dog')
The match: <re.Match object; span=(12, 25), match='and black cat'>
Group 0:  and black cat
Group 1:  and
Group 2:  cat
All groups: ('and', 'cat')
Explanation:
Our string: str1 = "a black dog and black cat tore a black cloth"
The pattern: pattern = r"(a|and) black (dog|cat)"
The pattern r"(a|and) black (dog|cat)" matches any substring that starts with a or and, followed by black (with a space on either side), and ends with dog or cat.
Therefore, for our string, str1, this pattern finds two matches: "a black dog" (group 1 is "a", group 2 is "dog") and "and black cat" (group 1 is "and", group 2 is "cat"). The trailing "a black cloth" is not matched because cloth is neither dog nor cat.
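To tie this back to the (a|b)xz(c|d) pattern from the start of this section, here is a minimal sketch of how the two groups show up in the results. The sample text is made up purely for illustration.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "axzc bxzd cxzc"
pattern = r"(a|b)xz(c|d)"

# With two capturing groups, re.findall() returns one
# (group 1, group 2) tuple per match.
pairs = re.findall(pattern, text)
print(pairs)  # [('a', 'c'), ('b', 'd')]

# re.search() returns only the first match.
first = re.search(pattern, text)
print(first.group(0), first.group(1), first.group(2))  # axzc a c
```

The last word, cxzc, is skipped because it starts with c, which is neither a nor b.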
Referencing Groups in Regex
There are two ways to reference groups:
- \1, \2 up to \99 refer to the corresponding captured group. However, note that \0, any reference that starts with 0, and any sequence of three octal digits are interpreted as an octal character code and not as a group reference. If you need to reference the 100th group or higher, use the second method.
- \g<1>, \g<2>, ... (not limited to 99) refer to the corresponding captured group in a replacement string. This form avoids ambiguity when a digit follows the reference: \g<1>3 means group 1 followed by the literal 3, whereas \13 would be read as a reference to group 13.
We will see more about the group referencing in the next section.
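The difference between the two reference styles is easiest to see when a digit follows the reference. The sketch below uses a made-up sample string. It also uses \g<0>, which refers to the entire match in a replacement string.

```python
import re

text = "[12] pencils"   # hypothetical sample text
pattern = r"\[(\d+)\]"  # one capturing group: the digits inside the brackets

# \g<1>0 is unambiguous: group 1 followed by a literal "0".
with_literal = re.sub(pattern, r"\g<1>0", text)
print(with_literal)  # 120 pencils

# \g<0> refers to the entire match (here, the brackets included).
whole_match = re.sub(pattern, r"<\g<0>>", text)
print(whole_match)  # <[12]> pencils

# \10 is parsed as a reference to group 10, which this pattern does not have,
# so re.sub() raises re.error instead of inserting group 1 plus "0".
try:
    re.sub(pattern, r"\10", text)
    raised = False
except re.error as exc:
    raised = True
    print("re.error:", exc)
```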
Replacing Groups in Regex
We need to do three things here: capture a group, reference it, and then use that reference in the replacement.
We will use the re.sub(pattern, repl, string) function, which replaces every match of pattern in string with repl.
Let’s see some examples.
Example 1
import re

str1 = "a black dog and black cat tore a black cloth"
pattern = r"(a|and) black (dog|cat)"

# re.finditer() returns an iterator over all matches; collect them in a list
matches = list(re.finditer(pattern, str1))
print(matches)

# Replaces the matches with "xxx"
result0 = re.sub(pattern, r'xxx', str1)
print(result0)

# Matches for group 1
matches = [i.group(1) for i in re.finditer(pattern, str1)]
print(matches)
# Replace each match with its group 1 value followed by "xxx"
result1 = re.sub(pattern, r'\1xxx', str1)  # or r'\g<1>xxx'
print(result1)

# Matches for group 2
matches = [i.group(2) for i in re.finditer(pattern, str1)]
print(matches)
# Replace each match with its group 2 value followed by "xxx"
result2 = re.sub(pattern, r'\g<2>xxx', str1)  # or r'\2xxx'
print(result2)
Output:
[<re.Match object; span=(0, 11), match='a black dog'>, <re.Match object; span=(12, 25), match='and black cat'>]
xxx xxx tore a black cloth
['a', 'and']
axxx andxxx tore a black cloth
['dog', 'cat']
dogxxx catxxx tore a black cloth
Explanation:
- As said before, r"(a|and) black (dog|cat)" matches a substring that starts with a or and, followed by black, and ends with dog or cat. This matches "a black dog" and "and black cat" in str1.
- re.sub(pattern, r'xxx', str1) replaces the two matches with "xxx".
- re.sub(pattern, r'\1xxx', str1) or re.sub(pattern, r'\g<1>xxx', str1) replaces each whole match with its group 1 value followed by "xxx". Likewise, re.sub(pattern, r'\2xxx', str1) replaces each whole match with its group 2 value followed by "xxx".
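Both references can also appear in the same replacement string. As a small sketch reusing str1 and pattern from Example 1, the following keeps both captured groups and changes only the text between them. The word "white" is an arbitrary choice for illustration.

```python
import re

str1 = "a black dog and black cat tore a black cloth"
pattern = r"(a|and) black (dog|cat)"

# Keep group 1 and group 2, but swap the literal " black " for " white ".
result = re.sub(pattern, r"\1 white \2", str1)
print(result)  # a white dog and white cat tore a black cloth
```

Only the two matched substrings change. "a black cloth" is left alone because it never matched the pattern.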
Example 2
import re

string = "254, 0704 502 7898"

# Three digits (\d{3}) as a group,
# followed by 1 or more whitespace characters (\s+)
pattern = r"(\d{3})\s+"

# Just to check the matches
matches = re.finditer(pattern, string)
for match in matches:
    print("Match: ", match)
    print("Group 1 match: ", match.group(1))

# replace matches with group 1 values + "xx"
result = re.sub(pattern, r"\1xx", string)
print(result)
Output:
Match:  <re.Match object; span=(6, 10), match='704 '>
Group 1 match:  704
Match:  <re.Match object; span=(10, 14), match='502 '>
Group 1 match:  502
254, 0704xx502xx7898
Explanation:
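- The pattern r"(\d{3})\s+" matches three digits, \d{3}, followed by one or more whitespace characters, \s+, in "254, 0704 502 7898". That means "704" (the last three digits of 0704) and "502" are matched, each together with the space after it. 254 is followed by a comma and 7898 has no trailing whitespace, so neither of them matches.
- r"\1xx" means each match is replaced with its group 1 value followed by "xx". Because the whitespace sits outside the group, it is dropped from the result.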
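In Example 2 the whitespace disappears from the result because \s+ is part of the match but not part of group 1. If the spacing should be kept, one option (a sketch, not the only way to do it) is to capture the whitespace as a second group and put it back in the replacement.

```python
import re

string = "254, 0704 502 7898"

# Group 1: three digits; group 2: the whitespace that follows them.
pattern = r"(\d{3})(\s+)"

# Re-insert the captured whitespace after the "xx" marker.
result = re.sub(pattern, r"\1xx\2", string)
print(result)  # 254, 0704xx 502xx 7898
```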
Example 3
The following example swaps two decimal numbers around the comma (,) that separates them.
import re

coords = "0.71331, 52,25378"
coord_re = re.sub(r"(\d+[.,]\d+), (\d+[.,]\d+)", r"\2, \1", coords)
print(coord_re)
Output:
52,25378, 0.71331
Explanation:
- \d+ matches one or more digits, and [.,] is a character set that matches either a comma or a period. That means r"\d+[.,]\d+" matches 0,718, 0.718, 22.89, 52.25378, et cetera.
- r"\2, \1" puts the group 2 value first, then a comma and a space, then the group 1 value. This swaps the two captured numbers.
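Note that re.sub() applies the replacement to every non-overlapping match, not just the first one. The sketch below uses a made-up string with two coordinate pairs to show this, and uses the optional count argument of re.sub() to limit the swap to the first pair.

```python
import re

# Hypothetical input with two pairs, separated by a semicolon.
pairs = "1.5, 2.5; 3.5, 4.5"
pattern = r"(\d+[.,]\d+), (\d+[.,]\d+)"

# By default every matching pair is swapped.
swapped_all = re.sub(pattern, r"\2, \1", pairs)
print(swapped_all)  # 2.5, 1.5; 4.5, 3.5

# count=1 limits the replacement to the first match only.
swapped_first = re.sub(pattern, r"\2, \1", pairs, count=1)
print(swapped_first)  # 2.5, 1.5; 3.5, 4.5
```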
Example 4
import re

result = re.sub(r"\[(\d+)\]", r"\g<1>3", "[12] pencils and [8] books")
print(result)
Output:
123 pencils and 83 books
Explanation:
- r"\[(\d+)\]" matches one or more digits enclosed in square brackets, for example, [1] and [99]. The number inside the brackets is captured as group 1.
- r"\g<1>3" replaces each match with its group 1 value (the number inside the brackets) followed by 3. Writing r"\13" instead would be read as a reference to group 13.
Example 5
In this example, we want to process image names. In particular, we want to remove the hyphenated suffix that comes right before the file extension. Comments are included to explain what the code does.
import re

# Matches a hyphen (-) followed by one or more word characters (\w+) as group 1,
# followed by ".jpg" or ".png" or ".svg" as group 2
pattern = r"(\-\w+)(\.jpg|\.png|\.svg)"

# we want to remove the substring after the hyphen before the extension
images = [
    "-abc123567-abc.jpg",
    "test1.png",
    "1267-exit.jpg",
    "-abc34567-stuff.jpg",
    "bxdsj-example.jpg",
    "test1-smd.svg",
]

for image in images:
    # r"\2" references group 2. Therefore the group 1 (the unwanted)
    # part is discarded
    result = re.sub(pattern, r"\2", image)
    print(result)
Output:
-abc123567.jpg
test1.png
1267.jpg
-abc34567.jpg
bxdsj.jpg
test1.svg
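It may not be obvious why the leading "-abc123567" survives in the first name while "-abc" is removed. A quick way to check, sketched below with two of the names from Example 5, is to print what each group actually captured. \w+ does not match a hyphen, and group 1 must be followed immediately by the extension, so only the last hyphenated chunk matches.

```python
import re

pattern = r"(\-\w+)(\.jpg|\.png|\.svg)"

for name in ["-abc123567-abc.jpg", "test1.png"]:
    match = re.search(pattern, name)
    if match:
        # groups() shows (group 1, group 2) for the matched part of the name
        print(name, "->", match.groups())
    else:
        # names without a hyphenated suffix are left unchanged by re.sub()
        print(name, "-> no match")
```

For the first name the groups are ('-abc', '.jpg'). The second name has no hyphen at all, so there is no match and re.sub() returns it unchanged.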
Conclusion
This post covered how to replace groups in Python regex by working through five examples. While these examples may help you understand how to replace groups in regex, you may also find it useful to practice writing regular expressions for your own use case on these two sites: https://regexr.com/ and https://regex101.com/.