Regex Replace Group in Python

This article goes through two things: how to capture groups in Python regex and how to replace the groups within a string.

Capturing Groups in Regex

Parentheses, (), are used to group sub-patterns in a regex. For example, the pattern (a|b)xz(c|d) matches any substring that starts with a or b, followed by xz, and ends with c or d.

The first sub-pattern, (a|b), is group 1, and the second, (c|d), is group 2. Groups are numbered from left to right by their opening parenthesis. Let’s see an example in code (some explanations about the code are found after the snippet).

import re

str1 = "a black dog and black cat tore a black cloth"
pattern = r"(a|and) black (dog|cat)"

# re.finditer() returns an iterator over all non-overlapping matches
matches = re.finditer(pattern, str1)

for match in matches:
    print("The match:", match)
    # group(0) returns the entire match
    print("Group 0: ", match.group(0))
    # first captured group (a|and)
    print("Group 1: ", match.group(1))
    # second captured group (dog|cat)
    print("Group 2: ", match.group(2))
    # groups() returns a tuple of all captured groups
    print("All groups:", match.groups())

Output:

The match: <re.Match object; span=(0, 11), match='a black dog'>
Group 0:  a black dog
Group 1:  a
Group 2:  dog
All groups: ('a', 'dog')
The match: <re.Match object; span=(12, 25), match='and black cat'>
Group 0:  and black cat
Group 1:  and
Group 2:  cat
All groups: ('and', 'cat')

Explanation:

Our string: str1 = "a black dog and black cat tore a black cloth"

The pattern: pattern = r"(a|and) black (dog|cat)"

The pattern matches any substring that starts with a or and, followed by black, and ends with dog or cat.

Therefore, for our string, str1, this pattern produces two matches: “a black dog” and “and black cat”. In the first match, group 1 is “a” and group 2 is “dog”; in the second, group 1 is “and” and group 2 is “cat”. The trailing “a black cloth” is not matched because cloth is neither dog nor cat.
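The (a|b)xz(c|d) pattern from the start of this section can be checked the same way. The snippet below is a minimal sketch; the candidate strings and the names candidates and captured are made up for illustration.

```python
import re

# Pattern from the start of this section: group 1 is (a|b), group 2 is (c|d)
pattern = re.compile(r"(a|b)xz(c|d)")

# Hypothetical test strings, chosen only for illustration
candidates = ["axzc", "bxzd", "cxza"]

captured = {}
for candidate in candidates:
    match = pattern.fullmatch(candidate)
    # groups() returns (group 1, group 2); None means the string did not match
    captured[candidate] = match.groups() if match else None

print(captured)
```

The last candidate does not match because it starts with c, which is not one of the alternatives in group 1.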

Referencing Groups in Regex

There are two ways to reference groups:

\1, \2 up to \99 to refer to the corresponding captured group. However, note that \0, and any reference made of three octal digits (such as \100), is interpreted as an octal character escape and not as a group reference. If you need to reference the 100th group or higher, use the second method.
\g<1>, \g<2>, … (not limited to 99) to refer to the corresponding captured group. This form also avoids ambiguity when a literal digit follows the reference: \g<1>3 means group 1 followed by the character 3, whereas \13 would be read as group 13. \g<0> refers to the entire match.

We will see more about the group referencing in the next section.
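As a quick preview, the sketch below applies both reference styles to the same replacement. The input string and variable names are assumptions made up for this illustration.

```python
import re

# Hypothetical input: two numbers separated by a hyphen
text = "2024-05"
pattern = r"(\d+)-(\d+)"

# Numeric style: \2 and \1
swapped_numeric = re.sub(pattern, r"\2-\1", text)

# \g<...> style: same result, but never ambiguous
swapped_g = re.sub(pattern, r"\g<2>-\g<1>", text)

# \g<0> stands for the entire match
whole = re.sub(pattern, r"[\g<0>]", text)

# \0 is NOT the entire match: it is read as an octal escape (the NUL character)
octal = re.sub(pattern, r"\0", text)

print(swapped_numeric, swapped_g, whole, repr(octal))
```

Both styles give the same swapped string here; the difference only matters when a digit follows the reference or when the whole match is needed.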

Replacing Groups in Regex

We need to do three things here: capture a group, reference it in the replacement string, and substitute it accordingly.

We will use the re.sub(pattern, replace, string) function, which returns a copy of string in which every match of pattern is replaced by replace.

Let’s see some examples.

Example 1

import re

str1 = "a black dog and black cat tore a black cloth"
pattern = r"(a|and) black (dog|cat)"

# re.finditer() returns an iterator over all matches; collect them in a list
matches = list(re.finditer(pattern, str1))
print(matches)

# Replaces each match with "xxx"
result0 = re.sub(pattern, r'xxx', str1)
print(result0)

# Group 1 of each match
matches = [i.group(1) for i in re.finditer(pattern, str1)]
print(matches)

# Replaces each match with its group 1 value followed by "xxx"
result1 = re.sub(pattern, r'\1xxx', str1)  # or r'\g<1>xxx'
print(result1)

# Group 2 of each match
matches = [i.group(2) for i in re.finditer(pattern, str1)]
print(matches)

# Replaces each match with its group 2 value followed by "xxx"
result2 = re.sub(pattern, r'\g<2>xxx', str1)  # or r'\2xxx'
print(result2)

Output:

[<re.Match object; span=(0, 11), match='a black dog'>, <re.Match object; span=(12, 25), match='and black cat'>]
xxx xxx tore a black cloth
['a', 'and']
axxx andxxx tore a black cloth
['dog', 'cat']
dogxxx catxxx tore a black cloth

Explanation:

As said before, r"(a|and) black (dog|cat)" matches a substring starting with a or and, followed by black, and ending with dog or cat. This matches “a black dog” and “and black cat” in str1.
re.sub(pattern, r'xxx', str1) replaces the two matches with “xxx”.
re.sub(pattern, r'\1xxx', str1) or re.sub(pattern, r'\g<1>xxx', str1) replaces each whole match with its group 1 value followed by “xxx”. The rest of the match (“ black dog” or “ black cat”) is dropped because it is not referenced in the replacement. The same explanation holds for group 2 using re.sub(pattern, r'\2xxx', str1).
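Both groups can also be referenced in one replacement, which keeps the captured parts and changes only the text between them. A minimal sketch using the same string and pattern follows; the replacement word white and the name result_both are picked only for illustration.

```python
import re

str1 = "a black dog and black cat tore a black cloth"
pattern = r"(a|and) black (dog|cat)"

# Keep group 1 and group 2, and swap the literal " black " for " white "
result_both = re.sub(pattern, r"\1 white \2", str1)

print(result_both)
```

Only the two matched substrings change; “a black cloth” is left alone because it never matched the pattern.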

Example 2

import re

string = "254, 0704 502 7898"

# Three digits (\d{3}) as a group,
# followed by one or more whitespace characters (\s+)
pattern = r"(\d{3})\s+"

# Just to check the matches
matches = re.finditer(pattern, string)

for match in matches:
    print("Match: ", match)
    print("Group 1 match: ", match.group(1))

# Replace each match with its group 1 value + "xx"
result = re.sub(pattern, r"\1xx", string)
print(result)

Output:

Match:  <re.Match object; span=(6, 10), match='704 '>
Group 1 match:  704
Match:  <re.Match object; span=(10, 14), match='502 '>
Group 1 match:  502
254, 0704xx502xx7898

Explanation:

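The pattern r"(\d{3})\s+" matches three digits, \d{3}, followed by one or more whitespace characters, \s+, in "254, 0704 502 7898". That means “704 ” and “502 ” are matched, with “704” and “502” captured as group 1. “254” is not matched because it is followed by a comma, and only the last three digits of “0704” are part of a match.
r"\1xx" means each match is replaced with its group 1 value followed by “xx”. The whitespace is part of the match but not of the group, so it is removed from the result.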
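If the whitespace should survive the replacement, one option is to capture it as a second group and reference it too. This is a sketch of that variation and not part of the original example; the names pattern_keep_space and result_keep_space are made up here.

```python
import re

string = "254, 0704 502 7898"

# Same as before, but the whitespace is now captured as group 2
pattern_keep_space = r"(\d{3})(\s+)"

# Group 1, then "xx", then the original whitespace from group 2
result_keep_space = re.sub(pattern_keep_space, r"\1xx\2", string)

print(result_keep_space)
```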

Example 3

The following example swaps two decimal numbers around the separating comma and space (", ").

import re

coords = "0.71331, 52,25378"
coord_re = re.sub(r"(\d+[.,]\d+), (\d+[.,]\d+)", r"\2, \1", coords)
print(coord_re)

Output:

52,25378, 0.71331

Explanations:

\d+ means one or more digits, and [.,] is a character set that matches either a comma or a period. That means r"\d+[.,]\d+" matches 0,718 and 0.718 and 22.89 and 52.25378, et cetera.
r"\2, \1" means the group 2 value comes first, followed by a comma and a space, and then the group 1 value. This swaps the positions of the two captured numbers.
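The \g<...> form also accepts group names when the pattern uses named groups, written (?P<name>...), which can make a swap like this easier to read. The sketch below is an optional variation; the group names first and second are hypothetical and chosen only for illustration.

```python
import re

coords = "0.71331, 52,25378"

# Same pattern as above, with the two groups given (hypothetical) names
pattern = r"(?P<first>\d+[.,]\d+), (?P<second>\d+[.,]\d+)"

# Reference the groups by name instead of by number
swapped = re.sub(pattern, r"\g<second>, \g<first>", coords)

print(swapped)
```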

Example 4

import re

result = re.sub(r"\[(\d+)\]", r"\g<1>3", "[12] pencils and [8] books")
print(result)

Output:

123 pencils and 83 books

Explanations:

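r"\[(\d+)\]" matches one or more digits enclosed in square brackets, for example, [1] and [99]. The number inside the brackets is captured as group 1.
r"\g<1>3" replaces each match with its group 1 value (the number inside the brackets) followed by a literal 3. The \g<1> form is needed here because r"\13" would be read as a reference to group 13.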
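The ambiguity can be seen directly by trying both forms. In the sketch below, the names unambiguous and error_message are made up for illustration; since the pattern has only one group, the \13 form is expected to raise re.error.

```python
import re

text = "[12] pencils and [8] books"
pattern = r"\[(\d+)\]"

# \g<1>3 is group 1 followed by the literal character 3
unambiguous = re.sub(pattern, r"\g<1>3", text)
print(unambiguous)

# \13 is read as a reference to group 13, which this pattern does not have
try:
    re.sub(pattern, r"\13", text)
    error_message = None
except re.error as exc:
    error_message = str(exc)

print(error_message)
```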

Example 5

In this example, we want to process image names. In particular, we want to remove the substring matching the pattern defined. Comments are included to explain what the code does.

import re

# Matches a hyphen (-) followed by word characters (\w+) as group 1,
# followed by ".jpg" or ".png" or ".svg" as group 2
pattern = r"(\-\w+)(\.jpg|\.png|\.svg)"

# We want to remove the hyphen and the text after it, up to the extension
images = [
    "-abc123567-abc.jpg",
    "test1.png",
    "1267-exit.jpg",
    "-abc34567-stuff.jpg",
    "bxdsj-example.jpg",
    "test1-smd.svg",
]

for image in images:
    # r"\2" references group 2 (the extension). Group 1, the unwanted
    # part, is not referenced and is therefore discarded
    result = re.sub(pattern, r"\2", image)
    print(result)

Output:

-abc123567.jpg
test1.png
1267.jpg
-abc34567.jpg
bxdsj.jpg
test1.svg
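To see why the replacement references group 2 instead of being empty, compare the two on a single file name. This is a small sketch; the names kept_extension and dropped_extension are made up for illustration.

```python
import re

pattern = r"(\-\w+)(\.jpg|\.png|\.svg)"
image = "1267-exit.jpg"

# Referencing group 2 puts the extension back after the match is removed
kept_extension = re.sub(pattern, r"\2", image)

# An empty replacement removes the whole match, extension included
dropped_extension = re.sub(pattern, "", image)

print(kept_extension)
print(dropped_extension)
```

The extension is part of the match, so it has to be restored through its group reference.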

Conclusion

This post covered how to replace groups in Python regex by working through five examples. While these examples may help you understand how to replace groups in regex, you may also find it useful to practice writing regular expressions for your own use case on these two sites: https://regexr.com/ and https://regex101.com/.
