Regex Replace Group in Python

This article goes through two things: how to capture groups in Python regex and how to replace the groups within a string.

Capturing Groups in Regex

Parentheses () are used to group sub-patterns in regex. For example, the pattern (a|b)xz(c|d) matches any substring that starts with a or b, is followed by xz, and ends with c or d.

The first sub-pattern, (a|b), is group 1, and the second, (c|d), is group 2. Let’s see an example in code (explanations follow the snippet).
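A minimal sketch of such a snippet is given here. It assumes re.finditer is used to iterate over the matches, and the variable names (str1, pattern, matches) are chosen for illustration.

```python
import re

str1 = "a black dog and black cat tore a black cloth"
pattern = r"(a|and) black (dog|cat)"

# Collect every non-overlapping match of the pattern in str1.
matches = list(re.finditer(pattern, str1))

for m in matches:
    print("The match:", m)
    print("Group 0: ", m.group(0))   # the whole match
    print("Group 1: ", m.group(1))   # text captured by (a|and)
    print("Group 2: ", m.group(2))   # text captured by (dog|cat)
    print("All groups:", m.groups())
```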

Output:

The match: <re.Match object; span=(0, 11), match='a black dog'>
Group 0:  a black dog
Group 1:  a
Group 2:  dog
All groups: ('a', 'dog')
The match: <re.Match object; span=(12, 25), match='and black cat'>
Group 0:  and black cat
Group 1:  and
Group 2:  cat
All groups: ('and', 'cat')

Explanation:

Our string: str1 = "a black dog and black cat tore a black cloth"

The pattern: pattern = r"(a|and) black (dog|cat)"

The pattern r"(a|and) black (dog|cat)" matches any substring that starts with a or and, followed by black, and ends with dog or cat.

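Therefore, for our string, str1, this pattern produces two matches: “a black dog” and “and black cat”. In the first match, group 1 is “a” and group 2 is “dog”; in the second, group 1 is “and” and group 2 is “cat”. Group 0 always refers to the whole match.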

Referencing Groups in Regex

There are two ways to reference groups:

  • \1, \2, and so on up to \99 refer to the corresponding captured group. Note that a reference whose first digit is 0 (such as \0), or one made of three octal digits, is interpreted as an octal character escape and not as a group reference. If you need to reference group 100 or higher, use the second method.
  • \g<1>, \g<2>, and so on (not limited to 99) refer to the corresponding captured group. This form also avoids ambiguity when a digit follows the reference: \g<2>0 means group 2 followed by the literal character 0, whereas \20 means group 20.
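The two reference styles can be compared with a short sketch. The date-like string and the variable names below are assumptions made for illustration and are not part of the examples in this article.

```python
import re

text = "2024-05"
pattern = r"(\d+)-(\d+)"

# Both styles refer to the same captured groups.
numeric_style = re.sub(pattern, r"\2/\1", text)
g_style = re.sub(pattern, r"\g<2>/\g<1>", text)
print(numeric_style, g_style)

# \g<1>0 is group 1 followed by a literal "0".
group_then_zero = re.sub(pattern, r"\g<1>0", text)
print(group_then_zero)

# \10 is read as a reference to group 10, which this pattern does not define,
# so re.sub raises re.error instead of inserting group 1 plus "0".
try:
    re.sub(pattern, r"\10", text)
    ambiguous_ok = True
except re.error:
    ambiguous_ok = False
print(ambiguous_ok)
```

Here both styles give the same swapped result, while only the \g<1> form can express "group 1 followed by a digit".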

We will see more about the group referencing in the next section.

Replacing Groups in Regex

We need to do three things here: capture a group, create its reference and then replace it accordingly.

We will use re.sub(pattern, repl, string), which returns a copy of string in which every match of pattern is replaced with repl.

Let’s see some examples.

Example 1
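The calls discussed in this example can be sketched as follows. This is a minimal reconstruction, assuming str1 and pattern are the same as in the capturing section and that re.finditer is used to list the matches; the result variable names are illustrative.

```python
import re

str1 = "a black dog and black cat tore a black cloth"
pattern = r"(a|and) black (dog|cat)"

# All matches, then each whole match replaced with "xxx".
print(list(re.finditer(pattern, str1)))
whole = re.sub(pattern, r"xxx", str1)
print(whole)

# Group 1 values, then each match replaced with its group 1 value + "xxx".
print([m.group(1) for m in re.finditer(pattern, str1)])
keep_first = re.sub(pattern, r"\1xxx", str1)
print(keep_first)

# Group 2 values, then each match replaced with its group 2 value + "xxx".
print([m.group(2) for m in re.finditer(pattern, str1)])
keep_second = re.sub(pattern, r"\2xxx", str1)
print(keep_second)
```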

Output:

[<re.Match object; span=(0, 11), match='a black dog'>, <re.Match object; span=(12, 25), match='and black cat'>]
xxx xxx tore a black cloth
['a', 'and']
axxx andxxx tore a black cloth
['dog', 'cat']
dogxxx catxxx tore a black cloth

Explanation:

  • As said before, r"(a|and) black (dog|cat)" matches a substring that starts with a or and, is followed by black, and ends with dog or cat. This matches “a black dog” and “and black cat” in str1.
  • re.sub(pattern, r'xxx', str1) replaces each of the two matches with “xxx”.
  • re.sub(pattern, r'\1xxx', str1), or equivalently re.sub(pattern, r'\g<1>xxx', str1), replaces each whole match with its group 1 value followed by “xxx”. Likewise, re.sub(pattern, r'\2xxx', str1) replaces each whole match with its group 2 value followed by “xxx”.

Example 2
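A minimal sketch of this example, assuming the input string and pattern quoted in the explanation below; the variable names str2 and result are illustrative.

```python
import re

str2 = "254, 0704 502 7898"
pattern = r"(\d{3})\s+"

# Show each match and the three digits captured as group 1.
for m in re.finditer(pattern, str2):
    print("Match: ", m)
    print("Group 1 match: ", m.group(1))

# Replace each match (digits plus trailing whitespace) with the digits + "xx".
result = re.sub(pattern, r"\1xx", str2)
print(result)
```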

Output:

Match:  <re.Match object; span=(6, 10), match='704 '>
Group 1 match:  704
Match:  <re.Match object; span=(10, 14), match='502 '>
Group 1 match:  502
254, 0704xx502xx7898

Explanation:

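  • The pattern r"(\d{3})\s+" matches three digits, \d{3}, followed by one or more whitespace characters, \s+, in “254, 0704 502 7898”. That means “704 ” and “502 ” are matched, each including its trailing space, with “704” and “502” captured as group 1. “254” is not matched because it is followed by a comma, not whitespace.
  • r"\1xx" means each match is replaced with its group 1 value followed by “xx”, which is why the spaces after 704 and 502 disappear.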

Example 3

The following example swaps two decimal numbers around the comma (,) that separates them.
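A minimal sketch of such a swap follows. The input string and the full two-group pattern are assumptions, chosen only to be consistent with the output and explanation in this example; the variable names are illustrative.

```python
import re

# Assumed input: two numbers separated by ", ", the second using a decimal comma.
str3 = "0.71331, 52,25378"

# Each number is captured as its own group so the two can be reordered.
pattern = r"(\d+[.,]\d+), (\d+[.,]\d+)"

# Emit group 2 first, then group 1.
swapped = re.sub(pattern, r"\2, \1", str3)
print(swapped)
```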

Output:

52,25378, 0.71331

Explanations:

  • \d+ means one or more digits, and [.,] is a character set that matches either a comma or a period. That means r"\d+[.,]\d+" matches 0,718 and 0.718 and 22.89 and 52.25378, et cetera.
  • r"\2, \1" puts the group 2 value before the group 1 value, which swaps the two captured numbers.

Example 4
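A minimal sketch of this example, using the pattern and replacement from the explanation below. The input string is an assumption chosen to be consistent with the output; the variable names are illustrative.

```python
import re

# Assumed input string.
str4 = "[12] pencils and [8] books"
pattern = r"\[(\d+)\]"

# Replace each bracketed number with the number itself followed by a literal 3.
result = re.sub(pattern, r"\g<1>3", str4)
print(result)
```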

Output:

123 pencils and 83 books

Explanations:

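  • r"\[(\d+)\]" matches one or more digits enclosed in square brackets, for example, [1] and [99]. The number inside the brackets is captured as group 1.
  • r"\g<1>3" replaces each match with its group 1 value (the number inside the brackets) followed by the literal digit 3. The \g<1> form is needed here because \13 would be read as a reference to group 13.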

Example 5

In this example, we want to process image names. In particular, we want to remove the substring matching the pattern defined. Comments are included to explain what the code does.

Output:

-abc123567.jpg
test1.png
1267.jpg
-abc34567.jpg
bxdsj.jpg
test1.svg

Conclusion

This post covered how to replace groups in Python regex by working through five examples. Going through these examples should help you understand how to replace groups in regex. To practice writing regular expressions for your own use case, you may find these two sites useful: https://regexr.com/ and https://regex101.com/.