Split String Based on Regex in Python

Suppose we want to split the sentence “I am coming home tomorrow” on whitespace. The first tool most people reach for is the str.split() method:

str1 = "I am coming home tomorrow"
splits = str1.split(" ")
print(splits)

Output:

['I', 'am', 'coming', 'home', 'tomorrow']

But what if we want to split “I9am10coming11home12tomorrow” on the numbers embedded in the string? Here str.split() falls short, because it accepts only a single fixed separator. We need a tool that can handle more complex split patterns, and the split() function in the re module fits that description. This article shows how to use re.split() to split strings on a given pattern.
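To make the limitation concrete, here is a minimal sketch of what happens when str.split() is given just one of the numbers as a separator. The variable names text and parts_fixed are chosen only for this illustration.

```python
text = "I9am10coming11home12tomorrow"

# str.split() takes one fixed separator, so only the "9" is treated
# as a delimiter; the other numbers stay inside the second piece.
parts_fixed = text.split("9")
print(parts_fixed)  # ['I', 'am10coming11home12tomorrow']
```

Covering every number this way would take a separate call per separator, which is the problem a pattern-based split avoids.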

Syntax of re.split()

The general syntax for re.split() is given by:

re.split(pattern, string, maxsplit=0, flags=0)

Where:

  • pattern – the regular expression that describes the delimiters to split on,
  • string – the Python string we wish to split,
  • maxsplit – (optional) the maximum number of splits. The default value is 0, meaning string is split on every occurrence of pattern. If maxsplit=1, string is split at the first occurrence only, resulting in a list of two elements; for maxsplit=2, the split is performed at the first and second occurrences, resulting in a list of three substrings, and so on.
  • flags – (optional) modifies how matching is done; for example, re.IGNORECASE makes matching case-insensitive, and re.ASCII restricts sequences such as \w and \d to ASCII characters.
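The optional parameters can be illustrated with a short sketch. The sample string and variable names below are made up for this example; re.IGNORECASE is a standard flag of the re module.

```python
import re

text = "oneXtwoxthreeXfour"

# Without a flag, the pattern "x" matches only the lowercase letter.
case_sensitive = re.split("x", text)
print(case_sensitive)    # ['oneXtwo', 'threeXfour']

# re.IGNORECASE makes the same pattern match both "x" and "X".
case_insensitive = re.split("x", text, flags=re.IGNORECASE)
print(case_insensitive)  # ['one', 'two', 'three', 'four']

# maxsplit=1 stops after the first split; the rest is left untouched.
first_only = re.split("x", text, maxsplit=1, flags=re.IGNORECASE)
print(first_only)        # ['one', 'twoxthreeXfour']
```

Passing maxsplit and flags by keyword, as done here, keeps the call readable and avoids mixing up the two integer arguments.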

Examples – how to use the re.split() function

Here is an example of how to use re.split() to split the string “I9am10coming11home12tomorrow” on the numeric values.

import re
str1 = "I9am10coming11home12tomorrow"
splits = re.split(r"\d+", str1)  # raw string keeps the backslash intact for the regex engine
print(splits)

Output:

['I', 'am', 'coming', 'home', 'tomorrow']

The output shows that str1 is split on the numeric values 9, 10, 11, and 12. The pattern “\d+” combines the special sequence \d (a digit) with + (one or more), so it matches any run of one or more consecutive digits. Let’s look at more of these special sequences before we continue with further examples.

Metacharacters and special character sequences

The following characters and sequences are helpful when writing patterns for the re module (you can see more at https://docs.python.org/3/library/re.html).

  • \w – Matches word characters: a-z, A-Z, 0-9, and the underscore (_). Example: “Hav*i)ng &10 &fun” gives 11 matches (H, a, v, i, n, g, 1, 0, f, u, n); “!#@$H#%#&$*()(%^” gives 1 match (“H”).
  • \W – Matches any character that is not a word character; the opposite of \w.
  • \s – Matches whitespace. Example: “Pizza inn Lanet” gives 2 matches (the two spaces).
  • \S – Matches non-whitespace. Example: “I am” gives 3 matches (I, a, and m).
  • \d – Matches digits; for ASCII text, equivalent to [0-9]. Example: “I9am10coming11home” gives 5 matches (9, 1, 0, 1, and 1).
  • \D – Matches non-digits; the opposite of \d.
  • . – Matches any single character except the newline “\n”. Example: for the pattern “M.n”, “Man” and “Men” match, but “Mean” does not.
  • + – Matches one or more occurrences of the expression to its left. Example: for the pattern “ma+n”, “maaanman” gives 2 matches (maaan and man).
  • * – Matches zero or more occurrences of the expression to its left. Example: for the pattern “ma*n”, “maaanmn” gives 2 matches (maaan and mn).
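The match counts quoted above can be checked with re.findall(), which returns every non-overlapping match of a pattern. This is only a verification sketch; the names samples, counts, and dot_results are introduced here for illustration.

```python
import re

# (pattern, sample text) pairs taken from the entries above.
samples = [
    (r"\w", "Hav*i)ng &10 &fun"),
    (r"\w", "!#@$H#%#&$*()(%^"),
    (r"\s", "Pizza inn Lanet"),
    (r"\S", "I am"),
    (r"\d", "I9am10coming11home"),
    (r"ma+n", "maaanman"),
    (r"ma*n", "maaanmn"),
]

# Count the matches each pattern finds in its sample text.
counts = [len(re.findall(pattern, text)) for pattern, text in samples]
print(counts)  # [11, 1, 2, 3, 5, 2, 2]

# The period matches exactly one character, so "Mean" is one letter too long.
dot_results = {word: bool(re.fullmatch(r"M.n", word)) for word in ("Man", "Men", "Mean")}
print(dot_results)  # {'Man': True, 'Men': True, 'Mean': False}
```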

More examples of splitting a string using re.split()

Example 1: Splitting using backslash (“\”)

import re
str3 = r"path\location\search\home\directory"
splits = re.split(r"\\", str3)
print(splits)

Output:

['path', 'location', 'search', 'home', 'directory']

In Python string literals, the backslash (\) is an escape character that introduces a special sequence (for example, “\n” means new line and “\t” means tab). However, a backslash is treated as a literal character in a raw string (a string prefixed with “r”, as in str3 above) or when it is preceded by another backslash.

The re module also treats “\” as a special character; therefore, the pattern itself must contain two backslashes (“\\”) for the regex engine to see one literal backslash. Writing the pattern as the raw string r"\\" delivers exactly those two characters.
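The two layers of escaping can be seen side by side in the following sketch. It assumes nothing beyond the standard re module; the variable names are chosen for this illustration.

```python
import re

path = r"path\location\search\home\directory"

# Both literals below produce the same two-character pattern
# (backslash, backslash), which the regex engine reads as one
# literal backslash.
pattern_raw = r"\\"
pattern_escaped = "\\\\"
print(pattern_raw == pattern_escaped)  # True
print(len(pattern_raw))                # 2

parts = re.split(pattern_raw, path)
print(parts)  # ['path', 'location', 'search', 'home', 'directory']
```

The raw-string form is usually easier to read, since it needs half as many backslashes.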

Example 2: Splitting using period (.)

As noted in the section on metacharacters, the period is a special character that matches any single character. Therefore, to split on a literal period, we need to precede it with the escape character, \.

import re
str4 = "The.elections.were.free.and.fair"
splits = re.split(r"\.", str4)  # \. matches a literal period
print(splits)

Output:

['The', 'elections', 'were', 'free', 'and', 'fair']
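To show why the escape matters, the sketch below compares an unescaped period with two ways of matching it literally. re.escape() is a standard helper in the re module that escapes special characters; the short sample string is made up for this example.

```python
import re

text = "a.b"

# An unescaped "." matches every character, so each character becomes
# a delimiter and only empty strings remain.
unescaped = re.split(".", text)
print(unescaped)  # ['', '', '', '']

# re.escape() builds a pattern that matches the given text literally.
escaped = re.split(re.escape("."), text)
print(escaped)    # ['a', 'b']

# Inside square brackets, the period loses its special meaning.
in_class = re.split("[.]", text)
print(in_class)   # ['a', 'b']
```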

Example 3: Splitting using several characters

import re
str6 = r"Going&back\on-Monday_or|Wednesday"  # raw string, so \o stays a backslash followed by "o"
splits = re.split(r"[&\\\-_|]", str6)  # the set matches &, \, -, _ and |
print(splits)

Output:

['Going', 'back', 'on', 'Monday', 'or', 'Wednesday']

The square brackets ([ ]) specify a set of characters, any one of which counts as a match. In our case, we want to match &, \, -, _, and |. Note that inside square brackets the hyphen (-) is special because it denotes a range (as in [a-z]); to match a literal hyphen, we escape it as “\-” or place it first or last in the set.
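The role of the hyphen inside a character set can be sketched as follows. The sample strings and variable names are illustrative only.

```python
import re

text = r"Going&back\on-Monday_or|Wednesday"

# Escaping the hyphen and placing it last in the set are equivalent.
with_escape = re.split(r"[&\\\-_|]", text)
hyphen_last = re.split(r"[&\\_|-]", text)
print(with_escape == hyphen_last)  # True

# Between two characters, an unescaped hyphen defines a range instead.
as_range = re.findall(r"[a-c]", "a-b-c")
as_literal = re.findall(r"[a\-c]", "a-b-c")
print(as_range)    # ['a', 'b', 'c']
print(as_literal)  # ['a', '-', '-', 'c']
```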

Example 4: Splitting with a maxsplit value

import re
string = 'Today is a good day.'
result = re.split(r"\s+", string, maxsplit=2)  # split on the first two runs of whitespace only
print(result)

Output:

['Today', 'is', 'a good day.']
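With maxsplit=2, only the first two runs of whitespace act as delimiters, and the remainder of the string is returned as the last element. The sketch below runs the same split with several limits for comparison; the names sentence and results are introduced for this example.

```python
import re

sentence = "Today is a good day."

# Map each maxsplit value to the list it produces.
results = {
    limit: re.split(r"\s+", sentence, maxsplit=limit)
    for limit in (0, 1, 2, 10)
}

for limit, parts in results.items():
    print(limit, parts)
# 0 ['Today', 'is', 'a', 'good', 'day.']
# 1 ['Today', 'is a good day.']
# 2 ['Today', 'is', 'a good day.']
# 10 ['Today', 'is', 'a', 'good', 'day.']
```

A maxsplit of 0 means no limit, and a limit larger than the number of delimiters simply splits everywhere.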

Conclusion

The re module can match complex patterns in Python strings. This article covered how to use the re.split() function to split strings on such patterns. Other useful functions in re include findall(), search(), and sub(). You can read more about these functions at https://docs.python.org/3/library/re.html.