Suppose we want to split the sentence “I am coming home tomorrow” on whitespace. The first idea for most people is to use the string method <str>.split(), as follows:
str1 = "I am coming home tomorrow"
splits = str1.split(" ")
print(splits)
Output:
['I', 'am', 'coming', 'home', 'tomorrow']
But what if we want to split “I9am10coming11home12tomorrow” on the numeric values within the string? In this case, <str>.split() is not a good choice, because it accepts only a fixed separator string and cannot express “any run of digits”. We need a tool that can handle complex split patterns. The re module has a split() function that fits that description. This article discusses how to use the re.split() function to split strings based on given patterns.
Syntax of re.split()
The general syntax for re.split() is given by:
re.split(pattern, str, maxsplit=0, flags=0)
Where:
- pattern – a regular expression describing the delimiters to split on,
- str – the Python string we wish to split,
- maxsplit – (optional) the maximum number of splits. The default value is 0, meaning str is split on all pattern occurrences. If maxsplit=1, str is split at the first occurrence only, resulting in a list of at most two elements; for maxsplit=2, the split is performed at the first and second occurrences, resulting in a list of at most three substrings, and so on.
- flags – (optional) modifiers that control how matching is done; for example, re.IGNORECASE makes matching case-insensitive, and re.ASCII restricts sequences such as \w and \d to ASCII characters.
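The effect of the two optional parameters can be seen in a short sketch. The sample string and the variable names below are made up for illustration and are not part of the re module.

```python
import re

# Hypothetical sample text: the delimiter appears as both "X" and "x".
text = "oneXtwoxthreeXfour"

# maxsplit=0 (the default) splits on every match of the pattern.
all_splits = re.split("X", text)
# maxsplit=1 stops after the first split and leaves the rest intact.
first_only = re.split("X", text, maxsplit=1)
# re.IGNORECASE lets the pattern "X" match the lowercase "x" as well.
any_case = re.split("X", text, flags=re.IGNORECASE)

print(all_splits)   # ['one', 'twoxthree', 'four']
print(first_only)   # ['one', 'twoxthreeXfour']
print(any_case)     # ['one', 'two', 'three', 'four']
```

Passing maxsplit and flags by keyword, as done here, keeps the call readable and avoids mixing up the two integer-like arguments.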
Examples – how to use re.split() function
Here is an example of how to use re.split() to split the string “I9am10coming11home12tomorrow” on the numeric values.
import re

str1 = "I9am10coming11home12tomorrow"
# Raw string, so Python passes the backslash to the re module unchanged.
splits = re.split(r"\d+", str1)
print(splits)
Output:
['I', 'am', 'coming', 'home', 'tomorrow']
The output shows that the string str1 is split on the numeric values 9, 10, 11, and 12. The pattern “\d+” (the special sequence \d followed by the + quantifier) matches runs of one or more digits. Let’s look at more of these special sequences before we continue with further examples.
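To see why the + matters, the following sketch compares “\d” with “\d+” on the same string. The variable names are illustrative only.

```python
import re

str1 = "I9am10coming11home12tomorrow"

# "\d" matches one digit at a time, so "10" counts as two delimiters
# and an empty string appears between them.
single_digit = re.split(r"\d", str1)
# "\d+" treats each whole run of digits as a single delimiter.
digit_run = re.split(r"\d+", str1)

print(single_digit)  # ['I', 'am', '', 'coming', '', 'home', '', 'tomorrow']
print(digit_run)     # ['I', 'am', 'coming', 'home', 'tomorrow']
```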
Metacharacters and special character sequences
The following characters and sequences are helpful when writing patterns for the re module (you can see more at https://docs.python.org/3/library/re.html).
Sequence(s) | Description | Example matches |
\w | Matches word characters. For ASCII text: a-z, A-Z, 0-9, and the underscore (_). | “Hav*i)ng &10 &fun” (11 matches: every letter and digit); “!#@$H#%#&$*()(%^” (1 match: “H”) |
\W | Matches non-word characters. Opposite of \w. | |
\s | Matches whitespace. | “Pizza inn Lanet” (2 matches) |
\S | Matches non-whitespace. | “I am” (3 matches) |
\d | Matches digits. For ASCII text, equivalent to [0-9]. | “I9am10coming11home” (5 matches: 9, 1, 0, 1, and 1) |
\D | Matches non-digits. Opposite of \d. | |
. | Matches any single character except the new line “\n”. | Pattern “M.n”: “Man” – match, “Men” – match, “Mean” – no match |
+ | Matches one or more occurrences of the expression to its left. | Pattern “ma+n” in “maaanman”: 2 matches (“maaan” and “man”) |
* | Matches zero or more occurrences of the expression to its left. | Pattern “ma*n” in “maaanmn”: 2 matches (“maaan” and “mn”) |
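The match counts in the table can be checked with re.findall(), which returns every non-overlapping match as a list. This is only a quick verification sketch; the variable names are chosen here for illustration.

```python
import re

# len() of each list gives the number of matches quoted in the table.
word_chars = re.findall(r"\w", "Hav*i)ng &10 &fun")
spaces = re.findall(r"\s", "Pizza inn Lanet")
digits = re.findall(r"\d", "I9am10coming11home")
plus_matches = re.findall(r"ma+n", "maaanman")
star_matches = re.findall(r"ma*n", "maaanmn")

print(len(word_chars))  # 11
print(len(spaces))      # 2
print(digits)           # ['9', '1', '0', '1', '1']
print(plus_matches)     # ['maaan', 'man']
print(star_matches)     # ['maaan', 'mn']
```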
More examples of splitting a string using re.split()
Example 1: Splitting using backslash (“\”)
import re

str3 = r"path\location\search\home\directory"
splits = re.split(r"\\", str3)
print(splits)
Output:
['path', 'location', 'search', 'home', 'directory']
In Python string literals, the backslash (\) is an escape character that introduces a special sequence (for example, “\n” means new line and “\t” means tab character). However, a backslash is treated as a literal character in a raw string (a string literal prefixed with “r”, as in str3 above) or when it is itself preceded by another backslash.
The re module also treats “\” as a special character; therefore, the pattern must contain “\\” for the backslash to be matched literally. Written as a raw string, that pattern is r"\\".
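The two layers of escaping described above can be compared directly. In this minimal sketch the variable names are illustrative; the point is that the raw and the regular spelling produce the same two-character pattern.

```python
import re

path = r"path\location\search"

# Raw string: the two characters \\ reach the regex engine unchanged.
raw_pattern = r"\\"
# Regular string: Python first collapses each \\ into one backslash,
# so four are needed to hand the regex engine two.
plain_pattern = "\\\\"

same_pattern = raw_pattern == plain_pattern
parts = re.split(raw_pattern, path)

print(same_pattern)  # True
print(parts)         # ['path', 'location', 'search']
```

Writing patterns as raw strings is usually the less error-prone choice, since the pattern then reads exactly as the re module sees it.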
Example 2: Splitting using period (.)
As shown in the table above, the period is a special character. Therefore, when writing the pattern, we need to precede it with the escape character, \.
import re

str4 = "The.elections.were.free.and.fair"
# Raw string, so "\." reaches the re module as an escaped period.
splits = re.split(r"\.", str4)
print(splits)
Output:
['The', 'elections', 'were', 'free', 'and', 'fair']
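As a small illustration of why the escape is needed, the sketch below splits on an unescaped period and then uses re.escape(), which builds the escaped pattern for a literal delimiter. The short sample strings and variable names are assumptions made for the example.

```python
import re

# Unescaped, "." matches every character, so only empty strings remain.
unescaped = re.split(".", "a.b")
# re.escape() escapes special characters so the delimiter is matched literally.
escaped = re.split(re.escape("."), "The.elections.were.free")

print(unescaped)  # ['', '', '', '']
print(escaped)    # ['The', 'elections', 'were', 'free']
```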
Example 3: Splitting using several characters
import re

# Raw strings keep every backslash exactly as typed.
str6 = r"Going&back\on-Monday_or|Wednesday"
splits = re.split(r"[&\\\-_|]", str6)
print(splits)
Output:
['Going', 'back', 'on', 'Monday', 'or', 'Wednesday']
The square brackets ([ ]) specify a set of characters to match. In our case, we want to match &, \, -, _, and |. Note that inside square brackets the hyphen (-) is a special character that denotes a range (as in [a-z]); to match a literal hyphen, we escape it as “\-” or place it first or last in the set.
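The following sketch shows the hyphen in each of these roles inside a character set. The second sample string and the variable names are made up for illustration.

```python
import re

str6 = r"Going&back\on-Monday_or|Wednesday"

# Hyphen escaped in the middle of the set.
escaped_hyphen = re.split(r"[&\\\-_|]", str6)
# Hyphen placed last, where it cannot form a range and needs no escape.
trailing_hyphen = re.split(r"[&\\_|-]", str6)
# Unescaped hyphen between two characters: a range, here a, b, or c.
range_split = re.split(r"[a-c]", "1a2b3c4")

print(escaped_hyphen)   # ['Going', 'back', 'on', 'Monday', 'or', 'Wednesday']
print(trailing_hyphen)  # ['Going', 'back', 'on', 'Monday', 'or', 'Wednesday']
print(range_split)      # ['1', '2', '3', '4']
```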
Example 4: Splitting with a maxsplit value
import re

string = 'Today is a good day.'
# Only the first two runs of whitespace are used as split points.
result = re.split(r"\s+", string, maxsplit=2)
print(result)
Output:
['Today', 'is', 'a good day.']
Conclusion
The re module can match complex patterns in Python strings. This article covered how to use the re.split() function to split strings. Other useful functions in re include findall(), search(), and sub(). You can read more about these functions at https://docs.python.org/3/library/re.html.