This article will cover two methods of finding common words between two Python strings:
- Method 1: Using for-loop and list comprehension,
- Method 2: Using the set intersection
We will also discuss how to deal with capitalization and punctuation marks when finding these words.
Let’s define the strings we will be using in our examples beforehand.
    str1 = "This is the first string"
    str2 = "This is another string"
    str3 = "Category spoken articles an organized list of all spoken articles"
    str4 = "Wikipedia spoken articles some general information about the spoken article technology"
Method 1: Using for-loop and list comprehension
This method accomplishes the task in two steps: split the given strings into lists of words, and loop through one of the lists, checking if a given word exists in the other list.
Here is an example.
    def FindCommonWords_ForLoop(str1, str2):
        # Convert str1 and str2 into lists of words using the str.split() function
        words_str1 = str1.split()
        words_str2 = str2.split()
        # Initialize an empty list that will hold the common words
        common = []
        # Loop through the words in words_str1
        for word in words_str1:
            # Append word to common if the word is in words_str2 as well.
            # Only append if the word is not already in the common list.
            if word in words_str2 and word not in common:
                common.append(word)
        return common

    # Call FindCommonWords_ForLoop() to find the common words in the strings.
    results1 = FindCommonWords_ForLoop(str1, str2)
    print(results1)
    results2 = FindCommonWords_ForLoop(str3, str4)
    print(results2)
Output:
['This', 'is', 'string']
['spoken', 'articles']
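The check `word in words_str2` scans a list on every iteration, which can become slow for long strings. Here is a minimal sketch, assuming the same whitespace splitting as above, that keeps the for-loop but uses sets for the membership tests. The name `find_common_words_fast` is hypothetical and used only for illustration.

```python
def find_common_words_fast(str1, str2):
    # Hypothetical variant of the for-loop approach that uses set lookups.
    words_str2 = set(str2.split())  # set membership tests are O(1) on average
    seen = set()                    # words already added to the result
    common = []
    for word in str1.split():
        if word in words_str2 and word not in seen:
            seen.add(word)
            common.append(word)     # keeps the order of first appearance in str1
    return common


print(find_common_words_fast("This is the first string", "This is another string"))
# ['This', 'is', 'string']
```

The result is the same as with the plain for-loop; only the cost of each lookup changes.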
The for-loop in the code snippet above can be reduced to a one-line list comprehension as follows:
    def FindCommonWords_LC(str1, str2):
        # Split the strings into lists of words
        words_str1 = str1.split()
        words_str2 = str2.split()
        # A list comprehension to check for common words
        # set() ensures that we keep unique words in the final common list.
        common = list(set([item for item in words_str1 if item in words_str2]))
        return common

    results_lc = FindCommonWords_LC(str1, str2)
    print(results_lc)
    results_lc2 = FindCommonWords_LC(str3, str4)
    print(results_lc2)
Output (the order of the words may vary between runs because sets are unordered):
['string', 'This', 'is']
['spoken', 'articles']
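Because `set()` does not preserve order, the list comprehension version may return the words in a different order on each run. If a stable order matters, one possible sketch is to remove duplicates with `dict.fromkeys()` instead. The helper name `find_common_words_ordered` is an assumption made for this example.

```python
def find_common_words_ordered(str1, str2):
    # Hypothetical helper: same idea as the list comprehension, but order-preserving.
    words_str2 = set(str2.split())
    matches = [word for word in str1.split() if word in words_str2]
    # dict.fromkeys() drops duplicates while keeping first-seen order
    # (dictionaries preserve insertion order in Python 3.7+).
    return list(dict.fromkeys(matches))


print(find_common_words_ordered(
    "Category spoken articles an organized list of all spoken articles",
    "Wikipedia spoken articles some general information about the spoken article technology",
))
# ['spoken', 'articles']
```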
Method 2: Using the set intersection
The intersection of two sets, A and B, denoted by A ∩ B, is the set containing all elements that belong to both A and B.
Python supports set intersection natively. For example:
    A = {1, 2, 3}  # set A
    B = {2, 3, 4}  # set B
    AB = A.intersection(B)  # Intersection between the two sets.
    print(AB)
Output:
{2, 3}
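As a brief aside, the same operation can be written with the `&` operator, and `intersection()` accepts more than one argument. The set `C` below is made up for this illustration.

```python
A = {1, 2, 3}
B = {2, 3, 4}
C = {3, 4, 5}  # extra set, introduced only for this example

both = A & B                   # operator form of A.intersection(B)
all_three = A.intersection(B, C)  # intersection() accepts several sets
from_list = A.intersection([2, 9])  # the method also accepts any iterable

print(both)       # {2, 3}
print(all_three)  # {3}
print(from_list)  # {2}
```

Note that the `&` operator requires both operands to be sets, while the `intersection()` method accepts any iterable, such as a list of words.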
Similarly, we can use the concept of intersection to find common words between two strings.
First, we split the strings into lists of words, then convert the lists into sets, and finally find their intersection.
    def FindCommonWords(str1, str2):
        # Split the strings into lists of words
        str1_words = str1.split()
        str2_words = str2.split()
        # Find the intersection of the sets of str1_words and str2_words
        # set(<list>) gets the unique words in a list
        # e.g. set(["Simon", "Allan", "Simon"]) = {"Simon", "Allan"}
        # set1.intersection(set2) gives the common words in set1 and set2
        # e.g. {"Simon", "Allan"}.intersection({"Simon", "Alice"}) = {"Simon"}
        common = list(set(str1_words).intersection(set(str2_words)))
        # If you want the words that appear in only one of the strings,
        # use the following line instead
        # unique = set(str1_words).symmetric_difference(set(str2_words))
        return common

    # Call FindCommonWords() to find common words in the strings.
    result = FindCommonWords(str1, str2)
    print(result)
    result1 = FindCommonWords(str3, str4)
    print(result1)
Output (the order of the words may vary between runs because sets are unordered):
['is', 'string', 'This']
['spoken', 'articles']
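The commented-out `symmetric_difference()` line in the function above returns the words that appear in exactly one of the two strings. Here is a short sketch of the difference between the two operations. The variable names are illustrative, and the results are sorted only to make the printed order reproducible.

```python
words1 = set("This is the first string".split())
words2 = set("This is another string".split())

common = words1 & words2    # intersection: words present in both strings
only_one = words1 ^ words2  # symmetric difference: words in exactly one string

print(sorted(common))    # ['This', 'is', 'string']
print(sorted(only_one))  # ['another', 'first', 'the']
```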
The idea of the set intersection to find common words in Python strings can also be implemented in NumPy. Here is an example.
    import numpy as np

    def FindCommonWords_Numpy(str1, str2):
        # Words of each string as NumPy arrays
        str1_array = np.array(str1.split())
        str2_array = np.array(str2.split())
        # 1-dimensional intersection using NumPy
        common = np.intersect1d(str1_array, str2_array)
        return common

    # Using variables defined at the beginning of the article.
    result = FindCommonWords_Numpy(str1, str2)
    print(result)
    result = FindCommonWords_Numpy(str3, str4)
    print(result)
Output (np.intersect1d() returns the sorted, unique common values):
['This' 'is' 'string']
['articles' 'spoken']
Dealing with Capitalization and Punctuation marks
So far, we have not discussed how to handle capitalization and punctuation marks in our strings. In the functions above, “Spoken”, “SpoKen”, and “spoken” are treated as different words, and so are “business;” and “business,”.
For example, the following code returns no common words because of punctuation and capitalization. Ideally, we would expect “spoken” and “articles” to be captured as common words:
    str7 = "Category, Spoken articles, - an organized list of; all spoken: Articles"
    str8 = "Wikipedia; Spoken, articles$- some general- information about the spoken article technology"
    # Calling the FindCommonWords() function defined earlier
    results = FindCommonWords(str7, str8)
    print(results)  # returns an empty list - no common words
We will add two more arguments to the FindCommonWords() function, remove_punctuation and case_sensitive, to control how the search handles punctuation and capitalization.
    import string

    str7 = "Category, Spoken articles, - an organized list of; all spoken: Articles"
    str8 = "Wikipedia; Spoken, articles$- some general- information about the spoken article technology"

    def FindCommonWords(str1, str2, remove_punctuation=True, case_sensitive=True):
        if remove_punctuation:
            # Remove all punctuation marks from the strings
            table = str.maketrans("", "", string.punctuation)
            str1 = str1.translate(table)
            str2 = str2.translate(table)
        if not case_sensitive:
            # Convert all characters in the strings to lowercase
            str1 = str1.lower()
            str2 = str2.lower()
        # Split the strings into lists of words
        str1_words = str1.split()
        str2_words = str2.split()
        common = list(set(str1_words).intersection(set(str2_words)))
        return common

    # Case-sensitive search, punctuation marks not removed
    result = FindCommonWords(str7, str8, case_sensitive=True, remove_punctuation=False)
    print(result)
    # Case-insensitive search, punctuation marks not removed
    result = FindCommonWords(str7, str8, case_sensitive=False, remove_punctuation=False)
    print(result)
    # Case-insensitive search, punctuation marks removed
    result = FindCommonWords(str7, str8, case_sensitive=False, remove_punctuation=True)
    print(result)
Output (the order of the words in the last list may vary between runs):
[]
['spoken']
['articles', 'spoken']
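One caveat: deleting punctuation with `str.translate()` joins words that are separated only by punctuation, so `"well-known"` becomes `"wellknown"`. An alternative sketch, assuming a word is a run of letters, digits, or underscores, extracts the words with `re.findall()` instead. The name `find_common_words_regex` is hypothetical and used only for illustration.

```python
import re


def find_common_words_regex(str1, str2, case_sensitive=False):
    # Hypothetical alternative: extract words with a regular expression
    # instead of deleting punctuation and then splitting on whitespace.
    if not case_sensitive:
        str1, str2 = str1.lower(), str2.lower()
    words1 = set(re.findall(r"\w+", str1))
    words2 = set(re.findall(r"\w+", str2))
    return sorted(words1 & words2)  # sorted for a reproducible order


str7 = "Category, Spoken articles, - an organized list of; all spoken: Articles"
str8 = "Wikipedia; Spoken, articles$- some general- information about the spoken article technology"
print(find_common_words_regex(str7, str8))
# ['articles', 'spoken']
```

With this approach, `"well-known"` is split into `"well"` and `"known"`. Whether that is preferable depends on how hyphenated words should be counted in your text.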
Conclusion
This article discussed two methods of finding common words in two Python strings: a for-loop (or list comprehension) and set intersection. We also discussed how to handle capitalization and punctuation marks in the strings.