This article covers four things: a wrong way of finding duplicated words, how to correctly find repeated words in a Python string, how to count the occurrences of the duplicated words, and, lastly, how to deal with punctuation when counting duplicated words in a Python string.
A Wrong Way to Find Duplicated Words in a Python String
The first function that may come to mind when counting words in a string is the <str>.count(<sub>) method. The <sub> in this method is a substring (NOT a word) to look for in <str>.
The following code shows how the method can yield wrong results.
```python
str1 = "some text here"
str2 = "some exercise makes one tiresome"

# Correct result from <str>.count()
print(str1.count("some"))  # prints 1

# Wrong result from <str>.count()
print(str2.count("some"))  # prints 2
```
The <str>.count(<sub>) call in the second example returned a wrong result because it also counted the substring “some” inside the word “tiresome”.
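To make the difference between substring matches and whole-word matches concrete, here is a minimal sketch. The helper name count_whole_word is hypothetical (it is not part of the original article), and the sketch assumes that words are separated by whitespace only.

```python
def count_whole_word(text, word):
    """Count how many whitespace-separated tokens equal `word` exactly."""
    return text.split().count(word)


str2 = "some exercise makes one tiresome"

print(str2.count("some"))              # substring matches: 2
print(count_whole_word(str2, "some"))  # whole-word matches: 1
```

Here list.count() compares whole tokens, so “tiresome” no longer contributes to the result.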
Here are two methods that give the correct results:
- Method 1: Using a for-loop and Python dictionary, and
- Method 2: Using collections.Counter() method
Method 1: Using a for-loop and Python Dictionary
This method counts duplicated words in five steps: turning the Python string into a list of words, looping through the words in the list, counting the occurrences of each word, filtering the results to keep the duplicates, and, lastly, sorting the result so that the most duplicated word comes first.
Here is the code to do this, with comments explaining each step.
```python
def count_occurence1(str1):
    """
    Input: str1 - A Python string
    Output: A sorted dictionary with the count of duplicated words.
    """
    # Split the sentence into words along white spaces
    words = str1.split()
    # Initialize an empty dictionary to hold our counts.
    counts = {}
    # Loop through each word and count the occurrences
    for word in words:
        # If counts.get(word) returns None, it is the first time
        # we see that word. In that case, create a key:value pair
        # in counts where the key is the word and value=1 (first
        # occurrence). Otherwise the word already exists, so
        # increase its count by 1.
        if counts.get(word) is None:
            counts[word] = 1
        else:
            counts[word] += 1
    # We want the repeated words, that is, the key:value pairs
    # for which value > 1
    duplicated = {key: value for key, value in counts.items() if value > 1}
    # Sort the resulting dictionary based on item[1], which is the value.
    # reverse=True means sort in descending order.
    # Reasoning: dicts preserve insertion order in Python 3.7 and later.
    sorted_counts = dict(
        sorted(duplicated.items(), key=lambda item: item[1], reverse=True)
    )
    # Return results
    return sorted_counts


# Initialize the first string
str1 = (
    "He took three thousand and three hundred and "
    "and thirty three and he needs more."
)
# Call the count_occurence1() function to find duplicates
result1 = count_occurence1(str1)
# Print results
print(result1)

str2 = "Some sentence here. Some sentence that will be checked."
results2 = count_occurence1(str2)
print(results2)
```
Output:
```
{'and': 4, 'three': 3}
{'Some': 2, 'sentence': 2}
```
Alternatively, you can use counts = collections.defaultdict(int) instead of initializing an empty dictionary as counts = {}. A defaultdict creates any key that does not exist when you try to access it.
The int argument is the default factory: a missing key is created with the value int(), which is 0, so counts[word] += 1 works even the first time a word is seen. It does not restrict the values to integers. See below.
```python
import collections


def count_occurence_1a(str1):
    """
    Input: str1: String we wish to process,
    Returns: A dictionary of the counts for duplicated words.
    """
    words = str1.split()
    # Missing keys start at int() == 0, so += 1 always works
    counts = collections.defaultdict(int)
    for word in words:
        counts[word] += 1
    # Keep only the words that occur more than once
    counts = {key: value for key, value in counts.items() if value > 1}
    # Sort by count, most frequent first
    sort_counts = dict(sorted(counts.items(), key=lambda item: item[1], reverse=True))
    return sort_counts


# Initialize the first string
str1 = (
    "He took three thousand and three hundred and "
    "and thirty three and he needs more."
)
result1 = count_occurence_1a(str1)
print(result1)

str2 = "Some sentence here. Some sentence that will be checked."
results2 = count_occurence_1a(str2)
print(results2)
```
Output:
```
{'and': 4, 'three': 3}
{'Some': 2, 'sentence': 2}
```
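The behaviour of defaultdict(int) can be checked in isolation. The short sketch below is only an illustration; the keys used are arbitrary examples and not part of the functions above.

```python
import collections

counts = collections.defaultdict(int)

# Accessing a missing key calls int(), which returns 0,
# and stores that value under the key.
print(counts["three"])  # 0

counts["three"] += 1
counts["and"] += 2
print(dict(counts))     # {'three': 1, 'and': 2}

# The int factory only supplies defaults for missing keys;
# values of other types can still be stored.
counts["note"] = "text"
print(counts["note"])   # text
```

This is why the counting loop needs no explicit check for the first occurrence of a word.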
Method 2: Using collections.Counter() Method
This method relies on collections.Counter(<iterable>), which counts how many times each item occurs in an iterable, here the list of words. Once the counts are generated, we filter the duplicates and sort the resulting dictionary.
Here is an example in code.
```python
import collections


def count_occurence2(str1):
    """
    Input: str1: String we wish to process,
    Output: A dictionary of the counts for duplicated words.
    """
    words = str1.split()
    # collections.Counter() counts the occurrences of
    # each word in the 'words' list
    counts = collections.Counter(words)
    # Get the duplicates: the key:value pairs for
    # which the value is at least two occurrences
    duplicated = {key: value for key, value in counts.items() if value >= 2}
    # Sort the dictionary using the values.
    sorted_counts = dict(
        sorted(duplicated.items(), key=lambda item: item[1], reverse=True)
    )
    return sorted_counts


# Initialize the first string
str1 = (
    "He took three thousand and three hundred and "
    "and thirty three and he needs more."
)
# Call the count_occurence2() function to find duplicates
result1 = count_occurence2(str1)
# Print results
print(result1)

str2 = "Some sentence here. Some sentence that will be checked."
results2 = count_occurence2(str2)
print(results2)
```
Output:
```
{'and': 4, 'three': 3}
{'Some': 2, 'sentence': 2}
```
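As a possible variation on Method 2, Counter.most_common() already returns the items ordered from the most to the least frequent, so the explicit sorted() call can be skipped. The sketch below is an alternative, not the article's function; the name duplicated_most_common is hypothetical, and it returns a list of (word, count) pairs rather than a dictionary.

```python
import collections


def duplicated_most_common(text):
    """Return (word, count) pairs for repeated words, most frequent first."""
    counts = collections.Counter(text.split())
    # most_common() with no argument yields every item, highest count first
    return [(word, n) for word, n in counts.most_common() if n > 1]


str2 = "Some sentence here. Some sentence that will be checked."
print(duplicated_most_common(str2))
# [('Some', 2), ('sentence', 2)]
```

Wrapping the result in dict() gives the same kind of dictionary that count_occurence2() returns.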
Dealing with Punctuation when Counting Duplicated Words
Capitalization and punctuation marks in strings may cause the above methods to return unexpected results.
Let’s call the count_occurence2() function created in Method 2 to see possible problems with punctuation.
```python
str3 = "Some sentence here. Some some sentence that will be checked."
results3 = count_occurence2(str3)
print(results3)

str4 = "The rules are to be followed. The rules, in some cases, are$"
results4 = count_occurence2(str4)
print(results4)
```
Output:
```
{'Some': 2, 'sentence': 2}
{'The': 2}
```
The collections.Counter() used in the count_occurence2() function is case sensitive, which is why “Some” and “some” are treated as different words in the first example.
In the second case, punctuation marks affected our count: “rules,” and “are$” are treated as words distinct from “rules” and “are”. Ideally, we want to remove punctuation marks so that “rules” and “are” each have two occurrences.
To add this functionality, we include two more arguments in our function, case_sensitive and strip_punctuation, and then add the following lines at the beginning of the function. See the complete code after this snippet.
```python
# Requires "import string" at the top of the file
if not case_sensitive:
    # Turn the string into lowercase so that, for example,
    # "Some" and "some" are counted as the same word
    str1 = str1.lower()
if strip_punctuation:
    # Remove punctuation marks
    str1 = str1.translate(str.maketrans("", "", string.punctuation))
```
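The punctuation-stripping step can be tried on its own. The following sketch only illustrates what str.maketrans() and str.translate() do with string.punctuation; the variable names table and cleaned are chosen here for illustration.

```python
import string

# With three arguments, str.maketrans() builds a table in which
# every character of the third argument is mapped to None (deleted).
table = str.maketrans("", "", string.punctuation)

cleaned = "The rules, in some cases, are$".translate(table)
print(cleaned)          # The rules in some cases are
print(cleaned.split())  # ['The', 'rules', 'in', 'some', 'cases', 'are']
```

After this step, “rules,” becomes “rules” and “are$” becomes “are”, so they are counted together with the other occurrences of those words.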
The full code:
```python
import collections
import string


def count_occurence2_updated(str1, case_sensitive=False, strip_punctuation=True):
    """
    Input: str1: String we wish to process,
    Output: A dictionary of the counts for duplicated words.
    """
    if not case_sensitive:
        # Turn the string into lowercase so that, for example,
        # "Some" and "some" are counted as the same word
        str1 = str1.lower()
    if strip_punctuation:
        # Remove punctuation marks
        str1 = str1.translate(str.maketrans("", "", string.punctuation))
    words = str1.split()
    # collections.Counter() counts the occurrences of
    # each word in the 'words' list
    counts = collections.Counter(words)
    # Get the duplicates: the key:value pairs for
    # which the value is at least two occurrences
    duplicated = {key: value for key, value in counts.items() if value >= 2}
    # Sort the dictionary using the values.
    sorted_counts = dict(
        sorted(duplicated.items(), key=lambda item: item[1], reverse=True)
    )
    return sorted_counts


str3 = "Some sentence here. Some some sentence that will be checked."
results3a = count_occurence2_updated(str3, case_sensitive=False)
print(results3a)
results3b = count_occurence2_updated(str3, case_sensitive=True)
print(results3b)

str4 = "The rules are to be followed. The rules, in some cases, are$"
results4a = count_occurence2_updated(str4, case_sensitive=True, strip_punctuation=True)
print(results4a)
results4b = count_occurence2_updated(str4, case_sensitive=True, strip_punctuation=False)
print(results4b)
```
Output:
```
{'some': 3, 'sentence': 2}
{'Some': 2, 'sentence': 2}
{'The': 2, 'rules': 2, 'are': 2}
{'The': 2}
```
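One side effect of deleting every character in string.punctuation is that apostrophes and hyphens inside words disappear too, so “don't” becomes “dont”. If that matters, a regular expression can extract the words instead. The sketch below is one possible alternative, not the article's method: the function name count_duplicates_regex is hypothetical, it always ignores case, and its pattern assumes words made of ASCII letters and digits with optional inner apostrophes.

```python
import collections
import re


def count_duplicates_regex(text):
    """Count repeated words, keeping inner apostrophes as part of a word."""
    # Lowercase first so "Some" and "some" match; the pattern
    # treats "don't" as a single token and drops other punctuation.
    words = re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", text.lower())
    counts = collections.Counter(words)
    # most_common() orders the items from most to least frequent
    return {word: n for word, n in counts.most_common() if n > 1}


print(count_duplicates_regex("Don't stop. don't, ever, stop!"))
# {"don't": 2, 'stop': 2}
```

The trade-off is that the definition of a word now lives in the pattern, which would need adjusting for non-ASCII text or hyphenated words.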
Conclusion
Avoid using the str.count() method when counting words in a Python string, because it counts substrings, not words. If you want to count the duplicated words, use either of the two methods discussed in the article.
If you want to deal with capitalization and punctuation marks appropriately, use the count_occurence2_updated() function discussed in the last section. The function allows you to decide whether the counting should be case-sensitive and whether to strip punctuation marks.