Find Repeated Words in a String in Python

This article covers four things: a wrong way of finding duplicated words, how to correctly find repeated words in a Python string, how to count the occurrences of the duplicated words, and, lastly, how to deal with punctuation when counting duplicated words.

A Wrong Way to Find Duplicated Words in a Python String

The first tool that may come to mind when counting words in a string is the <str>.count(<sub>) method. The <sub> in this method is a substring (NOT a word) to look for in <str>.

The following code shows how the method can yield wrong results.
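
A minimal sketch of the problem is given here. The sample strings and the variable names text1 and text2 are assumptions chosen for illustration.

```python
# Sample strings are assumptions chosen for illustration.
text1 = "the cat and the dog"
text2 = "some tasks are tiresome but some are fun"

# Example 1: "the" occurs only as a whole word, so the result is right.
print(text1.count("the"))   # 2

# Example 2: "some" appears twice as a word, but str.count() also
# matches the "some" at the end of "tiresome".
print(text2.count("some"))  # 3

# Splitting into words first and counting list items gives the word count.
print(text2.split().count("some"))  # 2
```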

The <str>.count(<sub>) call in the second example returned a wrong result because it also counted the substring “some” inside the word “tiresome”.

Here are two methods that give the correct results:

  • Method 1: Using a for-loop and Python dictionary, and
  • Method 2: Using collections.Counter()

Method 1: Using a for-loop and Python Dictionary

This method accomplishes the task of counting duplicated words in five steps: turning the Python string into a list of words, looping through the words in the list, counting the occurrences of each word, filtering the results to keep only the duplicates, and, lastly, sorting the result so that the most duplicated word comes first.

Here is the code to do this, with comments explaining each step.
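
The listing below is a sketch of this approach. The function name count_occurence1 and the sample strings str1 and str2 are assumptions, chosen so that the calls reproduce the output shown after the code.

```python
def count_occurence1(text):
    # Turn the string into a list of words by splitting on whitespace.
    words = text.split()

    # Loop through the words and count how often each one occurs.
    counts = {}
    for word in words:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1

    # Keep only the words that occur more than once.
    duplicates = {word: n for word, n in counts.items() if n > 1}

    # Sort by count, highest first, and return the result as a dictionary.
    ordered = sorted(duplicates.items(), key=lambda item: item[1], reverse=True)
    return dict(ordered)


# Sample strings are assumptions chosen for illustration.
str1 = "one and two and three and four and three three"
str2 = "Some sentence here Some other sentence and some more"

print(count_occurence1(str1))
print(count_occurence1(str2))
```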

Output:

{'and': 4, 'three': 3}
{'Some': 2, 'sentence': 2}

Alternatively, you can use counts=collections.defaultdict(int) instead of initializing an empty dictionary as counts={}. The former creates any item that does not exist when you try to access it.

In counts=collections.defaultdict(int), int is the default factory: when a missing key is accessed, the dictionary calls int(), which returns 0, and stores that as the key's initial value. It does not restrict the values to integers. See below.
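
A sketch of the same function written with defaultdict(int) follows. Each missing word starts at int(), that is 0, so the membership check is no longer needed. The function name count_occurence1_defaultdict and the sample strings are assumptions for illustration.

```python
from collections import defaultdict


def count_occurence1_defaultdict(text):
    # Missing keys are created with the value int() == 0 on first access.
    counts = defaultdict(int)
    for word in text.split():
        counts[word] += 1

    # Keep only the duplicated words, then sort by count, highest first.
    duplicates = {word: n for word, n in counts.items() if n > 1}
    ordered = sorted(duplicates.items(), key=lambda item: item[1], reverse=True)
    return dict(ordered)


# Sample strings are assumptions chosen for illustration.
str1 = "one and two and three and four and three three"
str2 = "Some sentence here Some other sentence and some more"

print(count_occurence1_defaultdict(str1))
print(count_occurence1_defaultdict(str2))
```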

Output:

{'and': 4, 'three': 3}
{'Some': 2, 'sentence': 2}

Method 2: Using collections.Counter()

This method uses collections.Counter(<iterable>), which does the work of counting word occurrences in a Python string. Once the counts are generated, we filter the duplicates and sort the resulting dictionary.

Here is an example in code.
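
A sketch of this method is shown below. The function name count_occurence2 is the one the article uses later, while the body and the sample strings are assumptions for illustration.

```python
from collections import Counter


def count_occurence2(text):
    # Counter tallies every word produced by splitting on whitespace.
    counts = Counter(text.split())

    # most_common() yields (word, count) pairs from the highest count to
    # the lowest, so filtering it keeps the result sorted.
    return {word: n for word, n in counts.most_common() if n > 1}


# Sample strings are assumptions chosen for illustration.
str1 = "one and two and three and four and three three"
str2 = "Some sentence here Some other sentence and some more"

print(count_occurence2(str1))
print(count_occurence2(str2))
```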

Output:

{'and': 4, 'three': 3}
{'Some': 2, 'sentence': 2}

Dealing with Punctuation when Counting Duplicated Words

Capitalization and punctuation marks in strings may cause the above methods to return unexpected results.

Let’s call the count_occurence2() function created in Method 2 to see possible problems with punctuation.
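
The calls might look like the sketch below. The function is repeated so the block runs on its own, and the sample strings str2 and str3 are assumptions chosen to show a mixed-case word and words with punctuation attached.

```python
from collections import Counter


def count_occurence2(text):
    # Same sketch as in Method 2: count the words, keep the duplicates.
    counts = Counter(text.split())
    return {word: n for word, n in counts.most_common() if n > 1}


# Sample strings are assumptions chosen for illustration.
str2 = "Some sentence here Some other sentence and some more"
str3 = "The rules, The rules are$ what they are"

# "Some" and "some" are counted separately.
print(count_occurence2(str2))

# "rules," and "are$" are counted separately from "rules" and "are".
print(count_occurence2(str3))
```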

Output:

{'Some': 2, 'sentence': 2}
{'The': 2}

The collections.Counter() used in the count_occurence2() function is case sensitive, which is why “Some” and “some” are treated as different words in the first example.

In the second case, punctuation marks affected our count: “rules,” and “are$” are counted as words distinct from “rules” and “are”. Ideally, we want to remove the punctuation marks so that “rules” and “are” each have two occurrences.

To add this functionality, we need to add two more parameters to our function, case_sensitive and strip_punctuation, and then add the following lines at the beginning of the function. See the complete code after this snippet.
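
One way the added lines might look is sketched here. They are wrapped in a hypothetical helper named preprocess only so that the snippet runs on its own. The default values and the use of string.punctuation with str.translate() are assumptions.

```python
import string


def preprocess(text, case_sensitive=True, strip_punctuation=False):
    # Hypothetical wrapper: in the real function these lines would sit
    # at the top, before the words are counted.
    if not case_sensitive:
        # Lowercase everything so "Some" and "some" become the same word.
        text = text.lower()
    if strip_punctuation:
        # Delete every character listed in string.punctuation.
        text = text.translate(str.maketrans("", "", string.punctuation))
    return text


print(preprocess("The rules, are$", case_sensitive=False, strip_punctuation=True))
```

Note that this way of stripping punctuation also removes apostrophes and hyphens inside words.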

The full code:
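
The sketch below puts the pieces together. The function name count_occurence2_updated comes from the article. The default parameter values, the body, and the sample strings are assumptions, chosen so that the four calls reproduce the output shown after the code.

```python
import string
from collections import Counter


def count_occurence2_updated(text, case_sensitive=True, strip_punctuation=False):
    if not case_sensitive:
        # Lowercase everything so "Some" and "some" become the same word.
        text = text.lower()
    if strip_punctuation:
        # Delete every character listed in string.punctuation.
        text = text.translate(str.maketrans("", "", string.punctuation))

    # Count the words, then keep the duplicates, highest count first.
    counts = Counter(text.split())
    return {word: n for word, n in counts.most_common() if n > 1}


# Sample strings are assumptions chosen for illustration.
str2 = "Some sentence here Some other sentence and some more"
str3 = "The rules, The rules are$ what they are"

print(count_occurence2_updated(str2, case_sensitive=False))
print(count_occurence2_updated(str2, case_sensitive=True))
print(count_occurence2_updated(str3, strip_punctuation=True))
print(count_occurence2_updated(str3, strip_punctuation=False))
```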

Output:

{'some': 3, 'sentence': 2}
{'Some': 2, 'sentence': 2}
{'The': 2, 'rules': 2, 'are': 2}
{'The': 2}

Conclusion

Avoid using the str.count() method when counting words in a Python string. The method counts substrings, not words. If you want to count the duplicated words, use either of the two methods discussed in the article.

If you want to deal with capitalization and punctuation marks appropriately, use the count_occurence2_updated() function discussed in the last section. The function allows you to decide whether the counting should be case-sensitive and whether to strip punctuation marks.