Remove Punctuation with NLTK in Python

When working on an NLP project, removing punctuation marks from text data is a common pre-processing step, because punctuation marks carry little useful information for most word-level tasks.

This article discusses three methods you can use to remove punctuation marks when working with the NLTK package (a widely used library for NLP) in Python:

  • Method 1: Using the nltk.tokenize.RegexpTokenizer class,
  • Method 2: Using the re package, and
  • Method 3: Using the <str>.translate() and str.maketrans() functions

Method 1: Using the RegexpTokenizer Class in the nltk Module

The nltk.tokenize.RegexpTokenizer(<pattern>) class tokenizes a string using a regular expression. By default, its tokenize() method returns every substring that matches the pattern.

To remove punctuation from the string, we will use the pattern “[a-zA-Z0-9]+”, which matches runs of alphanumeric characters only, that is, letters a to z (lowercase and uppercase) and digits 0 through 9. Let’s see an example.
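
A minimal sketch of such an example follows. The input sentence is an assumption (it is not taken from the original listing) and was chosen so that the result matches the output shown next.

```python
from nltk.tokenize import RegexpTokenizer

# Assumed input sentence, used only for illustration.
text = "Think and analyze, analyze and think. The loop."

# Keep only runs of ASCII letters and digits; commas, periods, and
# whitespace never match the pattern, so they are dropped.
tokenizer = RegexpTokenizer(r"[a-zA-Z0-9]+")
tokens = tokenizer.tokenize(text)
print(tokens)
```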

Output:

['Think', 'and', 'analyze', 'analyze', 'and', 'think', 'The', 'loop']

The output clearly shows that punctuation marks were eliminated.

However, note that in this example no punctuation marks appear inside the words. Let’s see what happens when punctuation marks do appear within words.
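
The sketch below illustrates this case. The input sentence, with stray punctuation placed inside “muffins” and “York”, is a hypothetical one chosen to match the output that follows.

```python
from nltk.tokenize import RegexpTokenizer

# Hypothetical input: "$" sits inside "muffins" and "," inside "York".
text = "Good muf$fins cost $3 in New Yo,rk. Please buy me two of them. Thanks."

# Every non-alphanumeric character ends the current token, so a word
# with punctuation in the middle is returned as two separate tokens.
tokenizer = RegexpTokenizer(r"[a-zA-Z0-9]+")
tokens = tokenizer.tokenize(text)
print(tokens)
```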

Output:

['Good', 'muf', 'fins', 'cost', '3', 'in', 'New', 'Yo', 'rk', 'Please', 'buy', 'me', 'two', 'of', 'them', 'Thanks']

The output shows that the words containing punctuation marks were split in two at the punctuation mark. For example, “muffins” became [“muf”, “fins”]. If this is not the behavior you want, see the next method.

Method 2: Using the re Package

This method uses the re.sub(pattern, repl, string) function to remove all punctuation marks (defined by the pattern) from the string by replacing each match with repl, which here is the empty string.

In particular, we will use the pattern “[^\w\s]|_”. This pattern matches either an underscore (_) or [^\w\s], that is, any character other than \w (alphanumerics and underscore) and whitespace. The underscore has to be listed separately because \w includes it.
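
To see why the “|_” alternative matters, here is a small sketch that compares the pattern with and without it. The sample string is made up for illustration.

```python
import re

# Hypothetical sample containing an underscore and ordinary punctuation.
sample = "snake_case, isn't it?"

# Without the "|_" alternative the underscore survives, because \w matches it.
without_alt = re.sub(r"[^\w\s]", "", sample)

# With the alternative the underscore is removed as well.
with_alt = re.sub(r"[^\w\s]|_", "", sample)

print(without_alt)  # snake_case isnt it
print(with_alt)     # snakecase isnt it
```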

Note: In the following example, you do not need to use the NLTK tokenizer. After stripping the punctuation, you can simply split the string into a list of words (see the comments in the code).
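
A minimal sketch of this approach is shown below. The two input sentences and the helper name remove_punctuation are assumptions made for illustration. The sketch splits on whitespace instead of calling the NLTK tokenizer, so it does not print the [nltk_data] download messages that appear in the output that follows.

```python
import re

def remove_punctuation(text):
    """Delete underscores and anything that is not a word character or whitespace."""
    return re.sub(r"[^\w\s]|_", "", text)

# Assumed input sentences, used only for illustration.
sentences = [
    "Think and analyze, analyze and think. The loop.",
    "Good muf$fins cost $3 in New Yo,rk. Please buy me two of them. Thanks.",
]

for sentence in sentences:
    cleaned = remove_punctuation(sentence)
    # Splitting on whitespace is enough once the punctuation is gone.
    # Alternative, assuming NLTK and its tokenizer data are installed:
    #   import nltk
    #   words = nltk.word_tokenize(cleaned)
    words = cleaned.split()
    print(words)
```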

Output:

[nltk_data] Downloading package punkt to /home/kiprono/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
['Think', 'and', 'analyze', 'analyze', 'and', 'think', 'The', 'loop']
['Good', 'muffins', 'cost', '3', 'in', 'New', 'York', 'Please', 'buy', 'me', 'two', 'of', 'them', 'Thanks']

Method 3: Using <str>.translate() and str.maketrans() Functions

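This method does not use the NLTK module to remove punctuation marks. Instead, it uses string.punctuation to get the ASCII punctuation characters, builds a translation table with str.maketrans() that maps each of them to None, and deletes them from the string with <str>.translate(). Here is an example.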
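
The following is a minimal sketch of how this can look. The input sentences and the helper name strip_punctuation are assumptions. Note that string.punctuation covers ASCII punctuation only, so characters such as curly quotes are left in place.

```python
import string

# Translation table mapping every ASCII punctuation character to None;
# the third argument of str.maketrans() lists characters to delete.
table = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    """Return text with all ASCII punctuation characters deleted."""
    return text.translate(table)

# Assumed input sentences, used only for illustration.
text1 = "Think and analyze, analyze and think. The loop."
text2 = "Good muf$fins cost $3 in New Yo,rk. Please buy me two of them. Thanks."

clean1 = strip_punctuation(text1)
print("Without punctuation marks: ", clean1)
print("Punctuation marks that can be removed: ", string.punctuation)
print(clean1.split())

clean2 = strip_punctuation(text2)
print("Without punctuation marks: ", clean2)
print(clean2.split())
```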

Output:

Without punctuation marks:  Think and analyze analyze and think The loop
Punctuation marks that can be removed:  !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
['Think', 'and', 'analyze', 'analyze', 'and', 'think', 'The', 'loop']
Without punctuation marks:  Good muffins cost 3 in New York Please buy me two of them Thanks
['Good', 'muffins', 'cost', '3', 'in', 'New', 'York', 'Please', 'buy', 'me', 'two', 'of', 'them', 'Thanks']

Conclusion

This post discussed three methods for removing punctuation marks when working with natural-language text. If you must use the NLTK module to do the job, use Method 1, keeping in mind that it splits words that contain punctuation marks; otherwise, you can use either of the other two methods.