When working on an NLP project, removing punctuation marks from text data is a crucial pre-processing step: punctuation usually carries little useful information, so stripping it out helps filter noise from the data.
This article discusses three methods you can use to remove punctuation marks when working with the NLTK package (a widely used Python module for NLP):
- Method 1: Using the nltk.tokenize.RegexpTokenizer() class,
- Method 2: Using the re package, and
- Method 3: Using the str.translate() and str.maketrans() functions
Method 1: Using the RegexpTokenizer() Class in the nltk Module
The nltk.tokenize.RegexpTokenizer(<pattern>) class tokenizes a string using a regular expression.
To remove punctuation from the string, we will use the pattern "[a-zA-Z0-9]+", which captures alphanumeric characters only, that is, letters a to z (lowercase and uppercase) and digits 0 through 9. Let's see an example.
from nltk.tokenize import RegexpTokenizer

str1 = "Think @%,^and analyze, analyze$ and& think. The_ loop*"

# Create a tokenizer based on a regular expression.
# "[a-zA-Z0-9]+" captures all alphanumeric characters.
tokenizer = RegexpTokenizer(r"[a-zA-Z0-9]+")

# Tokenize str1
words1 = tokenizer.tokenize(str1)
print(words1)
Output:
['Think', 'and', 'analyze', 'analyze', 'and', 'think', 'The', 'loop']
The output clearly shows that the punctuation marks were eliminated.
Note, however, that in str1 no punctuation mark appears inside a word. Let's see what happens when punctuation marks appear within words.
from nltk.tokenize import RegexpTokenizer

# Define the same tokenizer as before so this snippet runs on its own.
tokenizer = RegexpTokenizer(r"[a-zA-Z0-9]+")

str2 = "Good @*,^muf$fins cost $3 in New #Yo^rk. Please *buy me two of them. Thanks."

words2 = tokenizer.tokenize(str2)
print(words2)
Output:
['Good', 'muf', 'fins', 'cost', '3', 'in', 'New', 'Yo', 'rk', 'Please', 'buy', 'me', 'two', 'of', 'them', 'Thanks']
The output shows that words containing punctuation marks inside them were split at the punctuation. For example, "muf$fins" became ["muf", "fins"]. If this is not what you expected, check the next method.
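If you want to stay with RegexpTokenizer but keep such words intact, one workaround (a minimal sketch of my own, which borrows the punctuation regex introduced in Method 2 below) is to delete the punctuation characters before tokenizing:

import re
from nltk.tokenize import RegexpTokenizer

str2 = "Good @*,^muf$fins cost $3 in New #Yo^rk. Please *buy me two of them. Thanks."

# Delete punctuation characters first so that words stay whole,
# then tokenize what is left.
tokenizer = RegexpTokenizer(r"[a-zA-Z0-9]+")
words = tokenizer.tokenize(re.sub(r"[^\w\s]|_", "", str2))
print(words)
# ['Good', 'muffins', 'cost', '3', 'in', 'New', 'York', 'Please',
#  'buy', 'me', 'two', 'of', 'them', 'Thanks']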
Method 2: Using the re Package
This method uses the re.sub(pattern, repl, string) function to remove all punctuation marks (those matched by pattern) from string by replacing them with repl.
In particular, we will use the pattern "[^\w\s]|_". This pattern matches either an underscore (_) or [^\w\s], that is, any character that is neither a word character (\w, which covers alphanumerics and the underscore) nor whitespace (\s). The underscore has to be listed explicitly because \w would otherwise leave it in place.
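As a quick sanity check (a small sketch of my own, using a made-up sample string), re.findall() shows exactly which characters the pattern matches and re.sub() would therefore delete:

import re

# Every character listed here is one that re.sub() would delete.
print(re.findall(r"[^\w\s]|_", "Good @*,^muf$fins cost $3. The_ loop*"))
# ['@', '*', ',', '^', '$', '$', '.', '_', '*']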
Note: In the following example, you do not need to use the NLTK tokenizer. Once the punctuation has been stripped, you can simply split the string into a list of words (see the comments in the code).
import re
import nltk

str1 = "Think @%,^and analyze, analyze$ and& think. The_ loop*"
str2 = "Good @*,^muf$fins cost $3 in N_ew #Yo^rk. Please *buy me two of them. Thanks."

# Download the Punkt sentence tokenization model for English.
nltk.download("punkt")

# Replace punctuation marks with an empty string.
str1 = re.sub(r"[^\w\s]|_", "", str1)

# Tokenize the result.
words = nltk.tokenize.word_tokenize(str1)

# If you do not want to use the nltk.tokenize.word_tokenize()
# function, you can simply split str1 into a list of words using
# the following line:
# words = str1.split()
print(words)

str2 = re.sub(r"[^\w\s]|_", "", str2)
words = nltk.tokenize.word_tokenize(str2)
print(words)
Output:
[nltk_data] Downloading package punkt to /home/kiprono/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
['Think', 'and', 'analyze', 'analyze', 'and', 'think', 'The', 'loop']
['Good', 'muffins', 'cost', '3', 'in', 'New', 'York', 'Please', 'buy', 'me', 'two', 'of', 'them', 'Thanks']
Method 3: Using the str.translate() and str.maketrans() Functions
This method does not use the NLTK module to remove punctuation marks. Instead, it uses string.punctuation to get all the punctuation marks and removes them with str.translate(), passing a translation table built by str.maketrans() that maps each punctuation character to None. Here is an example.
import string

str1 = "Think @%,^and analyze, analyze$ and& think. The loop*"

# Remove punctuation marks using str.translate() by replacing
# punctuation marks (string.punctuation) with "".
processed_str1 = str1.translate(str.maketrans("", "", string.punctuation))
print("Without punctuation marks: ", processed_str1)
print("Punctuation marks that can be removed: ", string.punctuation)

# Split the string into a list of words.
words = processed_str1.split()
print(words)

str2 = "Good @*,^muf$fins cost $3 in New #Yo^rk. Please *buy me two of them. Thanks."
processed_str2 = str2.translate(str.maketrans("", "", string.punctuation))
print("Without punctuation marks: ", processed_str2)

# Split the string into a list of words.
words2 = processed_str2.split()
print(words2)
Output:
Without punctuation marks:  Think and analyze analyze and think The loop
Punctuation marks that can be removed:  !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
['Think', 'and', 'analyze', 'analyze', 'and', 'think', 'The', 'loop']
Without punctuation marks:  Good muffins cost 3 in New York Please buy me two of them Thanks
['Good', 'muffins', 'cost', '3', 'in', 'New', 'York', 'Please', 'buy', 'me', 'two', 'of', 'them', 'Thanks']
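One caveat worth noting (my addition, not part of the original examples): string.punctuation contains only the 32 ASCII punctuation characters, so Unicode punctuation such as curly quotes or the ellipsis character survives the translation. A minimal sketch:

import string

# string.punctuation covers ASCII punctuation only, so Unicode
# punctuation (curly quotes, the … ellipsis) is left untouched.
s = "“Hello”, world…"
print(s.translate(str.maketrans("", "", string.punctuation)))
# “Hello” world…

Only the ASCII comma is removed here; if your text contains Unicode punctuation, you would need to extend the translation table yourself.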
Conclusion
This post discussed three methods for removing punctuation marks when working with natural language. If you must use the NLTK module to do the job, use Method 1, keeping in mind that it splits words that contain punctuation inside them; otherwise, either of the other two methods will work.