One of the most helpful and powerful libraries I’ve used so far in this process has been NLTK, which contains a bunch of modules for natural language processing. Below, I’ll outline a few basic uses.
In order to maximize consistency when processing a text dataset, it is vital to pre-process and clean it adequately so that when it is fed into our classifier, there are no inconsistencies or potentially algorithm-breaking bits that ruin our runs. Cleaning up a dataset by hand, going word by word, would be an absolutely hellish job, but thankfully we have trusty ol’ NLTK.
Let’s examine NLTK for the purpose of creating something like a bag of words representation. A bag of words is a representation in which the individual words of our text dataset are split into their own elements. Oftentimes, a bag of words representation will disregard grammar and the order of words. Although it seems a little clunky, it is very useful with machine learning algorithms and classifiers, as those algorithms often have issues with text-based data. Our bag of words representation allows us to convert these basic elements of text into vectors of numbers, which algorithms and classifiers are very happy to take in.
Breaking a massive text file down into a bunch of word elements would be a tedious task if we were to do it by hand. Luckily, NLTK has tokenization functions that do this for us, breaking text files down into sentences and words. The word tokenizer splits mainly on whitespace and punctuation, while the sentence tokenizer primarily analyzes punctuation. NLTK tokenization can be done with a single line of code in Python, and the process yields a list of our divided tokenized elements.
Another one of these necessary cleaning procedures is the removal of stop words. Stop words are words that don’t necessarily generate a lot of value and thus can be removed in some instances so that their sheer frequency does not skew the dataset. Words such as ‘and’, ‘the’, ‘I’, ‘me’, and ‘had’ are included in this list. In a bag of words representation this practice is very useful, since words are treated as independent of one another. However, removing stop words for sentiment analysis is sometimes bothersome, as it may mess with the context of the sentence, so this practice must be approached on a case-by-case basis. The NLTK library has a designated list of stop words, and the user can also designate their own. To eliminate stop words with NLTK in Python, we simply loop through our list of tokenized words and remove those that are present in the NLTK stop word list. Here’s the nifty source code.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from string import punctuation
# Treat NLTK's English stop words and punctuation marks as noise
stop_words = set(stopwords.words('english') + list(punctuation))
all_words = word_tokenize("I had a sandwich and the cat ate it.")  # your tokenized text goes here
all_words_wo_stop = [w for w in all_words if w not in stop_words]
Stemming is also useful for this bag of words representation. It takes the various forms of words and returns them to their basic forms. Words such as ‘swims’ or ‘swimming’ will be transformed to ‘swim’ with the use of stemming (irregular forms like ‘swam’, however, slip past a rule-based stemmer like Porter’s). This allows for simplification for ease of processing within our bag of words model. The less complex the dataset is, the easier it is for our algorithms and classifiers to analyze. Another chunk of source code:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# We're gonna make all these weird forms into a single unified one!
ps = PorterStemmer()
sample = ["swim", "swimmer", "swimming", "swam", "swimmingly"]
for w in sample:
    print(ps.stem(w))

new_sample = "He swims swimmingly in the swimming pool because he is a swimming swimmer."
words = word_tokenize(new_sample)
for w in words:
    print(ps.stem(w))
Other functions of NLTK include part-of-speech tagging, which labels each tokenized word with its part of speech, and chunking, which lets the user designate a pattern of tags that will be grouped together. For example, if I designate a chunk to be <noun><verb><noun>, NLTK will pull together all phrases that follow this order (such as ‘I ate pizza’).
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("Where_Is_My_Son - 2019.txt")
sample_text = state_union.raw("Where_Is_My_Son - 2020.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    for i in tokenized:
        words = nltk.word_tokenize(i)
        tagged = nltk.pos_tag(words)
        print(tagged)

process_content()
'''
Here is the NLTK POS tag list! Here's how NLTK assigns the abbreviations:
CC coordinating conjunction
CD cardinal digit
DT determiner
EX existential there (like: "there is" ... think of it like "there exists")
FW foreign word
IN preposition/subordinating conjunction
JJ adjective 'big'
JJR adjective, comparative 'bigger'
JJS adjective, superlative 'biggest'
LS list marker 1)
MD modal could, will
NN noun, singular 'desk'
NNS noun plural 'desks'
NNP proper noun, singular 'Tai'
NNPS proper noun, plural 'Tais'
PDT predeterminer 'all the kids'
POS possessive ending parent's
PRP personal pronoun I, he, she
PRP$ possessive pronoun my, his, hers
RB adverb very, silently,
RBR adverb, comparative better
RBS adverb, superlative best
RP particle give up
TO to go 'to' the store.
UH interjection uhhhhhhhhhh
VB verb, base form take
VBD verb, past tense took
VBG verb, gerund/present participle taking
VBN verb, past participle taken
VBP verb, sing. present, non-3d take
VBZ verb, 3rd person sing. present takes
WDT wh-determiner which
WP wh-pronoun who, what
WP$ possessive wh-pronoun whose
WRB wh-adverb where, when
'''
These processes eliminate the need for us to endure the horrid task of archaic and tedious data cleaning (most of the time). Furthermore, these fast, fixed functions don’t suffer from human error.
However, a drawback of these functions is that they don’t have the conscious decision-making ability of humans. They cannot analyze inputs on a case-by-case basis and simply apply a blanket assessment to all of them. Sarcasm, stop words that actually hold meaning in a sentence: these things cannot be spotted by NLTK, and thus there is room for a little bit of error. Although NLTK is fast and reliable, there will be certain edge cases in which its accuracy may not be completely on point. It is by no means a perfect library (but then, few things are).