One of the most helpful and powerful libraries I’ve used so far in this process has been NLTK, which contains a bunch of modules for natural language processing. Below, I’ll outline a few basic uses.
In order to maximize consistency when processing a text dataset, it is vital to pre-process and clean it adequately so that when it is fed into our classifier, there are no inconsistencies or potentially algorithm-breaking bits that ruin our runs. Cleaning up a dataset by hand, going word by word, would be an absolutely hellish job, but thankfully we have trusty ol’ NLTK.
Let’s examine NLTK for the purpose of creating something like a bag of words representation. A bag of words is a representation in which the individual words of our text dataset are split into their own elements. Oftentimes, a bag of words representation will disregard grammar and the order of words. Although it seems a little clunky, it is very useful with machine learning algorithms and classifiers, as those algorithms often have issues with text-based data. Our bag of words representation allows us to convert these basic elements of text into vectors of numbers, which algorithms and classifiers are very happy to take in.
Breaking a massive text file down into a bunch of word elements would be a tedious task if we were to do it by hand. Luckily, NLTK has tokenization functions that do this for us, breaking text files down into sentences and words. The word tokenizer splits mainly on whitespace and punctuation, while the sentence tokenizer primarily analyzes punctuation. NLTK tokenization can be done with a single line of code in Python, and the process yields a list of our divided tokenized elements.
Another one of these necessary cleaning procedures is the removal of stop words. Stop words are words that don’t necessarily generate a lot of value and thus can be removed in some instances so that their sheer frequency does not skew the dataset. Words such as ‘and’, ‘the’, ‘I’, ‘me’, and ‘had’ are included in this list. In a bag of words representation this practice is very useful, since words are treated as independent of one another. However, removing stop words for sentiment analysis is sometimes bothersome, as it may mess with the context of the sentence, so this practice must be approached on a case-by-case basis. The NLTK library has a designated list of stop words, and the user can also designate their own. To eliminate stop words with NLTK in Python, we simply loop through our list of tokenized words and remove those that are present in the NLTK stop word list. Here’s the nifty source code.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from string import punctuation
# Treat NLTK's English stop words and punctuation marks as noise
stop_words = set(stopwords.words('english') + list(punctuation))
all_words = word_tokenize("I had a sandwich and the cat ate it.")  # your tokenized text goes here
all_words_wo_stop = [w for w in all_words if w not in stop_words]
Stemming is also useful for this bag of words representation. It takes the various forms of words and returns them to their basic forms. Words such as ‘swims’ or ‘swimming’ will be transformed to ‘swim’ with the use of stemming (irregular forms like ‘swam’, however, slip past a rule-based stemmer like Porter’s). This allows for simplification for ease of processing within our bag of words model. The less complex the dataset is, the easier it is for our algorithms and classifiers to analyze. Another chunk of source code:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# We're gonna make all these weird forms into a single unified one!
ps = PorterStemmer()
sample = ["swim", "swimmer", "swimming", "swam", "swimmingly"]
for w in sample:
    print(ps.stem(w))

new_sample = "He swims swimmingly in the swimming pool because he is a swimming swimmer."
words = word_tokenize(new_sample)
for w in words:
    print(ps.stem(w))
Other functions of NLTK include part-of-speech tagging, which labels each tokenized word with its part of speech, and chunking, which lets the user designate a pattern of tags that will be grouped together. For example, if I designate a chunk to be <noun><verb><noun>, NLTK will pull together all phrases that follow this order (such as ‘I ate pizza’).
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("Where_Is_My_Son - 2019.txt")
sample_text = state_union.raw("Where_Is_My_Son - 2020.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

def process_content():
    for i in tokenized:
        words = nltk.word_tokenize(i)
        tagged = nltk.pos_tag(words)
        print(tagged)

process_content()
'''
Here is the NLTK POS tag list! Here's how NLTK assigns the abbreviations:
CC coordinating conjunction
CD cardinal digit
DT determiner
EX existential there (like: "there is" ... think of it like "there exists")
FW foreign word
IN preposition/subordinating conjunction
JJ adjective 'big'
JJR adjective, comparative 'bigger'
JJS adjective, superlative 'biggest'
LS list marker 1)
MD modal could, will
NN noun, singular 'desk'
NNS noun plural 'desks'
NNP proper noun, singular 'Tai'
NNPS proper noun, plural 'Tais'
PDT predeterminer 'all the kids'
POS possessive ending parent's
PRP personal pronoun I, he, she
PRP$ possessive pronoun my, his, hers
RB adverb very, silently,
RBR adverb, comparative better
RBS adverb, superlative best
RP particle give up
TO to go 'to' the store.
UH interjection uhhhhhhhhhh
VB verb, base form take
VBD verb, past tense took
VBG verb, gerund/present participle taking
VBN verb, past participle taken
VBP verb, sing. present, non-3d take
VBZ verb, 3rd person sing. present takes
WDT wh-determiner which
WP wh-pronoun who, what
WP$ possessive wh-pronoun whose
WRB wh-adverb where, when
'''
These processes eliminate the need for us to endure the horrid task of archaic and tedious data cleaning (most of the time). Furthermore, these fast, fixed functions don’t suffer from human error.
However, a drawback of these functions is that they don’t have the conscious decision-making ability of humans. They cannot analyze inputs on a case-by-case basis and simply apply a blanket assessment to all of them. Sarcasm, stop words that actually hold meaning in a sentence: these things cannot be spotted by NLTK, and thus there is room for a little bit of error. Although NLTK is fast and reliable, there will be certain edge cases in which its accuracy may not be completely on point. It is by no means a perfect library (but then, few things are).