Data-Cleaning

From what I’ve read, heard, and learned, data cleaning is one of the most important dataset pre-processing steps out there. At the end of the day, we are feeding a mish-mash of text into a bunch of algorithms that sort through these words without regard for any external considerations or precautions. These algorithms do their work in silence, without complaint or confusion: they simply chug away at whatever is fed into their mouths and spit out some result. Unlike humans, they cannot operate on a circumstantial basis and react to different situations on the fly. Our algorithms have a singular job that they do quite quickly and quite well, but outside of that rigid structure they become confused and do not adapt.

Thus, to make life easier for our friends, we make sure the data we feed them is nice and homogeneous. For this first run of text sentiment classification, I am using the bag-of-words representation discussed earlier in the Naive Bayes post. In this circumstance, I have opted to tokenize our texts, switch them all to lower case, remove all punctuation, remove any remaining non-alphabetic characters, remove all stopwords, lemmatize our dataset, and then remove another layer of text-specific stopwords after we run our first pass. Let’s run through each step real quick.

Tokenization: We read in our dataset and apply a word tokenizer to it, which splits the text into individual words and punctuation marks. This begins our bag-of-words representation.
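To get a feel for what the tokenizer produces, here is a rough standard-library stand-in (NLTK’s `word_tokenize` handles far more edge cases; the regex here just grabs runs of word characters or single punctuation marks):

```python
import re

def simple_tokenize(text):
    # crude approximation of word tokenization:
    # match runs of word characters, or any single punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
```

Note that punctuation marks come out as their own tokens, which is why the later punctuation-removal step is needed.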

Swap to Lower Case: This creates consistency and removes the issue of capitalization on proper nouns and on words at the start of sentences. All words are now in the same case!

Removal of Punctuation: Our algorithms don’t really need punctuation as they just need to process the words.
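In Python, one pass of `str.translate` with a deletion table handles this; this is the same `str.maketrans` idiom used in the full code further down:

```python
import string

# build a translation table that deletes every punctuation character
table = str.maketrans('', '', string.punctuation)
print("don't stop!".translate(table))  # dont stop
```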

Removal of Stopwords: Using NLTK’s standard stopword library, as well as a couple of word additions of our own, we take out all the ‘useless’ words that carry little meaning from our bag of words.
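In miniature, this step is just a set-membership filter; the tiny `stop_words` set below is a stand-in for NLTK’s full English list plus our own additions:

```python
# tiny stand-in for NLTK's English stopword list
stop_words = {'the', 'a', 'of', 'and', 'to', 'in'}

words = ['the', 'ship', 'drifted', 'to', 'the', 'edge', 'of', 'space']
filtered = [w for w in words if w not in stop_words]
print(filtered)  # ['ship', 'drifted', 'edge', 'space']
```

Using a `set` rather than a list matters here: membership checks are constant-time, which adds up when filtering hundreds of thousands of tokens.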

Lemmatization: Reduces all the various inflections of words to a single form. For example, lemmatizing ‘geese’ yields ‘goose’. This makes it much easier on our algorithms, as they don’t have to deal with a ton of different forms of the same word.
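A toy version of the idea, assuming a small hand-built exception table (NLTK’s `WordNetLemmatizer`, used in the full code below, consults WordNet’s morphology data instead of anything this crude):

```python
# hand-built irregular forms; WordNet covers far more of these
IRREGULAR = {'geese': 'goose', 'men': 'man', 'feet': 'foot'}

def toy_lemmatize(word):
    if word in IRREGULAR:
        return IRREGULAR[word]
    # crude regular-plural stripping: hands -> hand, but glass stays glass
    if word.endswith('s') and not word.endswith('ss'):
        return word[:-1]
    return word

print(toy_lemmatize('geese'))  # goose
print(toy_lemmatize('hands'))  # hand
```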

Here is how the code looks:


import io
import string

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# our own additions to NLTK's standard stopword list go here
more_stop_words = []


def process_workflow(text_in):
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text_in)
    # convert to lower case
    tokens = [w.lower() for w in tokens]
    # remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in tokens]
    # remove remaining non-alphabetic words
    words = [word for word in stripped if word.isalpha()]
    # NLTK's standard stopwords plus our own additions
    stop_words = set(stopwords.words('english') + more_stop_words)
    cleaned = [w for w in words if w not in stop_words]
    # collapse inflected forms to a single lemma
    super_clean = [lemmatizer.lemmatize(word) for word in cleaned]
    # append the cleaned words to the running output file
    with io.open("D:\\Pomona\\Independent Research\\dataset\\all_book_cleaned.txt",
                 'a', encoding='utf8') as f:
        f.write(' '.join(super_clean) + '\n')

Now we have completed the first phase of cleaning our dataset. Our second phase points itself towards a more literature-specific step. Some words, although not common enough in general speech or text to make standard stopword lists, are quite prevalent in literature and science fiction. We need to get those out of the way so our algorithms have an easier job. To do this, we tally up the 100 most common words across our entire array of texts. Here’s how that looks:


import collections
import pickle


def find_stop_word(text_in):
    # text_in is a list of books, each a list of words
    word_counts = collections.Counter(word for words in text_in for word in words)
    return word_counts.most_common(100)


books_in = open("D:\\Pomona\\Independent Research\\"
                "sentiment_analysis.py\\book_array_cleaned_stop.pickle", "rb")
book_array = pickle.load(books_in)
books_in.close()
print(find_stop_word(book_array))

After running this, here is the list of (word, number of occurrences) pairs that came up:

[('said', 28020), ('one', 20633), ('would', 18248), ('could', 15888), ('know', 12290), ('time', 11929), ('like', 11830), ('u', 10256), ('back', 9709), ('way', 8612), ('even', 8605), ('see', 8317), ('get', 7923), ('thing', 7597), ('thought', 6913), ('think', 6845), ('first', 6703), ('man', 6583), ('hand', 6429), ('go', 6416), ('right', 6330), ('well', 6321), ('little', 6255), ('two', 6199), ('never', 6105), ('say', 6024), ('day', 5893), ('still', 5871), ('come', 5868), ('make', 5795), ('people', 5753), ('made', 5594), ('much', 5427), ('eye', 5401), ('long', 5399), ('got', 5381), ('ender', 5148), ('world', 5012), ('might', 4999), ('looked', 4990), ('something', 4964), ('came', 4955), ('going', 4943), ('knew', 4939), ('away', 4928), ('around', 4916), ('take', 4777), ('human', 4727), ('year', 4722), ('life', 4710), ('good', 4688), ('want', 4681), ('must', 4652), ('let', 4611), ('face', 4516), ('look', 4435), ('asked', 4339), ('nothing', 4290), ('old', 4287), ('work', 4096), ('tell', 4096), ('head', 4077), ('new', 4005), ('another', 3997), ('enough', 3931), ('place', 3863), ('every', 3844), ('mind', 3795), ('went', 3706), ('many', 3705), ('yes', 3701), ('left', 3640), ('saw', 3624), ('last', 3601), ('great', 3590), ('without', 3506), ('took', 3496), ('word', 3493), ('three', 3441), ('room', 3433), ('turned', 3409), ('men', 3389), ('ever', 3379), ('though', 3368), ('upon', 3364), ('course', 3363), ('ship', 3305), ('light', 3297), ('yet', 3258), ('mean', 3238), ('voice', 3236), ('found', 3224), ('moment', 3142), ('far', 3124), ('almost', 3114), ('seemed', 3075), ('door', 3038), ('anything', 3034), ('told', 3030)]

Some interesting results here! ‘said’ obviously appears quite a lot, while ‘ender’ gets quite a bit of screen time, considering we do have two Orson Scott Card books in here that together run a couple thousand pages. I later ran another analysis, and the word ‘woman’ appeared 2950 times, versus the 3389 times ‘men’ appeared. Curiously, ‘woman’ appears at a high frequency in its singular form, while conversely ‘men’ comes about in its plural form. Make of that what you will.

Looking through some of those words, I took the ‘useless’ ones and added them to our stopword list. I then re-ran the earlier data-cleaning code, and those words have now been pared from our dataset to simplify it even further. Now we can just mess around with a few more tweaks before moving on to running our classifiers on this dataset! Time for some answers.
