The past three weeks of this process have been spent mostly working with some foundational natural language processing and sentiment analysis bits and pieces. A couple of the modules and libraries I've become more familiar with offer extremely powerful features, while a couple of others I tried are clunkier and less accurate.
At a base level, one of the most important contributors to the overall accuracy of an analysis is of course the algorithm used in the analysis. I tested a variety of classification algorithms on a labeled dataset of movie reviews, comparing the classification accuracy of each. The purpose of a classification algorithm is to correctly predict whether our dataset holds words that skew more positive or more negative. By altering the way in which we calculate these outputs, we see different accuracy results.
To verify how each classifier was doing in terms of accuracy, I took a dataset of 3000 positive and 3000 negative movie reviews that came packaged with positive and negative labels (my training data) and trained my classifier on those reviews. I then scrambled the words of the movie reviews (my testing data) and fed them back into the trained classifiers. Since everything was already labeled with positive and negative, this allowed me to generate an accuracy score for each classifier. I’ll outline a couple of algorithms and their effectiveness below:
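Under some simplifying assumptions (each review reduced to a list of words, and a classifier being any function from a word list to a 'pos'/'neg' label; the names here are placeholders, not NLTK's API), the shuffle-and-score loop can be sketched as:

```python
import random

def evaluate(classifier, labeled_docs, split=0.5):
    """Shuffle the labeled documents, hold out the back half as a
    testing set, and return the fraction the classifier gets right."""
    docs = list(labeled_docs)
    random.shuffle(docs)
    test = docs[int(len(docs) * split):]
    correct = sum(1 for words, label in test if classifier(words) == label)
    return correct / len(test)

# Toy stand-in classifier: call it positive if the word 'good' appears
toy = lambda words: "pos" if "good" in words else "neg"
data = [(["good", "film"], "pos"), (["bad", "plot"], "neg")] * 50
print(evaluate(toy, data))  # 1.0 on this toy data
```

Because the held-out half is reshuffled on every run, the score moves around from run to run, which matters later on.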
Starting off the pack is Naive Bayes. This guy is a pretty simple one, and although it does not utilize complicated mathematical procedures or complex matching, it is quick to implement and scales well with larger datasets. Since it is relatively simple, I’ll outline it below.
Naive Bayes is based on Bayes' Theorem, a simple equation used to calculate the probability that an event occurs given a certain piece of evidence (conditional probability). An example of this is the probability that a card drawn from a deck is a king given that the card drawn is a face card. Here is Bayes' Theorem, where P(H | E) is defined to be the probability of H given E:

P(H | E) = P(E | H) × P(H) / P(E)
Applying our example with the probability of drawing a king given a face card, we have:

P(King | Face) = P(Face | King) × P(King) / P(Face)
We know that the probability of drawing a face card given that the card is a king is 100%. We also know that the probability that we draw a face card is 12/52 = 3/13 (4 queens, 4 jacks, 4 kings), and we know that the probability we draw a king is 4/52 = 1/13. Simplifying that equation out, we have:

P(King | Face) = 1 × (1/13) / (3/13) = 1/3
And there you go! Pretty easy way to figure it out. The proof of this equation is also just a Google search away and fairly digestible.
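The card example translates directly into a couple of lines of Python:

```python
# Bayes' Theorem for the card example: P(King | Face)
p_face_given_king = 1.0   # every king is a face card
p_king = 4 / 52           # four kings in a deck
p_face = 12 / 52          # jacks, queens, and kings

p_king_given_face = p_face_given_king * p_king / p_face
print(p_king_given_face)  # 0.333... = 1/3
```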
Now we have this lovely tool for figuring out probabilities. Perhaps we can throw this guy directly at whether our sentences are positive or negative? Not quite yet. The issue with Bayes' Theorem is that it cannot handle more than one single event (in our example, the single event was drawing a king given a face card). If we want to figure out whether a sentence is positive or negative based on whether the words composing the sentence are positive or negative, we run into this issue. To overcome it, we use Naive Bayes, an extension of Bayes' Theorem that is able to take in multiple events under the assumption that all events are independent of one another. Here is how the math looks:

P(H | E1, …, En) = P(E1 | H) × … × P(En | H) × P(H) / (P(E1) × … × P(En))
It's a little funky looking when represented in that manner, but all that equation is saying is that we now multiply each independent conditional probability together and also divide by the product of the probabilities of those independent events. In words it's a convoluted mess, so let's look at an example.
Say I want to figure out the probability that a chunk of a paragraph is negative given the words 'terrible', 'stupid', and 'unacceptable'. Our Naive Bayes equation would look like this:

P(Negative | 'terrible', 'stupid', 'unacceptable') = P('terrible' | Negative) × P('stupid' | Negative) × P('unacceptable' | Negative) × P(Negative) / (P('terrible') × P('stupid') × P('unacceptable'))
It's essentially our earlier Bayes' Theorem, except we multiply out all those different conditional probabilities and divide by all those event probabilities. Let's assign some arbitrary values to these probabilities so we can do some calculations with them:

P(Negative) = 0.5
P('terrible' | Negative) = 0.04, P('stupid' | Negative) = 0.02, P('unacceptable' | Negative) = 0.01
P('terrible') = 0.08, P('stupid') = 0.06, P('unacceptable') = 0.05

Putting this into our equation, we get:

P(Negative | 'terrible', 'stupid', 'unacceptable') = (0.04 × 0.02 × 0.01 × 0.5) / (0.08 × 0.06 × 0.05) ≈ 0.017
We got a pretty low number (admittedly because I chose bad arbitrary values to assign), so we can assume, based on those words alone, that the probability the paragraph is negative is relatively low. Apply that algorithm to a bunch of words and a bunch of sentences/paragraphs and you've got yourself a positive/negative sentiment analysis.
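That arithmetic is easy to sanity-check in code; the probability values below are made-up stand-ins, just like the arbitrary values in the text:

```python
# Made-up probabilities for the three words and the Negative class
p_neg = 0.5
p_word_given_neg = {"terrible": 0.04, "stupid": 0.02, "unacceptable": 0.01}
p_word = {"terrible": 0.08, "stupid": 0.06, "unacceptable": 0.05}

numerator = p_neg
denominator = 1.0
for w in ("terrible", "stupid", "unacceptable"):
    numerator *= p_word_given_neg[w]   # product of conditionals, times the prior
    denominator *= p_word[w]           # product of the event probabilities

print(numerator / denominator)  # ~0.017: probably not a negative paragraph
```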
Now Naive Bayes is also sometimes called Idiot’s Bayes because of this independent event assumption. We just assume that all of these conditionals do not have any effect on one another: They are just terms to be multiplied, no further examination needed. Naive Bayes does its job quick and easy. There are other algorithms that use different calculation methods to get the job done in a perhaps more accurate way.
Some examples of these more complex algorithms include the Multinomial Naive Bayes classifier, which takes the Naive Bayes algorithm and adds some extra shine in the form of Laplace smoothing, logarithmic sums, and other big words that probably don't mean all too much at this point in time. We also have a classifier called a Support Vector Machine, which splits our dataset between positive and negative with the hyperplane that best distinctly separates our data points.
Utilizing the nifty prepackaged algorithms from the NLTK library, I ran an accuracy test on that movie reviews dataset with a handful of algorithms. Here is what they came out to:
Naive Bayes: 94% (This guy is prone to fluctuation and this just happened to be a good run)
Multinomial Naive Bayes: 82%
Bernoulli Naive Bayes: 82%
Logistic Regression Classifier: 85%
Stochastic Gradient Descent Classifier: 81%
Support Vector Machine Classifier: 88%
Linear Support Vector Machine Classifier: 81%
Nu Support Vector Machine Classifier: 86%
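As a rough sketch of how such a comparison is wired up: NLTK exposes these models through its SklearnClassifier wrapper, but the same idea can be shown with scikit-learn directly. The four-review toy corpus and the query sentence below are invented for illustration, not the actual movie reviews data.

```python
# Miniature version of the classifier comparison
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

train_texts = [
    "great wonderful fantastic film",
    "terrible stupid unacceptable plot",
    "wonderful acting great story",
    "stupid terrible boring mess",
]
train_labels = ["pos", "neg", "pos", "neg"]

# Bag-of-words counts stand in for NLTK's feature dicts
vec = CountVectorizer()
X = vec.fit_transform(train_texts)

for model in (MultinomialNB(), LinearSVC()):
    model.fit(X, train_labels)
    pred = model.predict(vec.transform(["great fantastic story"]))[0]
    print(type(model).__name__, "->", pred)
```

In the real comparison, each trained model would be scored against the held-out testing set rather than eyeballed on a single sentence.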
Funnily enough, trusty ol' Naive Bayes achieved the best results, although running it repeatedly produced accuracy that fluctuated quite a bit (79–94%; the run above just happened to land high). This fluctuation occurs because of how I created my testing set: I took a document of text, shuffled the words about, and took the second half of it. The shuffling means the testing set differs every time, and thus so does accuracy.

In order to get a more consistent and accurate output, I implemented a 'voting' system. By collecting the output of each of these classifiers (positive/negative) into a list and then choosing the mode, I could get a result that was much better than any model alone. For example, if for the analysis of a block of text Naive Bayes, Multinomial Naive Bayes, Stochastic Gradient Descent, Bernoulli Naive Bayes, and Linear Support Vector Machine all output 'Negative' while Support Vector Machine and Nu Support Vector Machine both output 'Positive', then since 'Negative' is output more overall, my final output for this analysis will be 'Negative'. This 'voting' of sorts aggregates all of the models' results and thus gives more consistency.
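The voting scheme can be sketched in a few lines; the list of predictions here is hypothetical, standing in for the outputs of the seven trained classifiers:

```python
from collections import Counter

def vote(predictions):
    """Majority vote over classifier outputs, plus the fraction of
    classifiers that agreed with the winning label."""
    counts = Counter(predictions)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(predictions)

# The example from the text: five classifiers say negative, two say positive
preds = ["neg", "neg", "neg", "neg", "neg", "pos", "pos"]
print(vote(preds))  # ('neg', 0.714...)
```

Returning the agreement fraction alongside the label is a cheap way to expose how confident the ensemble was in each call.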
This is not the only way to optimize the results of sentiment analysis. Each of these models' parameters can be tweaked through NLTK to create a much more accurate model fitted to one's own circumstances. These tools are extremely powerful, accessible, and customizable for the average user who wishes to try their hand. One just has to be willing to sift through the seemingly endless guides, videos, and documentation and let the magic happen.