Naive Bayes

The Naive Bayes algorithm is one of the most important algorithms for text classification in NLP. It's based on Bayes rule and treats the given documents as a "bag of words". 
In the bag-of-words representation of a document, the algorithm may consider all of the words in the document or only a subset of them. Either way, the bag-of-words representation loses the order of the words within the document; all that remains is the set of words and their counts.
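To make the bag-of-words idea concrete, here is a minimal Python sketch (standard library only); the lowercasing and whitespace tokenization are simplifying assumptions made to keep the example short.

    from collections import Counter

    def bag_of_words(document):
        # Word order is discarded; only the words and their counts remain.
        tokens = document.lower().split()  # naive whitespace tokenization (assumption)
        return Counter(tokens)

    # Documents that differ only in word order map to the same bag of words.
    bag_of_words("great movie great acting")
    # Counter({'great': 2, 'movie': 1, 'acting': 1})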

With Bayes rule, for a document d and a class c, our goal is to compute the probability of each class given the document, P(c|d), and then pick the class with the highest probability as the classification of that document.

Bayes Rule

P(c|d) = P(d|c) * P(c) / P(d)

The above formula reads: the probability of a class given a document is the probability of the document given the class, times the probability of the class, divided by the probability of the document.
We can simplify the equation by dropping the denominator, the probability of the document, P(d). This can be done because we are calculating P(c|d) for all classes and comparing them; P(d) would be the same constant in each of them and therefore wouldn't change the outcome.

Simplified Bayes Rule

P(c|d) ∝ P(d|c) * P(c)

The above is the equation that we need to compute for all classes, picking the class with the highest probability. So to formalize, the most likely class, CMAP, is the class which maximizes the probability of the document given the class, P(d|c), a.k.a. the likelihood, times the probability of the class, P(c), a.k.a. the prior.

Naive Bayes

CMAP = argmax_c P(d|c) * P(c)

Computing the probability of a class, P(c), is relatively simple. It is basically asking how often this class occurs. So in our example of sentiment analysis, you could ask: are positive reviews much more common than negative and neutral ones?
This can be done by simply counting the relative frequency of that class in some known, comparable corpus or dataset (the training data).
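As a rough sketch, estimating the prior P(c) from a list of training labels might look like this in Python (the sentiment labels are made up for illustration):

    from collections import Counter

    def class_priors(labels):
        # P(c) is estimated as the relative frequency of class c in the training labels.
        counts = Counter(labels)
        total = len(labels)
        return {c: n / total for c, n in counts.items()}

    labels = ["positive", "positive", "negative", "neutral", "positive"]
    class_priors(labels)
    # {'positive': 0.6, 'negative': 0.2, 'neutral': 0.2}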

Computing the probability of a document given a class, P(d|c), is a bit more involved. To do this, we represent the document via a whole set of features, x1, x2, x3, ..., xn, where these features are words and their counts. So when we say the probability of a document given a class, P(d|c), what we really mean is the joint probability of a vector of features given the class, as shown below.

Joint Probabilities

P(d|c) = P(x1, x2, x3, ..., xn | c)

It's worth noting that in Naive Bayes, we make the following two assumptions in order to efficiently compute P(d|c), shown above.
  • Position of the word isn't important (bag of words)
  • The feature probabilities, P(xi|cj), are independent given the class c (conditional independence)
To formalize, the best class under the Naive Bayes assumptions, CNB, in text classification is as follows.

Naive Bayes for Text Classification

CNB = argmax_cj P(cj) * Π P(xi|cj), with the product taken over all features (words) xi in the document

So to compute this, for each class, cj, we first get the probability of that class, P(cj), and multiply it by the product of the probabilities of every single word, or feature, in the document given the class. So for class 1, c1, you compute the probability as follows.

Class 1 Probability

P(c1) * P(x1|c1) * P(x2|c1) * ... * P(xn|c1)

You then compute the probability for the other classes in the same fashion, and pick the class with the highest probability as the answer, CNB.
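Putting the pieces together, a minimal Python sketch of this decision rule might look like the following. The word likelihoods P(xi|c) are assumed to have been estimated beforehand (for example, from word counts in the training data), and words that were never seen with a class are simply skipped here, which is a simplification to keep the example short.

    def classify(document, priors, word_likelihoods):
        # priors:           dict mapping class -> P(c)
        # word_likelihoods: dict mapping class -> {word: P(word|class)}
        tokens = document.lower().split()
        best_class, best_score = None, 0.0
        for c, prior in priors.items():
            score = prior                        # start with the prior, P(c)
            for word in tokens:
                if word in word_likelihoods[c]:  # unseen words skipped (simplification)
                    score *= word_likelihoods[c][word]   # multiply in P(word|c)
            if best_class is None or score > best_score:
                best_class, best_score = c, score
        return best_class                        # the class with the highest probability, CNB

In practice, multiplying many small probabilities underflows, so implementations usually sum log probabilities instead; the raw multiplication here simply mirrors the equations above.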