PPOL564 | DS1 | Foundations

Building a Naive Bayes Classifier

Download the data used in this notebook from Dropbox.

Bayes Classifier

Let's use what we know of Bayes theorem to predict a binary outcome. If we think carefully, we were trying to isolate the probability that $y = 1$ given the values of our predictors.

$$Pr(y=1 | X)$$

Put more generically,

$$Pr(class | data) $$

That is, the predicted class is conditional on the predictors.

We can use Bayes theorem here to construct a classifier that uses conditional probability to predict a class.



$$Pr(class | data) = \frac{Pr(data | class) Pr(class)}{Pr(data)}$$



This calculation can be performed for more than one class. We just need to calculate this probability for each class in the problem. We then compare the probabilities across the candidate class assignments and choose the largest (i.e. maximize).

However, such a classifier is difficult to compute.

"The conditional probability of the observation based on the class $Pr(data|class)$ is not feasible unless the number of examples is extraordinarily large, e.g. large enough to effectively estimate the probability distribution for all different possible combinations of values. This is almost never the case, we will not have sufficient coverage of the domain." (Brownlee)

As the data grows (in either the number of predictors or the number of possible value combinations those predictors can take), the estimation task becomes more difficult.

Naive Bayes Classifier

We can greatly simplify the above equation by making a simple assumption: **that each variable is _independent_ of the other variables in the model, conditional on the class**.




$$Pr(class | data) = \frac{Pr( x_1| class)\times Pr( x_2| class) \times \dots \times Pr(class)}{Pr(data)}$$




where $x_1$ and $x_2$ represent variables in the data.



We can further simplify by dropping the denominator. Because $Pr(data)$ is a normalizing constant that does not depend on the class, removing it does not change which class receives the highest score.




$$Pr(class | data) \propto Pr( x_1| class)\times Pr( x_2| class) \times \dots \times Pr(class)$$




This is known as a Naive Bayesian Classifier, or Naive Bayes. The "naivety" stems from the simplifying independence assumption we impose on the original Bayesian setup.




Building a Naive Bayesian Classifier

Let's build a Naive Bayes classifier on a binary outcome ( $y \in \{0,1\}$ ) with binary predictor variables. Below is a dataset that tracks when a country enters into a civil war given its level of economic development and political regime type:
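A minimal loading sketch is below; the file names, and the column names used in the code throughout this section (`civil_war`, `developed`, `democracy`, each coded 0/1), are assumed stand-ins for the actual Dropbox files.

```python
# Load the training and test data. The file names are placeholders for the
# CSVs in the Dropbox download; adjust to match the actual files.
import pandas as pd

train = pd.read_csv("civil_war_train.csv")
test = pd.read_csv("civil_war_test.csv")
train.head()
```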

Calculate Class Probabilities: $Pr(class)$
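A sketch of this step, assuming the outcome column is named `civil_war`:

```python
# Pr(class): the share of observations in each outcome class.
pr_class = train["civil_war"].value_counts(normalize=True).to_dict()
pr_class  # e.g. {0: Pr(CW = 0), 1: Pr(CW = 1)}
```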

Calculate the Conditional Probabilities $Pr(data | class)$
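A sketch, assuming binary predictors named `developed` and `democracy`. Because each predictor is coded 0/1, its class-conditional mean is $Pr(x = 1 | class)$, and $Pr(x = 0 | class)$ is just one minus that value.

```python
# Pr(x = 1 | class): the mean of a 0/1 predictor within each outcome class.
pr_x_given_class = (
    train
    .groupby("civil_war")[["developed", "democracy"]]
    .mean()
)
pr_x_given_class
```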

Make a Prediction

Now we simply multiply together the probabilities for each outcome given some configuration of the variables.

For example, say we want to predict whether a country will enter into a civil war given that it is a developing democracy.
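A sketch of the calculation, assuming "developing democracy" corresponds to `developed = 0`, `democracy = 1` in the hypothetical columns above:

```python
# Multiply Pr(x | class) across the predictors, times Pr(class), for each class.
# "Developing democracy" is assumed to mean developed = 0, democracy = 1.
def nb_score(cw, developed, democracy):
    p_dev = pr_x_given_class.loc[cw, "developed"]   # Pr(developed = 1 | class)
    p_dem = pr_x_given_class.loc[cw, "democracy"]   # Pr(democracy = 1 | class)
    p1 = p_dev if developed == 1 else 1 - p_dev
    p2 = p_dem if democracy == 1 else 1 - p_dem
    return p1 * p2 * pr_class[cw]

nb_score(0, developed=0, democracy=1), nb_score(1, developed=0, democracy=1)
```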

$.124$ is greater than $.015$ so we predict "no civil war" ($CW = 0$).

Now, let's use the probabilities to predict if an authoritarian developing country will enter into a civil war.

$.173$ is greater than $.089$ so we'll predict "civil war" for this observation ($CW = 1$).

In essence, we do this for every observation in the data. The idea is that we leverage conditional probabilities in the data to predict future class assignments, assuming the data generating process behind the training data (i.e. the data we're learning on) is the same as the one behind the data we test on.

Predicting Multiple Observations

Let's now expand this setup so that we can calculate the underlying probabilities and then calculate the predictions for each observation in the data.
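A sketch of collecting the probabilities, using the same assumed column names as above:

```python
# Store Pr(class) and Pr(x = 1 | class) as dictionaries for easy lookup.
class_probs = train["civil_war"].value_counts(normalize=True).to_dict()
cond_probs = {
    x: train.groupby("civil_war")[x].mean().to_dict()  # {class: Pr(x = 1 | class)}
    for x in ["developed", "democracy"]
}
```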

Now we've conveniently stored all the probabilities as dictionaries. Let's build a prediction function that combs through the observations in the data, calculates the probabilities, and makes a class prediction.
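One way to write such a function (a sketch under the same assumptions):

```python
# For each row: multiply Pr(x | class) across the predictors, times Pr(class),
# and keep whichever class gets the larger product.
def predict(data, class_probs, cond_probs):
    preds = []
    for _, row in data.iterrows():
        scores = {}
        for cls, prior in class_probs.items():
            p = prior
            for x, probs in cond_probs.items():
                p_x1 = probs[cls]                       # Pr(x = 1 | class)
                p *= p_x1 if row[x] == 1 else 1 - p_x1
            scores[cls] = p
        preds.append(max(scores, key=scores.get))
    return pd.Series(preds, index=data.index)

train_preds = predict(train, class_probs, cond_probs)
```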

Finally, let's calculate predictive accuracy (i.e. how many correct predictions we made).
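A sketch of the accuracy calculation:

```python
# Accuracy: the share of predictions that match the observed outcome.
(train_preds == train["civil_war"]).mean()
```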

We obtained predictive accuracy of 76.3% on the training data, not bad!

Let's now try to predict the outcomes in the test data and see how we do.
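Using the same `predict()` function on the held-out data:

```python
# Apply the probabilities learned on the training data to the test set.
test_preds = predict(test, class_probs, cond_probs)
(test_preds == test["civil_war"]).mean()
```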

We have an out-of-sample predictive accuracy of 90%.

Though the Naive Bayes classifier is quite simplistic compared to other modeling strategies (such as a neural net or a gradient boosting machine), it proves to be effective on a wide array of prediction tasks.

Naive Bayesian Classifier with Continuous Predictors

The modeling strategy outlined above assumes that we have binary or discrete predictors (i.e. 0/1). However, what if that's not the case, and we have continuous features that we want to use in the prediction task?

For this exercise, let's use the same data as we did in the gradient descent lecture.

As we did in the gradient descent lecture, we need a way to map a continuous variable into a probability space. Here we'll use the probability density function of the Gaussian (normal) distribution to convert continuous values into probabilities.
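A sketch of the density function:

```python
import numpy as np

def gaussian_pdf(x, mu, sd):
    """Density of a normal distribution with mean `mu` and std. dev. `sd` at x."""
    return np.exp(-((x - mu) ** 2) / (2 * sd ** 2)) / (np.sqrt(2 * np.pi) * sd)
```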

Note that we can use information regarding the distribution of each continuous predictor to find out where any single point falls on that variable's probability distribution.

We can calculate the conditional mean and standard deviation for each value of the outcome and then calculate the predictions from there for any one of our continuous variables.

Let's go through the steps again.

Calculate Class Probabilities: $Pr(class)$

Calculate the conditional means/standard deviations
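A sketch of both steps, assuming the gradient descent data has been read into `train`/`test` DataFrames with a 0/1 outcome `y` and continuous predictors `x1` and `x2` (assumed names):

```python
# Pr(class), plus the class-conditional mean and standard deviation of each
# continuous predictor (column names are assumed).
features = ["x1", "x2"]
class_probs = train["y"].value_counts(normalize=True).to_dict()
cond_stats = train.groupby("y")[features].agg(["mean", "std"])
cond_stats
```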

View what these different conditional distributions look like:
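A plotting sketch using the objects defined above:

```python
import matplotlib.pyplot as plt

# Plot the class-conditional normal curves implied by the means/SDs above.
fig, axes = plt.subplots(1, len(features), figsize=(10, 4))
for ax, x in zip(axes, features):
    grid = np.linspace(train[x].min(), train[x].max(), 200)
    for cls in class_probs:
        mu = cond_stats.loc[cls, (x, "mean")]
        sd = cond_stats.loc[cls, (x, "std")]
        ax.plot(grid, gaussian_pdf(grid, mu, sd), label=f"y = {cls}")
    ax.set_title(x)
    ax.legend()
plt.show()
```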

Predict

Let's walk through predicting a single observation.
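A sketch using an arbitrary row from the test data:

```python
# Score one observation: density of each predictor under the class-conditional
# normal, times Pr(class); the larger score wins.
obs = test.iloc[0]
for cls, prior in class_probs.items():
    p = prior
    for x in features:
        p *= gaussian_pdf(obs[x],
                          cond_stats.loc[cls, (x, "mean")],
                          cond_stats.loc[cls, (x, "std")])
    print(f"class {cls}: {p:.4f}")
```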

4.55% is greater than 1.89%, so we predict that $y = 1$.

Predicting multiple observations
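A sketch of the prediction function for the continuous case:

```python
# Same logic as the earlier predict() function, but with normal densities in
# place of the discrete conditional probabilities.
def predict_gaussian(data, class_probs, cond_stats, features=("x1", "x2")):
    preds = []
    for _, row in data.iterrows():
        scores = {}
        for cls, prior in class_probs.items():
            p = prior
            for x in features:
                p *= gaussian_pdf(row[x],
                                  cond_stats.loc[cls, (x, "mean")],
                                  cond_stats.loc[cls, (x, "std")])
            scores[cls] = p
        preds.append(max(scores, key=scores.get))
    return pd.Series(preds, index=data.index)
```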

Examine the predictive accuracy of the training data.

Examine the predictive accuracy on the test data.

sklearn implementation
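A sketch with scikit-learn's `GaussianNB`, using the feature and outcome names assumed above:

```python
from sklearn.naive_bayes import GaussianNB

# Fit on the training data, then score (accuracy) on the test data.
model = GaussianNB()
model.fit(train[features], train["y"])
model.score(test[features], test["y"])
```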

Consider alternative performance metrics.
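A sketch of a few alternatives from `sklearn.metrics`:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

preds = model.predict(test[features])
print("accuracy :", accuracy_score(test["y"], preds))
print("precision:", precision_score(test["y"], preds))
print("recall   :", recall_score(test["y"], preds))
print("f1       :", f1_score(test["y"], preds))
print(confusion_matrix(test["y"], preds))
```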

Generate an ROC curve plot. First, we need the model to return predicted probabilities rather than class predictions.
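A sketch using `predict_proba()` and `sklearn.metrics.roc_curve`:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

# Predicted probability of the positive class (y = 1).
probs = model.predict_proba(test[features])[:, 1]
fpr, tpr, _ = roc_curve(test["y"], probs)

plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.title(f"ROC curve (AUC = {roc_auc_score(test['y'], probs):.3f})")
plt.show()
```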