Naive Bayes Classification Simple Explanation


A Naive Bayes classifier is a very simple tool in the data mining toolkit.  Think of it as using your past knowledge to mentally ask "How likely is X? How likely is Y?" and so on.

What Is Naive Bayes Classification?

Naive Bayes is one of the easiest classification algorithms to implement.  Given some set of possible classes (e.g. child, adult, senior or G, PG, PG-13, R), you use the patterns inside those classes to label new, unclassified data.

Its biggest benefit is that you only need to count how often each variable's distinct values occur for each class.  So if you have five variables averaging three distinct values each and two possible classes, that's 5 x 3 x 2 = 30 counts you need to store.

You’ll also need the “prior probability” for each class.  Simply put, you count up all of the instances of each class and divide by the total of all instances.

Naive Bayes makes use of Bayes' Theorem but assumes that all variables in the model are independent of each other.  This is convenient because it lets you multiply the individual conditional probabilities together instead of estimating the much harder joint conditional probabilities that full probability theory would require.
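In symbols, the naive assumption means the score for a class is simply the prior multiplied by each variable's conditional probability:

    P(class | x1, x2, …, xn)  ∝  P(class) × P(x1 | class) × P(x2 | class) × … × P(xn | class)

The class with the largest product is the predicted class.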
 
 

Calculating and Combining the Probabilities

If you can divide, you can do Naive Bayes.

Class       Visited Other Dept    Did Not Visit Other Dept    Came Alone    Did Not Come Alone    Total
Browser     200                   0                           75            125                   200
Easy Sale   150                   200                         250           100                   350
Big Sale    60                    60                          0             120                   120

Using the example data above, we can work out a few probabilities.

  • What’s the prior (i.e. overall) probability of the customer being a “Browser”?
    • 200 / (200 + 350 + 120) = 200 / 670 = 29.8507%
  • What’s the prior probability of the customer being a “Big Sale”?
    • 120 / (200 + 350 + 120) = 120 / 670 = 17.9104%
  • What’s the probability that the customer visited another department given that they are an “Easy Sale” customer?
    • P(Visited Other | Easy Sale)  = 150 / (150 + 200) = 150 / 350 = 42.8571%
  • What’s the probability that the customer did not come alone given that they are a “Big Sale” customer?
    • P(Did Not Come Alone | Big Sale) = 120 / (0 + 120) = 120 / 120 = 100%
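The counting steps above can be sketched in a few lines of Python (a minimal sketch; the dictionary names are just for illustration):

```python
# Counts per class from the example table: {class: {feature_value: count}}.
counts = {
    "Browser":   {"visited": 200, "not_visited": 0,   "alone": 75,  "not_alone": 125},
    "Easy Sale": {"visited": 150, "not_visited": 200, "alone": 250, "not_alone": 100},
    "Big Sale":  {"visited": 60,  "not_visited": 60,  "alone": 0,   "not_alone": 120},
}
totals = {"Browser": 200, "Easy Sale": 350, "Big Sale": 120}
grand_total = sum(totals.values())  # 670

# Prior probability of each class: class total / grand total.
priors = {c: totals[c] / grand_total for c in totals}

# Conditional probability of each value given the class: count / class total.
conditionals = {
    c: {v: n / totals[c] for v, n in row.items()} for c, row in counts.items()
}

print(round(priors["Browser"], 6))                     # → 0.298507
print(round(conditionals["Easy Sale"]["visited"], 6))  # → 0.428571
```

Each entry in `conditionals` corresponds to one cell in the percentage table below.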

The main point is that we need to calculate all of these “conditional probabilities” along with the prior probabilities for each class.  Here are the results for the example table.

CLASS       VISITED OTHER DEPT    DID NOT VISIT OTHER DEPT    CAME ALONE    DID NOT COME ALONE    PRIORS
Browser     100.000%              0.000%                      37.500%       62.500%               29.851%
Easy Sale   42.857%               57.143%                     71.429%       28.571%               52.239%
Big Sale    50.000%               50.000%                     0.000%        100.000%              17.910%

Using these results, we can now classify new data.  Let's say you want to classify a customer you've spotted: they came alone and visited another department.

Now we line up all of the needed probabilities and multiply them together (note: I'm using values rounded to three decimal places to make the example easier to follow).  Multiplying is valid here because of the naive assumption that the variables are independent of each other.

Are They…   VISITED OTHER DEPT    CAME ALONE    PRIORS
Browser     100.000%              37.500%       29.851%
Easy Sale   42.857%               71.429%       52.239%
Big Sale    50.000%               0.000%        17.910%
  • Results for Browser: 0.111941
  • Results for Easy Sale: 0.159916
  • Results for Big Sale: 0.000000

The best result is “Easy Sale”.  If you want to get an actual probability for each class you would use the result as the numerator and the sum of all of the results as the denominator.

  • Probability for Browser = 0.111941 / (0.111941 + 0.159916 + 0.0) = 41.2%
  • Probability for Easy Sale = 0.159916 / (0.111941 + 0.159916 + 0.0) = 58.8%
  • Probability for Big Sale = 0.000000 / (0.111941 + 0.159916 + 0.0) = 0.0%
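The scoring and normalizing steps above can be sketched like this (a minimal sketch using the rounded figures from the tables):

```python
# Rounded priors and conditionals for the spotted customer
# (visited another department, came alone).
priors = {"Browser": 0.29851, "Easy Sale": 0.52239, "Big Sale": 0.17910}
p_visited = {"Browser": 1.0, "Easy Sale": 0.42857, "Big Sale": 0.5}
p_alone = {"Browser": 0.375, "Easy Sale": 0.71429, "Big Sale": 0.0}

# Naive assumption: multiply the per-variable conditionals and the prior.
scores = {c: p_visited[c] * p_alone[c] * priors[c] for c in priors}

# The class with the highest score is the prediction.
prediction = max(scores, key=scores.get)  # → "Easy Sale"

# Normalize by the sum of all scores to get actual probabilities.
total = sum(scores.values())
probabilities = {c: s / total for c, s in scores.items()}
```

Note how the zero for "Big Sale" under "Came Alone" wipes out its whole score, which is exactly the problem Laplace smoothing addresses below.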

You should notice that "Big Sale" is unfairly given a 0% chance just because no "Big Sale" customer in the observed data ever came alone.  You can fix this by smoothing the counts using Laplace smoothing.

Laplace Smoothing

Also called add-one smoothing, Laplace smoothing literally adds one to the count of every combination of class and categorical value.  This helps because it prevents a single zero count from knocking out an entire class.  For example…

Class       Visited Other Dept    Did Not Visit Other Dept    Came Alone    Did Not Come Alone
Browser     200                   0                           75            125
Easy Sale   150                   200                         250           100
Big Sale    60                    60                          0             120

Now you can see that there are a couple of zeros.  We fill those gaps by adding one to every cell in the table.

Class       Visited Other Dept    Did Not Visit Other Dept    Came Alone    Did Not Come Alone
Browser     201                   1                           76            126
Easy Sale   151                   201                         251           101
Big Sale    61                    61                          1             121

Since we add one to all cells, the proportions stay essentially the same.  The more data you have, the smaller the impact the added one has on your model.
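Here is a minimal sketch of the smoothing step applied to the count table above:

```python
# Raw counts per class from the example table.
counts = {
    "Browser":   {"visited": 200, "not_visited": 0,   "alone": 75,  "not_alone": 125},
    "Easy Sale": {"visited": 150, "not_visited": 200, "alone": 250, "not_alone": 100},
    "Big Sale":  {"visited": 60,  "not_visited": 60,  "alone": 0,   "not_alone": 120},
}

# Add-one (Laplace) smoothing: add 1 to every cell before computing probabilities.
smoothed = {c: {v: n + 1 for v, n in row.items()} for c, row in counts.items()}

# The denominators grow too: each pair of complementary columns
# (e.g. alone / not_alone) now sums to the class total plus 2.
p_alone_big_sale = smoothed["Big Sale"]["alone"] / (
    smoothed["Big Sale"]["alone"] + smoothed["Big Sale"]["not_alone"]
)
# 1 / 122 ≈ 0.0082 — small, but no longer zero, so "Big Sale"
# is no longer knocked out entirely.
```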

Continuous (Numeric) Values in Naive Bayes

You have two options.

  1. “Discretize” the variables by binning / bucketing the numeric values into ranges.
  2. Use the Gaussian function to estimate the probability density of the continuous value.

Making continuous variables discrete, e.g. turning age into child, adult, and senior, just creates new variables and distinct values.  This is probably the easiest option.

Alternatively, you can convert your continuous variables into probability densities under the normal distribution.  Wikipedia has a well written section on this conversion.
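A minimal sketch of the Gaussian option, assuming a hypothetical class whose training data has a mean age of 35 and a standard deviation of 10 (these numbers are made up for illustration):

```python
import math

def gaussian_pdf(x, mean, std):
    """Normal probability density of x for the given mean and std."""
    exponent = -((x - mean) ** 2) / (2 * std ** 2)
    return math.exp(exponent) / (std * math.sqrt(2 * math.pi))

# Hypothetical example: how "typical" is a 40-year-old for a class
# whose ages have mean 35 and standard deviation 10?
density = gaussian_pdf(40, mean=35, std=10)

# This density takes the place of a conditional probability when
# multiplying the variables together in the Naive Bayes product.
```

One detail to keep in mind: a density is not a probability (it can even exceed 1 for small standard deviations), but since you only compare products across classes, it works the same way in the multiplication.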

Recommended Reading

Excellent example of Naive Bayes on Stackoverflow