import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.base import ClassifierMixin
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
from scipy.stats import norm
Personal notes about the Naive Bayes classifier and how to implement it from scratch
1. Introduction
Naive Bayes is a probabilistic model based on Bayes' theorem that can be used to perform classification tasks.
1.1. Bayes’ Theorem
Bayes' theorem links the conditional probabilities of two events \(A\) and \(B\). The conditional probability of \(A\) given \(B\) is defined as
\[ P(A | B) = \dfrac{ P(A \cap B) }{ P(B) }. \]
Similarly, the conditional probability of \(B\) given \(A\) is defined as
\[ P(B | A) = \dfrac{ P(B \cap A) }{ P(A) } = \dfrac{ P(A \cap B) }{ P(A) } \]
We can solve the second equation for \(P(A \cap B)\) and substitute it into the first one to obtain Bayes' theorem
\[ P(A | B) = \dfrac{ P(B | A) \cdot P(A) }{ P(B) } \]
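As a quick numerical sanity check (the joint distribution below is made up purely for illustration), we can verify that the definition of conditional probability and Bayes' theorem agree:
import numpy as np

# hypothetical joint distribution of two binary events A and B (arbitrary numbers)
# rows index A in {0, 1}, columns index B in {0, 1}; the entries sum to 1
joint = np.array([[0.10, 0.30],
                  [0.20, 0.40]])

p_a = joint[1, :].sum()        # P(A)
p_b = joint[:, 1].sum()        # P(B)
p_a_and_b = joint[1, 1]        # P(A and B)

p_a_given_b = p_a_and_b / p_b  # P(A | B) from the definition
p_b_given_a = p_a_and_b / p_a  # P(B | A) from the definition

# Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)
assert np.isclose(p_a_given_b, p_b_given_a * p_a / p_b)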
1.2. Classification using Bayes’ Theorem
We want to use Bayes' theorem to predict the class \(C\) of a vector of features \(X\), so we formulate the problem as
\[ P(C | X = (x_1, x_2, ..., x_n)) = \dfrac{ P(X = (x_1, x_2, ..., x_n) | C) \cdot P(C) }{ P(X) } \]
A vector of features \(X = (x_1, x_2, ..., x_n)\) whose class is unknown will be assigned the class that maximizes the probability \(P(C | X = (x_1, x_2, ..., x_n))\), that is,
\[ C = \underset{C}{\operatorname{argmax}} \ P(C | X = (x_1, x_2, ..., x_n)) \]
Since we are trying to find the value of \(C\) that maximizes \(P(C | X = (x_1, x_2, ..., x_n))\), and the marginal probability \(P(X)\) does not depend on \(C\), we can avoid computing it altogether:
\[ C = \underset{C}{\operatorname{argmax}} \ P(C | X = (x_1, x_2, ..., x_n)) = \underset{C}{\operatorname{argmax}} \ P(X = (x_1, x_2, ..., x_n) | C) \cdot P(C). \]
1.3. Naive assumption
To compute \(P(C)\) we can simply use the frequency of the class \(C\) in the data, that is, the number of samples belonging to \(C\) divided by the total number of samples. To compute \(P(X = (x_1, x_2, ..., x_n) | C)\) we would need to estimate the joint distribution of the features, which for binary features would mean estimating \(2^n - 1\) probabilities per class (minus one because the probabilities must sum to one); with \(n = 20\) binary features that is already more than a million values per class. One approach to simplify this computation is to assume the features are conditionally independent given the class, so the number of quantities to estimate becomes linear in \(n\):
\[ P(X = (x_1, x_2, ..., x_n) | C) = P(x_1 | C) \cdot P(x_2 | C) \cdot ... \cdot P(x_n | C) = \prod_{i=1}^{n} P(x_i | C). \]
1.4. Further optimizations
So far, we have obtained the following expression to compute the posterior probabilities
\[ P(C | X = (x_1, x_2, ..., x_n)) \propto P(x_1 | C) \cdot P(x_2 | C) \cdot ... \cdot P(x_n | C) \cdot P(C) = P(C) \cdot \prod_{i=1}^{n} P(x_i | C). \]
Since probabilities are real numbers between 0 and 1, the product of many of them can become so small that it underflows to zero in floating-point arithmetic. We can avoid this issue by taking the logarithm and working with log-probabilities
\[ \log(P(C | X = (x_1, x_2, ..., x_n))) \propto \log(P(x_1 | C) \cdot P(x_2 | C) \cdot ... \cdot P(x_n | C) \cdot P(C)). \]
One of the important properties of logarithms is \(\log(a \cdot b) = \log(a) + \log(b)\), so
\[ \log(P(C | X = (x_1, x_2, ..., x_n))) \propto \log(P(x_1 | C)) + \log(P(x_2 | C)) + ... + \log(P(x_n | C)) + \log(P(C)) = \log(P(C)) + \sum_{i=1}^{n} \log(P(x_i | C)). \]
A vector of features \(X\) will be assigned the class \(C\) that maximizes this log-probability:
\[ C = \underset{C}{\operatorname{argmax}} \ \log(P(C)) + \sum_{i=1}^{n} \log(P(x_i | C)). \]
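To see why working with log-probabilities matters numerically, here is a minimal illustration (the likelihood values are arbitrary): the product of many small probabilities underflows to exactly zero in double precision, while the sum of their logarithms remains a finite number we can still compare across classes.
import numpy as np

rng = np.random.default_rng(0)
small_probs = rng.uniform(1e-8, 1e-4, size=100)  # arbitrary per-feature likelihoods

product = np.prod(small_probs)       # underflows to 0.0 in float64
log_sum = np.log(small_probs).sum()  # a finite negative number, still usable for argmax

print(product)  # 0.0
print(log_sum)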
The last thing we need is to define \(P(x_i | C)\). Here we could plug in almost any probability distribution, but for this example we are going to use the Gaussian distribution:
\[ P(x_i | C) = \dfrac{ 1 }{ \sqrt{2 \pi \sigma_{C,i}^2} } \cdot \exp\left( -\dfrac{ (x_{i} - \mu_{C,i})^2 }{ 2 \sigma_{C,i}^2 } \right), \]
where \(\mu_{C,i}\) and \(\sigma_{C,i}\) are the mean and standard deviation of feature \(x_i\) for the samples of class \(C\), estimated from the training data.
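As a small check, the Gaussian log-density written out by hand should match scipy.stats.norm.logpdf, which is essentially what the implementation below relies on (the numbers here are arbitrary):
import numpy as np
from scipy.stats import norm

x = np.array([0.5, -1.2, 3.0])  # arbitrary feature values
mu, sigma = 0.8, 1.5            # arbitrary class-conditional mean and standard deviation

manual = -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)
scipy_logpdf = norm.logpdf(x, loc=mu, scale=sigma)

assert np.allclose(manual, scipy_logpdf)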
2. Coding Naive Bayes from scratch
We first generate a synthetic dataset with the help of Scikit-Learn to perform the classification on.
np.random.seed(0)
X, y = make_classification(n_samples=500, n_features=2, n_informative=2, n_redundant=0, n_classes=2, random_state=10)

fig, ax = plt.subplots(1, 1, figsize=(12, 6))
ax.scatter(X[:, 0], X[:, 1], c=y)
ax.grid(True, alpha=.3)
ax.set_title("Dataset", fontsize=24)
ax.set_xlabel("Feature $x_{1}$", fontsize=20)
ax.set_ylabel("Feature $x_{2}$", fontsize=20)
ax.tick_params(labelsize=20)
We are going to code the classifier following the Scikit-Learn fit-predict API.
class GaussianNaiveBayes(ClassifierMixin):
    def __init__(self):
        self.gaussian_params = {}
        self.class_probabilities = {}

    def fit(self, X, y):
        gaussian_params = {}
        class_probabilities = {}

        class_labels = np.unique(y)
        for c_label in class_labels:
            mask = y == c_label
            features = X[mask]

            # compute the probability of C as the relative frequency of the class
            c_freq = mask.sum() / len(mask)
            class_probabilities[c_label] = c_freq

            # compute the parameters of the Gaussian distribution per feature
            mean = features.mean(axis=0)
            std = features.std(axis=0)

            gaussian_params[c_label] = {}
            gaussian_params[c_label]["mean"] = mean
            gaussian_params[c_label]["std"] = std

        self.gaussian_params = gaussian_params
        self.class_probabilities = class_probabilities
        return self  # follow the Scikit-Learn convention of fit returning the estimator

    def predict(self, X):
        gaussian_params = self.gaussian_params
        class_probabilities = self.class_probabilities

        class_labels = np.array(list(class_probabilities.keys()))
        predictions = np.empty(shape=(X.shape[0], len(class_labels)))

        for i, c_label in enumerate(class_labels):
            mean = gaussian_params[c_label]["mean"]
            std = gaussian_params[c_label]["std"]
            # sum of log-likelihoods of each sample under the class-conditional Gaussians
            p_xi_given_C = np.log(norm.pdf(X, loc=mean, scale=std)).sum(axis=1)
            predictions[:, i] = p_xi_given_C + np.log(class_probabilities[c_label])

        # map the index of the highest log-probability back to the class label
        return class_labels[np.argmax(predictions, axis=1)]
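A nice side effect of inheriting from ClassifierMixin is that the class gets a score method (mean accuracy) for free, since the mixin only needs predict to be implemented. A quick usage sketch on the dataset generated above:
model = GaussianNaiveBayes()
model.fit(X, y)
train_accuracy = model.score(X, y)  # mean accuracy on the training data, provided by ClassifierMixin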
We test our model using K-fold cross-validation with 5 splits. We compute the ROC AUC score to measure classification quality (we assume the true positive rate and the false positive rate are equally important, so this metric is appropriate).
kf = KFold(n_splits=5)

roc_scores = []
for train_idx, test_idx in kf.split(X):
    model = GaussianNaiveBayes()

    X_train = X[train_idx]
    y_train = y[train_idx]

    X_test = X[test_idx]
    y_test = y[test_idx]

    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    score = roc_auc_score(y_true=y_test, y_score=y_pred)
    roc_scores.append(score)
Our naive model performs really well on this dataset, with a mean area under the curve of about 0.93.
np.array(roc_scores).mean()
0.927491169202372
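As a sanity check, we can run the same cross-validation loop with Scikit-Learn's own GaussianNB, which implements the same Gaussian Naive Bayes model; its scores should be very close to ours (small differences can come from details such as variance smoothing).
from sklearn.naive_bayes import GaussianNB

sk_scores = []
for train_idx, test_idx in kf.split(X):
    sk_model = GaussianNB()
    sk_model.fit(X[train_idx], y[train_idx])
    sk_pred = sk_model.predict(X[test_idx])
    sk_scores.append(roc_auc_score(y_true=y[test_idx], y_score=sk_pred))

np.array(sk_scores).mean()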
3. Strengths and Weaknesses
A summary of the strengths and weaknesses of this model.
Pros
- Very easy to implement and train.
- Low computational complexity: training scales linearly with the number of samples and the number of features.
- Good class estimator.
- Works well with small datasets.
- Seems to perform well on text classification.
Cons
- Assumes conditionally independent features, which may lead to poor performance in a dataset with correlated features.
- Not a very good probability estimator: the predicted probabilities tend to be poorly calibrated.
- Limited in its ability to capture complex relationships between the features.