This project implements the Naive Bayes model, a probability-based machine learning algorithm, as my first project. Our API follows the scikit-learn library, on account of its clarity and simplicity. More details on the scikit-learn API can be found here.
Models for both categorical and continuous feature data are implemented. However, dataframes that mix the two feature types are not supported yet.
The Bayes Theorem:

$$P(h \mid d) = \frac{P(d \mid h)\,P(h)}{P(d)}$$

where $h \in H$ is a hypothesis (class label) and $d = (d_1, \dots, d_n)$ is an observation with $n$ features. Estimating the joint likelihood $P(d_1, \dots, d_n \mid h)$ directly is hard, especially when the dimension of the data is large. Naive Bayes therefore assumes the features are conditionally independent given the class; this condition is strong, but it makes estimation tractable:

$$P(d \mid h) = \prod_{i=1}^{n} P(d_i \mid h)$$

And finally, the Naive Bayes classification rule:

$$h^{*} = \underset{h \in H}{\arg\max} \; P(h) \prod_{i=1}^{n} P(d_i \mid h)$$

The description above applies to categorical data. For Gaussian Naive Bayes, we need another assumption: within each class $h$, feature $d_i$ follows a normal distribution,

$$P(d_i \mid h) = \frac{1}{\sqrt{2\pi\sigma_{i,h}^{2}}} \exp\!\left(-\frac{(d_i - \mu_{i,h})^{2}}{2\sigma_{i,h}^{2}}\right)$$
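To make the categorical classification rule concrete, here is a tiny hand-worked sketch. The classes, features, and all probability values below are made up purely for illustration; they are not taken from the project or any dataset.

```python
import numpy as np

# Hypothetical toy problem: classify weather ("sunny"/"rainy") from two
# categorical features, using made-up, hand-counted probabilities.
priors = {"sunny": 0.6, "rainy": 0.4}                       # P(h)
likelihoods = {                                             # P(d_i | h)
    "sunny": [{"hot": 0.7, "mild": 0.3}, {"dry": 0.8, "humid": 0.2}],
    "rainy": [{"hot": 0.2, "mild": 0.8}, {"dry": 0.3, "humid": 0.7}],
}

def classify(observation):
    # h* = argmax_h P(h) * prod_i P(d_i | h)
    scores = {
        h: priors[h] * np.prod([likelihoods[h][i][d]
                                for i, d in enumerate(observation)])
        for h in priors
    }
    return max(scores, key=scores.get)

print(classify(["hot", "dry"]))   # sunny: 0.6*0.7*0.8 = 0.336 beats rainy: 0.024
```

In practice implementations work with sums of log-probabilities instead of products, to avoid numerical underflow when $n$ is large.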
The Python module `model/naivebayes.py` contains the models for categorical and numerical data, `CategoricalNB` and `GaussianNB` respectively. You need to import this module into your main file in order to run this project.
We also provide a demonstration Jupyter notebook, which you can follow as a sample execution.
In this section, we describe the main functions of our implementation; the categorical and numerical models share this API. Since the API follows the scikit-learn library, it is composed of two main functions, `fit(X, y)` and `predict(X)`.
- `fit(X, y)`: Fit the model according to `X`, `y`.
  - Parameters:
    - `X`: array of shape `(n_samples, n_features)`
    - `y`: array of shape `(n_samples,)`

  This function calculates the statistics of the given data: $P(h)$ and $P(d_i|h) \ \forall i=1...n, h \in H$ for `CategoricalNB`; $P(h)$, $\mu$ and $\sigma \ \forall i=1...n, h \in H$ for `GaussianNB`.
- `predict(X)`: Based on the calculated statistics of the data, this function predicts labels for new observations.
  - Parameters:
    - `X`: array of shape `(n_samples, n_features)`
  - Returns:
    - `C`: array of shape `(n_samples,)`, the predictions for `X`
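To illustrate the `fit`/`predict` contract for the Gaussian case, here is a minimal self-contained sketch written against plain NumPy. It shows the shape of the API only; `TinyGaussianNB` and the toy data are invented for this example and are not the project's actual implementation in `model/naivebayes.py`.

```python
import numpy as np

class TinyGaussianNB:
    """Illustrative sketch of a Gaussian Naive Bayes fit/predict API."""

    def fit(self, X, y):
        # Estimate P(h), plus mu and sigma per class and per feature.
        self.classes_ = np.unique(y)
        self.priors_ = np.array([np.mean(y == c) for c in self.classes_])
        self.mu_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        self.sigma_ = np.array([X[y == c].std(axis=0) + 1e-9
                                for c in self.classes_])
        return self

    def predict(self, X):
        # Log Gaussian density, summed over features (the naive assumption),
        # plus the log prior; argmax over classes gives the prediction.
        var = self.sigma_[:, None] ** 2
        log_lik = (-0.5 * np.log(2 * np.pi * var)
                   - (X[None] - self.mu_[:, None]) ** 2 / (2 * var))
        scores = np.log(self.priors_)[:, None] + log_lik.sum(axis=-1)
        return self.classes_[np.argmax(scores, axis=0)]

# Two well-separated 1-D clusters as toy data.
X = np.array([[1.0], [1.2], [0.9], [5.0], [5.3], [4.8]])
y = np.array([0, 0, 0, 1, 1, 1])
model = TinyGaussianNB().fit(X, y)
print(model.predict(np.array([[1.1], [5.1]])))   # → [0 1]
```

The same call pattern, `model.fit(X, y)` followed by `model.predict(X_new)`, applies to both `CategoricalNB` and `GaussianNB` in this project.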
As mentioned above, we provide a demonstration Jupyter notebook as a sample; it also contains our experimental execution.
For evaluation, the Naive Bayes implementations of scikit-learn were used as the reference. The evaluation was done on three popular datasets:
Iris,
Breast Cancer
and Wine.
In terms of overall performance, our model achieved accuracy scores of 96.67%, 92.1% and 97.22% on the three datasets respectively.
The agreement ratio (identical predictions divided by total predictions) between our model and scikit-learn is, in turn, 100%, 96.49% and 100%.
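The agreement ratio above can be computed in one line with NumPy. The two prediction arrays below are invented for illustration; they are not the actual predictions from our experiments.

```python
import numpy as np

# Hypothetical predictions from two models on the same 8 test samples.
ours = np.array([0, 1, 2, 1, 0, 2, 1, 0])
reference = np.array([0, 1, 2, 2, 0, 2, 1, 0])

# Agreement ratio: fraction of positions where the predictions match.
agreement = np.mean(ours == reference)
print(f"{agreement:.2%}")   # 7 of 8 predictions match → 87.50%
```

An agreement of 100% means the two models gave identical predictions on every test sample, as we observed on Iris and Wine.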