JShivali / Spam-Classification-using-Naive-Bayes-Classifier


Problem statement

Implement a Naive Bayes classifier to classify mails as spam or not spam. I have not considered the grammatical structure of the mails when training the model.

Data

The data files are compressed to distribute them efficiently via GitHub. To extract, run:

tar -xzf data.tar.gz

Accuracy

The accuracy of the model is 86%.

Problem formulation:

I have used the bag-of-words model to represent every document as an unordered bag of words, without considering grammar or context.

  1. The bag-of-words dictionary has two sub-dictionaries, one for spam ('sp') and one for not spam ('nsp'). Each nested dictionary maps a word to its count across the spam or non-spam emails respectively. The training data is cleaned and then used to generate the bag of words: bag_of_word={'sp':{w1:count, w2:count},'nsp':{w1:count,w2:count}}

  2. The bag of words is generated by parsing all the files in the training data set and extracting the words from each mail. In the data-cleaning step, stop words, special characters, and numbers are removed from the mail text.

  3. The likelihood probability that a word occurs given spam is calculated as the number of times the word occurs in spam mails divided by the total number of words in spam plus the number of words in the vocabulary. Likelihoods given not spam are calculated the same way, and both are stored in a likelihood dictionary: likelihood={'sp':{},'nsp':{}}. A Python sketch of these steps follows this list.
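As a minimal sketch of steps 1–3, assuming the training mails sit in one directory of spam files and one of non-spam files; the directory layout, the stop-word list, and helper names such as build_bag_of_words are illustrative assumptions, not the repository's actual code:

```python
import os
import re

# Illustrative subset; the project's full stop-word list may differ.
STOP_WORDS = {'from', 'the', 'to', 'a', 'an', 'and', 'of', 'in', 'is'}

def clean(text):
    """Lowercase the text, keep alphabetic tokens only (drops numbers and
    special characters), and remove stop words."""
    words = re.findall(r'[a-z]+', text.lower())
    return [w for w in words if w not in STOP_WORDS]

def build_bag_of_words(spam_dir, nonspam_dir):
    """Return {'sp': {word: count}, 'nsp': {word: count}} over the training mails."""
    bag = {'sp': {}, 'nsp': {}}
    for label, directory in (('sp', spam_dir), ('nsp', nonspam_dir)):
        for name in os.listdir(directory):
            with open(os.path.join(directory, name), errors='ignore') as f:
                for word in clean(f.read()):
                    bag[label][word] = bag[label].get(word, 0) + 1
    return bag

def build_likelihoods(bag):
    """Laplace-smoothed P(word | class): (count + 1) / (words in class + |V|)."""
    vocab = set(bag['sp']) | set(bag['nsp'])
    likelihood = {'sp': {}, 'nsp': {}}
    for label in ('sp', 'nsp'):
        total = sum(bag[label].values())
        for word in vocab:
            likelihood[label][word] = (bag[label].get(word, 0) + 1) / (total + len(vocab))
    return likelihood
```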

Challenges:

In some cases, a word encountered in the test data has never been seen in the training data. Such a word would have a zero likelihood, making the entire product of likelihoods, and hence the posterior probability, zero. We assign it a likelihood of 0.01 in that case.

If a word has occurred only in non-spam mails, its likelihood given spam would be zero. To keep the entire product from becoming zero, we have used Laplace smoothing. (Ref: https://medium.com/syncedreview/applying-multinomial-naive-bayes-to-nlp-problems-a-practical-explanation-4f5271768ebf)
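With add-one (Laplace) smoothing, the likelihood from step 3 becomes:

P(w | spam) = (count(w, spam) + 1) / (N_spam + |V|)

where N_spam is the total number of words in spam mails and |V| is the vocabulary size, so a word that appears only in non-spam mails still gets a small non-zero likelihood given spam (and vice versa). The build_likelihoods sketch above applies this formula.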

Assumptions:

  1. The occurrences of words are assumed to be independent of each other (the "naive" assumption).
  2. Stop words like 'from', 'the', 'to', etc. are not considered for classification because they are equally likely to occur in both spam and not spam mails.

Implementation

  • Read and preprocess the data from the training set into the required format.
  • Generate bag of words.
  • Calculate the likelihood probabilities.
  • Calculate prior probabilities.
  • Test on the given test data (a sketch of the classification step follows).
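The classification step is not spelled out above; below is a minimal sketch under the independence assumption. Summing log-probabilities instead of multiplying raw likelihoods is a common safeguard against numeric underflow on long mails, and the 0.01 fallback for unseen words follows the Challenges section; the function interface itself is an assumption.

```python
import math

UNSEEN_LIKELIHOOD = 0.01  # fallback for words never seen in training (see Challenges)

def classify(words, likelihood, prior_sp, prior_nsp):
    """Return 'sp' or 'nsp' for a cleaned list of words from one mail."""
    # Posterior score = log prior + sum of log likelihoods (naive independence).
    score_sp = math.log(prior_sp)
    score_nsp = math.log(prior_nsp)
    for w in words:
        score_sp += math.log(likelihood['sp'].get(w, UNSEEN_LIKELIHOOD))
        score_nsp += math.log(likelihood['nsp'].get(w, UNSEEN_LIKELIHOOD))
    return 'sp' if score_sp > score_nsp else 'nsp'
```

With priors estimated as the fraction of training mails in each class, classify(clean(mail_text), likelihood, prior_sp, prior_nsp) labels a single mail.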
