email-spam-classifier natural-language-processing nlp spam-detection support-vector-machines

Email Spam Classifier using SVM

Run the code

Download all the files into a single folder.
Open Octave and change to that folder.
Run the "Main.m" file.

Technical Details

This project implements the following email preprocessing and normalization steps:

• Lower-casing: The entire email is converted to lower case.

• Stripping HTML: All HTML tags are removed from the emails.

• Normalizing URLs: All URLs are replaced with the text "httpaddr".

• Normalizing Email Addresses: All email addresses are replaced with the text "emailaddr".

• Normalizing Numbers: All numbers are replaced with the text "number".

• Normalizing Dollars: All dollar signs ($) are replaced with the text "dollar".

• Word Stemming: Words are reduced to their stemmed form.

• Removal of non-words: Non-words and punctuation have been removed.
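
The steps above can be sketched roughly as follows. This is an illustrative Python version, not the repository's Octave code: the function name `normalize_email`, the regular expressions, and the order of the substitutions are assumptions, and word stemming is left out because it needs a Porter stemmer implementation.

```python
import re


def normalize_email(text):
    """Normalize raw email text and return a list of word tokens.

    Hypothetical helper: the regexes are assumptions, and the stemming
    step from the list above is intentionally omitted.
    """
    text = text.lower()                                  # lower-casing
    text = re.sub(r"<[^<>]+>", " ", text)                # strip HTML tags
    text = re.sub(r"https?://\S+", " httpaddr ", text)   # normalize URLs
    text = re.sub(r"\S+@\S+", " emailaddr ", text)       # normalize email addresses
    text = re.sub(r"\d+", " number ", text)              # normalize numbers
    text = re.sub(r"\$+", " dollar ", text)              # normalize dollar signs
    # Remove non-words and punctuation: keep only runs of letters.
    return [tok for tok in re.split(r"[^a-z]+", text) if tok]


sample = "<p>Visit http://example.com NOW or mail Bob@Example.com, only $100!</p>"
tokens = normalize_email(sample)
```

URLs and email addresses are replaced before numbers here so that digits inside an address are not rewritten separately; a different order would give slightly different tokens.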

The vocabulary list was built by selecting every word that occurs at least 100 times in the spam corpus, which gives a list of 1899 words.
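
A minimal sketch of this kind of frequency cut-off, written in Python for illustration. The function name `build_vocabulary` and the alphabetical ordering of the result are assumptions, and the toy corpus uses a threshold of 2 instead of 100 so the effect is visible.

```python
from collections import Counter


def build_vocabulary(tokenized_emails, min_count=100):
    """Return the sorted list of words occurring at least min_count times.

    Hypothetical helper: counts total occurrences across all emails,
    not the number of emails a word appears in.
    """
    counts = Counter(tok for email in tokenized_emails for tok in email)
    return sorted(word for word, n in counts.items() if n >= min_count)


# Toy corpus of already-normalized emails (made up for this example).
corpus = [["free", "money", "now"], ["free", "offer"], ["free", "money"]]
vocab = build_vocabulary(corpus, min_count=2)
```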

About

A linear classifier built on a Support Vector Machine (SVM) that determines whether an email is spam with an accuracy of 98.7%. Regularization is used to prevent overfitting. Emails are pre-processed with the Porter stemming algorithm, and a spam vocabulary is used to build a feature vector for each email. The program prints the top 15 predictors of spam.
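
To make the pipeline concrete, here is a small Python sketch of how a vocabulary-based feature vector, a linear decision rule, and a "top predictors" ranking can fit together. It is not the repository's Octave code. The binary (present or absent) encoding, the function names, the decision threshold at zero, and the example weights are all assumptions; in the real project the weights come from training the SVM.

```python
def feature_vector(tokens, vocabulary):
    """Binary vector: 1 if the vocabulary word appears in the email, else 0.

    Assumption: presence/absence encoding rather than word counts.
    """
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


def predict_spam(x, weights, bias=0.0):
    """Linear decision rule: spam when w . x + b is non-negative."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return score >= 0


def top_predictors(weights, vocabulary, k=15):
    """Return the k vocabulary words with the largest weights."""
    ranked = sorted(zip(weights, vocabulary), key=lambda p: p[0], reverse=True)
    return [word for _, word in ranked[:k]]


# Made-up vocabulary and weights, for illustration only.
vocabulary = ["click", "free", "hello", "meeting"]
weights = [0.5, 0.9, -0.2, -0.7]

x = feature_vector(["free", "free", "hello"], vocabulary)
is_spam = predict_spam(x, weights)
top_two = top_predictors(weights, vocabulary, k=2)
```

With a linear model, a large positive weight means the word pushes the score toward the spam side, which is why sorting the weights yields the strongest spam indicators.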

MIT License

Languages

MATLAB 100.0%