williamcfrancis / Email-Spam-Classifier-using-SVM

A linear classifier built on a Support Vector Machine (SVM) that labels an email as spam or not spam with an accuracy of 98.7%. Regularization is used to limit overfitting. Emails are preprocessed with the Porter stemmer algorithm, and a spam vocabulary is used to build a feature vector for each email. The program also prints the top 15 predictors of spam.
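
As a rough illustration of the idea, here is a minimal Python sketch of a regularized linear SVM and of ranking vocabulary words by weight. It is not the repository's Octave code. The function names (`train_linear_svm`, `predict`, `top_predictors`), the parameters `C`, `lr` and `epochs`, the toy data, and the choice of batch subgradient descent as the optimizer are all assumptions made for this example; the project may use a different solver.

```python
def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Minimize 0.5*||w||^2 + C * sum(max(0, 1 - y_i*(w.x_i + b)))
    by batch subgradient descent. Labels must be +1 (spam) or -1 (not spam).
    A smaller C means stronger regularization."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)  # gradient of the regularizer 0.5*||w||^2
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # only margin violators contribute to the hinge term
                for j, xj in enumerate(xi):
                    grad_w[j] -= C * yi * xj
                grad_b -= C * yi
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 (spam) or -1 (not spam) for one feature vector."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


def top_predictors(w, vocab, k=15):
    """Return the k (word, weight) pairs with the largest positive weights."""
    order = sorted(range(len(w)), key=lambda j: w[j], reverse=True)
    return [(vocab[j], w[j]) for j in order[:k]]


# Toy data (hypothetical): columns follow the order of `vocab`.
vocab = ["click", "dollar", "meeting", "report"]
X = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0]]
y = [1, 1, -1, -1]

w, b = train_linear_svm(X, y)
for word, weight in top_predictors(w, vocab, k=2):
    print(word, round(weight, 3))
```

In a linear model the weight of each vocabulary word shows how strongly its presence pushes an email toward the spam class, which is why sorting the weights yields the top predictors.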

Email Spam Classifier using SVM

Run the code

  1. Download all the files into a single folder
  2. Open Octave and change to the folder that contains the downloaded files
  3. Run the "Main.m" file

Technical Details

This project implements the following email preprocessing and normalization steps:

• Lower-casing: The entire email is converted to lower case.

• Stripping HTML: All HTML tags are removed from the emails.

• Normalizing URLs: All URLs are replaced with the text "httpaddr".

• Normalizing Email Addresses: All email addresses are replaced with the text "emailaddr".

• Normalizing Numbers: All numbers are replaced with the text "number".

• Normalizing Dollars: All dollar signs ($) are replaced with the text "dollar".

• Word Stemming: Words are reduced to their stemmed form.

• Removal of non-words: Non-words and punctuation are removed.
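
The steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the repository's Octave code: the function name `normalize_email`, the regular expressions, and the order of the substitutions are assumptions. Stemming is left as a pluggable `stem` callable that defaults to doing nothing, since a Porter stemmer is too long to include here.

```python
import re


def normalize_email(text, stem=lambda word: word):
    """Apply the normalization steps listed above and return a list of tokens.
    `stem` is a hypothetical hook for a Porter stemmer; by default it is a no-op."""
    text = text.lower()                                 # lower-casing
    text = re.sub(r"<[^<>]+>", " ", text)               # strip HTML tags
    text = re.sub(r"https?://\S*", "httpaddr", text)    # normalize URLs
    text = re.sub(r"\S+@\S+", "emailaddr", text)        # normalize email addresses
    text = re.sub(r"[0-9]+", "number", text)            # normalize numbers
    text = re.sub(r"[$]+", "dollar", text)              # normalize dollar signs
    tokens = re.split(r"[^a-z]+", text)                 # drop punctuation and non-words
    return [stem(t) for t in tokens if t]


sample = "<b>Win $100</b> now! Visit http://example.com or mail bob@example.com."
print(normalize_email(sample))
# ['win', 'dollarnumber', 'now', 'visit', 'httpaddr', 'or', 'mail', 'emailaddr']
```

Note that "$100" becomes the single token "dollarnumber" because the replacement texts are inserted without surrounding spaces, as the list above describes.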

The vocabulary list was built by selecting every word that occurs at least 100 times in the spam corpus, which yields a list of 1899 words.
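
A minimal Python sketch of this selection, and of turning an email into a feature vector over the resulting vocabulary, might look like the following. It is not the repository's Octave code. The names `build_vocabulary` and `feature_vector` are hypothetical, and the binary encoding (1 if the word appears in the email, 0 otherwise) is an assumption, since the exact encoding is not stated here.

```python
from collections import Counter


def build_vocabulary(tokenized_emails, min_count=100):
    """Keep every word whose total count across the corpus is at least min_count."""
    counts = Counter(tok for email in tokenized_emails for tok in email)
    return sorted(word for word, c in counts.items() if c >= min_count)


def feature_vector(tokens, vocab):
    """Binary feature vector (assumed encoding): entry i is 1 if vocab[i] occurs in the email."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocab]


# Tiny corpus with a lowered threshold so the example stays readable.
corpus = [["buy", "now", "buy"], ["now", "click"], ["meeting"]]
vocab = build_vocabulary(corpus, min_count=2)
print(vocab)                                            # ['buy', 'now']
print(feature_vector(["click", "now", "now"], vocab))   # [0, 1]
```

With the threshold of 100 and the 1899-word list described above, each email would map to a vector of length 1899.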

License: MIT License


Languages

Language: MATLAB 100.0%