Spam Filter

Simple spam filter trained with multinomial Naive Bayes.

Corpus

SpamAssassin and Enron-Spam were used when developing this spam filter. The filter expects a directory structure containing e-mail files in plain text. Every e-mail has its headers removed (except for the subject), and any HTML tags or URLs in the body are also removed.
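
The cleaning step described above could look roughly like the sketch below. It is not the code from this repository: to stay free of third-party dependencies it uses the standard library's `email` and `html.parser` modules in place of BeautifulSoup, and the names `clean_email`, `strip_html` and `URL_RE` are hypothetical. It also assumes single-part plain-text messages.

```python
import re
from email import message_from_string
from html.parser import HTMLParser

# Hypothetical pattern: anything starting with http://, https:// or www.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and drops the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text):
    """Return the text content of `text` with all HTML tags removed."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return " ".join(parser.parts)


def clean_email(raw):
    """Keep only the subject and the body, without HTML tags or URLs."""
    msg = message_from_string(raw)
    subject = msg.get("Subject", "")
    # Assumption: multipart messages are ignored in this sketch.
    payload = msg.get_payload()
    body = payload if isinstance(payload, str) else ""
    # Strip tags first so that a URL followed by a closing tag is matched cleanly.
    body = URL_RE.sub(" ", strip_html(body))
    # Collapse runs of whitespace left behind by the removals.
    return " ".join((subject + " " + body).split())
```

Calling `clean_email` on the raw text of one message returns a single whitespace-normalized string that can be passed on to feature extraction.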

Results

When SpamAssassin and Enron-Spam are combined into one corpus, the filter appears to perform almost too well in terms of precision and recall. However, performance drops when training on Enron-Spam and evaluating on SpamAssassin.
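
For reference, precision and recall for the spam class can be computed as in the minimal sketch below. The function name `precision_recall` and the string labels `"spam"` and `"ham"` are assumptions made for illustration; the project itself may rely on sklearn's metrics instead.

```python
def precision_recall(y_true, y_pred, positive="spam"):
    """Return (precision, recall) for the class named by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Precision is the fraction of e-mails flagged as spam that really are spam, while recall is the fraction of all spam that was flagged.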

The Enron-Spam corpus is collected from 150 employees working at Enron, whereas SpamAssassin is a collection of donated e-mails. It could be that Enron-Spam does not provide enough variation for the spam filter to generalize.

The feature methods tried give only minor improvements; the largest improvement comes from bi-grams.
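
As an illustration of bi-gram features with multinomial Naive Bayes, the sketch below uses sklearn's `CountVectorizer` with `ngram_range=(1, 2)`. The toy texts and labels are made up, and the exact vectorizer settings used in this repository are not stated here, so treat the parameters as assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy data, only to show the mechanics.
train_texts = [
    "win free money now",
    "free money offer",
    "meeting at noon tomorrow",
    "project report attached",
]
train_labels = ["spam", "spam", "ham", "ham"]

# ngram_range=(1, 2) counts single words and pairs of adjacent words.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(train_texts, train_labels)

predictions = model.predict(["free money", "meeting tomorrow"])
vocabulary = model.named_steps["countvectorizer"].vocabulary_
```

With bi-grams enabled, a phrase such as "free money" becomes a feature of its own in addition to the separate words "free" and "money".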

Potential problems and future work

Some e-mails appear to be encrypted or to contain attachments. These are not removed and could potentially affect the results.

The feature vectors have a very large number of dimensions, which most likely needs to be reduced for better results. It could also be interesting to use an SVM instead of Naive Bayes.
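
One possible way to try both ideas is sketched below: chi-squared feature selection followed by a linear SVM. Nothing here is part of the repository. The choice of `SelectKBest` with `chi2`, the value `k=5` and the toy data are all assumptions for illustration, and other reduction methods may suit the corpus better.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up toy data, only to show the mechanics.
texts = [
    "win free money now",
    "free money offer",
    "meeting at noon tomorrow",
    "project report attached",
]
labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    SelectKBest(chi2, k=5),       # keep the 5 features most associated with the labels
    LinearSVC(random_state=0),    # linear SVM in place of Naive Bayes
)
model.fit(texts, labels)

# Apply every step except the classifier to inspect the reduced feature matrix.
reduced = model[:-1].transform(texts)
```

On a real corpus, `k` would be far larger and is best chosen by cross-validation.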

Lastly, it would also be interesting to explore another corpus.

Dependencies

BeautifulSoup (with lxml) is used to remove HTML tags, and sklearn is used for machine-learning-specific tasks.

Run

Assuming the datasets are available as specified in train.py and the desired random seed is set in ml.py, run:

$ python train.py
