nowshad-sust / enron-sender-detection

A simple machine learning approach to detecting the sender of an email from its body, using the well-known Enron dataset.

Prerequisites

  • Python
  • Anaconda
  • scikit-learn
  • other dependencies

How to run

  1. Clone this repository: git clone https://github.com/nowshad-sust/enron-sender-detection.git
  2. Download the Enron dataset: https://www.cs.cmu.edu/~./enron/enron_mail_20150507.tgz
  3. Extract the dataset (the maildir folder) into the cloned project folder.
  4. Create a folder named remail in the project directory.
  5. Open a terminal or cmd in the project directory.
  6. Run the copy_sent_mails.py script: python copy_sent_mails.py. This should make a directory named remail in the project folder and copy all the sent mails from the original dataset directory into it.
  7. Run naive_bayes_pipeline.py: python naive_bayes_pipeline.py. This should print a single number, the validation success rate (accuracy).
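
For readers who want to see what step 6 amounts to, here is a minimal sketch of a sent-mail copy, not the repository's actual copy_sent_mails.py. It assumes the usual Enron layout of maildir/<user>/<folder>/<message> and assumes sent mail lives in folders named sent, sent_items, or _sent_mail; the function name copy_sent_mails and the remail/<user>/ output layout are hypothetical.

```python
import shutil
from pathlib import Path

# Assumption: these are the folder names that hold sent mail in the Enron maildir.
SENT_FOLDERS = ("sent", "sent_items", "_sent_mail")


def copy_sent_mails(maildir, remail):
    """Copy each user's sent messages into remail/<user>/ and return how many were copied."""
    maildir, remail = Path(maildir), Path(remail)
    copied = 0
    for user_dir in sorted(p for p in maildir.iterdir() if p.is_dir()):
        for folder in SENT_FOLDERS:
            src = user_dir / folder
            if not src.is_dir():
                continue
            dest = remail / user_dir.name
            dest.mkdir(parents=True, exist_ok=True)
            for mail in sorted(f for f in src.iterdir() if f.is_file()):
                # Prefix with the folder name so files from different
                # sent folders of the same user do not overwrite each other.
                shutil.copy2(mail, dest / f"{folder}_{mail.name}")
                copied += 1
    return copied
```

Under these assumptions, calling copy_sent_mails("maildir", "remail") from the project directory would leave one sub-folder per sender inside remail, so the folder name can serve as the class label for each message.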

Latest Statistics (accuracy)

  • Naive Bayes classifier ~ 0.46
  • SVM ~ 0.79
  • SVM with grid search ~ 0.85
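
The three setups listed above can be sketched with scikit-learn roughly as follows. This is an illustration on toy data, not the repository's code: the TF-IDF features, the choice of LinearSVC as the SVM, the grid of C values, and the helper names build_pipeline and tune_svm are all assumptions, and the toy messages and sender names are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def build_pipeline(classifier):
    """Turn mail bodies into TF-IDF vectors, then classify by sender (assumed features)."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(sublinear_tf=True)),
        ("clf", classifier),
    ])


def tune_svm(bodies, senders, cv=2):
    """Grid-search the SVM regularisation strength C (hypothetical grid)."""
    search = GridSearchCV(
        build_pipeline(LinearSVC()),
        param_grid={"clf__C": [0.1, 1.0, 10.0]},
        cv=cv,
    )
    search.fit(bodies, senders)
    return search


# Made-up stand-in for the mail bodies and sender labels read from remail/.
bodies = [
    "please review the gas trading contract",
    "gas trading numbers for the contract attached",
    "the trading desk needs the gas contract",
    "review the gas contract before trading",
    "lunch meeting moved to friday afternoon",
    "friday lunch meeting is confirmed",
    "can we move the meeting to friday lunch",
    "meeting after lunch on friday works",
]
senders = ["alice"] * 4 + ["bob"] * 4

naive_bayes = build_pipeline(MultinomialNB()).fit(bodies, senders)
svm = build_pipeline(LinearSVC()).fit(bodies, senders)
tuned_svm = tune_svm(bodies, senders)
```

On the real dataset, the accuracy would be measured on held-out messages (for example with sklearn.model_selection.train_test_split followed by the pipeline's score method) rather than on the training data.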

License: MIT License

