rteammco / nlp2015-spam

Spam Email Classification: final project for CS388L (Natural Language Processing) - Spring 2015.


NLP Spam Detection Project

Note: "Split_60_30_10" is a 60-30-10% split of the data: 60% for training the N-Gram models, 30% for training the main classifier on the next 30% of the data (evaluated on the trained N-Gram models), and 10% for testing the main classifier. Rename (or create new) directories appropriately for different data splits.

See Data/DATA_NOTES for specific message ranges for each data split.
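For quick reference, the concrete message ranges used by the commands in this README can be sanity-checked with a few lines of Python. A minimal sketch (the boundaries and the total of 75419 messages are taken from the commands quoted below):

```python
# Message index ranges for Split_60_30_10, as used by the commands in this
# README. trec07p messages are numbered 1 through 75419.
TOTAL = 75419
RANGES = {
    "N-Gram training (60%)":     (1, 45252),
    "classifier training (30%)": (45253, 67877),
    "classifier testing (10%)":  (67878, 75419),
}

for name, (lo, hi) in RANGES.items():
    count = hi - lo + 1
    print(f"{name}: messages {lo}-{hi} ({count} messages, {100 * count / TOTAL:.1f}%)")
```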

Setup

Do all of the following from the base directory (where this README file is located).

  1. Download and extract the trec07p (TREC 2007) data set into the project directory (link below).
  2. Download and build Weka and the Berkeley Language Model 1.1.6 from the links below. Keep the builds in the project directory, or otherwise edit all of the classpaths in the project scripts.
  3. Create the following directories if they do not exist (a Python equivalent is sketched after these setup steps):
    1. mkdir -p Data/NGramTrain/Split_60_30_10/lower_chars
    2. mkdir -p Data/NGramTrain/Split_60_30_10/lower_words
    3. mkdir -p Data/NGramTrain/Split_60_30_10/upper_chars
    4. mkdir -p Data/NGramTrain/Split_60_30_10/upper_words
    5. mkdir -p Data/NGramTest/Split_60_30_10/lower_chars
    6. mkdir -p Data/NGramTest/Split_60_30_10/lower_words
    7. mkdir -p Data/NGramTest/Split_60_30_10/upper_chars
    8. mkdir -p Data/NGramTest/Split_60_30_10/upper_words
    9. mkdir -p Models/Split_60_30_10/Evaluations
  4. Now you are ready to start the preprocessing and experiment steps below.
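If you would rather script step 3 than type the mkdir commands by hand, here is a minimal Python equivalent (same paths as above):

```python
import os

# Create the directory tree from Setup step 3 (idempotent, like `mkdir -p`).
SPLIT = "Split_60_30_10"
VARIANTS = ["lower_chars", "lower_words", "upper_chars", "upper_words"]

dirs = [os.path.join("Data", phase, SPLIT, variant)
        for phase in ("NGramTrain", "NGramTest")
        for variant in VARIANTS]
dirs.append(os.path.join("Models", SPLIT, "Evaluations"))

for d in dirs:
    os.makedirs(d, exist_ok=True)
    print("created", d)
```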

Bag of Words Preprocessing and Experiments

  1. Use preprocessor.py to generate the filtered bag-of-words training set from the 60-90% data range (messages 45253-67877, the classifier-training portion):
    python preprocessor.py trec07 45253 67877 Data/Split_60_30_10/BoW_bulk_train.arff -stopwords stopwords.txt
    This can also be done using the Condor CondorJobFiles/preprocess submit file.
  2. Similarly, generate the filtered bag-of-words testing set from the remaining 7542 (10%) emails:
    python preprocessor.py trec07 67878 75419 Data/Split_60_30_10/BoW_bulk_test.arff -stopwords stopwords.txt
    This can also be done using the Condor CondorJobFiles/preprocess_test submit file.
  3. Run the convert script. This will automatically convert and standardize all the bag-of-words .arff data files generated in the last two steps, assuming they were named correctly:
    Data/Split_60_30_10/BoW_bulk_train.arff -> Data/Split_60_30_10/BoW_std_train.arff
    Data/Split_60_30_10/BoW_bulk_test.arff -> Data/Split_60_30_10/BoW_std_test.arff
    This can also be done using the Condor CondorJobFiles/convert submit file (but change the Java 8 path). A sketch of the underlying Weka standardization call appears after these steps.
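For orientation: Weka filters have a batch mode that processes a train/test pair together, so the test set is scaled with statistics computed on the training set. The sketch below assumes the convert script applies Weka's unsupervised Standardize filter, and assumes a jar location; the convert script in this repository is the authoritative version:

```python
import subprocess

# Hedged illustration of the convert/standardize step: run Weka's Standardize
# filter in batch mode on the train/test ARFF pair. Jar path and filter choice
# are assumptions; see the convert script for what actually runs.
WEKA_JAR = "weka-3.6.12/weka.jar"  # assumption: adjust to your Weka build
SPLIT = "Data/Split_60_30_10"

subprocess.check_call([
    "java", "-cp", WEKA_JAR,
    "weka.filters.unsupervised.attribute.Standardize",
    "-b",                                  # batch mode: filter train and test together
    "-i", f"{SPLIT}/BoW_bulk_train.arff",  # training input
    "-o", f"{SPLIT}/BoW_std_train.arff",   # standardized training output
    "-r", f"{SPLIT}/BoW_bulk_test.arff",   # test input
    "-s", f"{SPLIT}/BoW_std_test.arff",    # standardized test output
])
```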

N-Gram Preprocessing and Experiments

  1. Run the generate_ngram_files script. This calls preprocessor.py to generate all of the n-gram data sets from the training data and to create a separate test file for each message in the test set. N-Gram training files are created for the first 45252 (60%) emails; the next 22625 (30%) training messages are reserved for evaluation against the trained N-Gram models. The files are stored in the directories created above.
    Four types of sets are generated: lower_chars, lower_words, upper_chars, and upper_words. "lower" means all characters have been converted to lowercase; "upper" means case is preserved. The "words" sets are input for word-level N-Grams, while the "chars" sets are for N-Grams over the individual characters of each message. A toy illustration of these four variants appears after these steps.
    For both the training set and the test set, each message is stored unlabeled in its own file for evaluation. In addition, for the first 60% of the messages, all emails are collected into two aggregate files - one containing all spam messages, the other all ham messages - from which the Berkeley LM classifier learns its models.
    This step can also be done using the Condor CondorJobFiles/preprocess_ngrams submit file.
  2. Run the build_ngram_models script. This takes all of the N-Gram data sets created in the previous step and generates .arpa and .binary model files in the Models/Split_60_30_10 directory. These files are used for evaluating test data against the N-Gram models (an illustrative Berkeley LM invocation is sketched after these steps).
    By default, N (the N-Gram order) is 3; pass a numeric argument to the script to change it. To generate only certain model types, edit the types list in the script.
    This step can also be done using the Condor CondorJobFiles/build_ngram_models submit file; changing N under Condor requires editing the parameters in that submit file.
    NOTE: You will need to modify the Java 8 path in the build_ngram_models script.
  3. Now it's time to run the evaluation on the training data to set up an .arff file for the Weka classifier. Follow these steps:
    1. Edit the ngram_to_weka.config file to add which model types and N-values you wish to use for classification. The existing config file is documented, so follow those instructions.
    2. Run the ngram_to_weka.py script to evaluate all of the messages for each model and N-value:
      python ngram_to_weka.py trec07p Data/NGramTrain/Split_60_30_10 1 45252 Models/Split_60_30_10 config Data/Split_60_30_10/ngram_train.arff
      Note that this process will take a very long time to run. A toy sketch of the underlying feature extraction appears after these steps.
      NOTE: You will need to modify the Java 8 path at the top of the ngram_to_weka.py file.
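As a toy illustration of the four preprocessing variants from step 1 (the real output format written by preprocessor.py may differ):

```python
# Toy illustration of the lower/upper x chars/words variants from step 1.
msg = "Buy NOW and save!"

lower_words = msg.lower().split()  # input for lowercased word N-Grams
upper_words = msg.split()          # input for case-preserving word N-Grams
lower_chars = list(msg.lower())    # input for lowercased character N-Grams
upper_chars = list(msg)            # input for case-preserving character N-Grams

print(lower_words)      # ['buy', 'now', 'and', 'save!']
print(lower_chars[:7])  # ['b', 'u', 'y', ' ', 'n', 'o', 'w']
```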
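For step 2, Berkeley LM 1.1.6 ships command-line entry points for estimating a Kneser-Ney ARPA model from text and binarizing it. The sketch below shows what one such invocation could look like; the Java path, jar location, N, and file names are assumptions, and build_ngram_models remains the authoritative version:

```python
import subprocess

# Hedged sketch of building one N-Gram model with Berkeley LM 1.1.6. The
# entry-point class names ship with berkeleylm; everything else here
# (paths, N, file names) is an assumption.
JAVA = "/usr/lib/jvm/java-8-openjdk/bin/java"   # assumption: your Java 8 path
LM_JAR = "berkeleylm-1.1.6/jar/berkeleylm.jar"  # assumption: your build location
N = 3
train_text = "Data/NGramTrain/Split_60_30_10/lower_words/spam_all.txt"  # hypothetical name
arpa = "Models/Split_60_30_10/lower_words_spam_3.arpa"                  # hypothetical name
binary = "Models/Split_60_30_10/lower_words_spam_3.binary"              # hypothetical name

# Estimate a Kneser-Ney smoothed N-gram model in ARPA format.
subprocess.check_call([JAVA, "-ea", "-server", "-cp", LM_JAR,
                       "edu.berkeley.nlp.lm.io.MakeKneserNeyArpaFromText",
                       str(N), arpa, train_text])

# Convert the ARPA model to Berkeley LM's compact binary format.
subprocess.check_call([JAVA, "-ea", "-server", "-cp", LM_JAR,
                       "edu.berkeley.nlp.lm.io.MakeLmBinaryFromArpa",
                       arpa, binary])
```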
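For step 3, the essence of ngram_to_weka.py is turning each message into a vector of language-model scores that Weka can train on. A toy sketch of that idea; score_message() is a hypothetical stand-in for however the script actually queries the Berkeley LM models:

```python
# Toy sketch of the feature idea behind ngram_to_weka.py: each message becomes
# one ARFF row of (spam-model log-prob, ham-model log-prob, label).
def score_message(model_path, message_path):
    """Return log P(message | model). Hypothetical stand-in: the real script
    obtains this from the Berkeley LM .binary models."""
    raise NotImplementedError("wire this to your LM evaluation of choice")

def write_ngram_arff(messages, spam_model, ham_model, out_path):
    """messages: iterable of (message_file, 'spam' or 'ham') pairs."""
    with open(out_path, "w") as out:
        out.write("@relation ngram_scores\n")
        out.write("@attribute spam_logprob numeric\n")
        out.write("@attribute ham_logprob numeric\n")
        out.write("@attribute class {spam,ham}\n")
        out.write("@data\n")
        for msg_path, label in messages:
            spam_lp = score_message(spam_model, msg_path)
            ham_lp = score_message(ham_model, msg_path)
            out.write(f"{spam_lp},{ham_lp},{label}\n")
```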

Data and Tool Resources

This is a list of sources of data and tools.

DATASET: trec07p
http://plg.uwaterloo.ca/~gvcormac/treccorpus07/

TOOL: Weka 3.6.12
http://www.cs.waikato.ac.nz/ml/weka/downloading.html

TOOL: Berkeley Language Model 1.1.6
https://code.google.com/p/berkeleylm/
Download: svn checkout http://berkeleylm.googlecode.com/svn/trunk/ berkeleylm-1.1.6
