Note: "Split_60_30_10" is a 60-30-10% split of the data: the first 60% for training the N-Gram models, the next 30% for training the main classifier (evaluated against the trained N-Gram models), and the final 10% for testing the main classifier. Rename (or create new) directories appropriately for different data splits.
See Data/DATA_NOTES for the specific message ranges for each data split.
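The message ranges used throughout this README for this split can be sanity-checked with a short script (the counts and boundaries below are the ones stated in the commands later in this file; trec07p contains 75,419 messages in total):

```python
# Message ranges for the Split_60_30_10 layout (1-based, inclusive),
# as used by the commands later in this README.
TOTAL = 75419
ranges = {
    "ngram_train":      (1,     45252),  # ~60%: trains the N-Gram models
    "classifier_train": (45253, 67877),  # ~30%: trains the main classifier
    "test":             (67878, 75419),  # ~10%: tests the main classifier
}
sizes = {name: hi - lo + 1 for name, (lo, hi) in ranges.items()}
assert sum(sizes.values()) == TOTAL
for name, n in sizes.items():
    print(f"{name}: {n} messages ({n / TOTAL:.1%})")
```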
Do all of the following from the base directory (where this README file is located).
- Download and extract the Trec 2007 data set into the project directory (link below).
- Download and build Weka and the Berkeley Language Model 1.1.6 from the links below. Keep the builds in the project directory, or otherwise edit all of the classpaths in the project scripts.
- Create the following directories if they do not exist:
mkdir -p Data/NGramTrain/Split_60_30_10/lower_chars
mkdir -p Data/NGramTrain/Split_60_30_10/lower_words
mkdir -p Data/NGramTrain/Split_60_30_10/upper_chars
mkdir -p Data/NGramTrain/Split_60_30_10/upper_words
mkdir -p Data/NGramTest/Split_60_30_10/lower_chars
mkdir -p Data/NGramTest/Split_60_30_10/lower_words
mkdir -p Data/NGramTest/Split_60_30_10/upper_chars
mkdir -p Data/NGramTest/Split_60_30_10/upper_words
mkdir -p Models/Split_60_30_10/Evaluations
- Now you are ready to start the preprocessing and experiment steps below.
- Use preprocessor.py to generate the filtered bag-of-words training set on the 60-90% data range:
python preprocessor.py trec07 45253 67877 Data/Split_60_30_10/BoW_bulk_train.arff -stopwords stopwords.txt
This can also be done using the CondorJobFiles/preprocess submit file.
- Similarly, generate the filtered bag-of-words testing set on the remaining 7542 (10%) emails:
python preprocessor.py trec07 67878 75419 Data/Split_60_30_10/BoW_bulk_test.arff -stopwords stopwords.txt
This can also be done using the CondorJobFiles/preprocess_test submit file.
- Run the convert script. This will automatically convert and standardize all the bag-of-words .arff data files generated in the last two steps, assuming they were named correctly:
Data/Split_60_30_10/BoW_bulk_train.arff -> Data/Split_60_30_10/BoW_std_train.arff
Data/Split_60_30_10/BoW_bulk_test.arff -> Data/Split_60_30_10/BoW_std_test.arff
This can also be done using the CondorJobFiles/convert submit file (but change the Java 8 path).
- Run the generate_ngram_files script. This will call preprocessor.py appropriately to generate all of the N-Gram sets from the training data and create separate test files for each message in the test set. It will create extra N-Gram training files for the first 45252 (60%) emails. The remaining 22625 (30%) of training messages will be used for evaluation on the N-Gram models. The files will be stored in the directories created above.
Four types of sets will be generated: lower_chars, lower_words, upper_chars, and upper_words. "lower" means all characters have been converted to lowercase; "upper" means they have not. The "words" sets generate N-Grams over words, whereas the "chars" sets generate N-Grams over the individual characters of the message.
For both the training set and the test set, each message will be stored individually in its own file and will be unlabeled for evaluation usage. However, for the first 60% of the messages, all emails will additionally be stored in two other separate files - one containing all spam messages and the other containing all ham messages - for the Berkeley LM classifier to learn a model from.
This step can also be done using the CondorJobFiles/preprocess_ngrams submit file.
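As a rough illustration of how the four set types differ (this is a standalone sketch, not the project's preprocessor.py, and treating spaces as character tokens is an assumption), each variant just tokenizes the message differently before N-Grams are counted:

```python
# Sketch of the four tokenizations: "lower" lowercases the message first,
# "upper" leaves case untouched; "words" splits on whitespace, while
# "chars" treats every character (including spaces, assumed here) as a token.
def tokenize(message, case, unit):
    text = message.lower() if case == "lower" else message
    return text.split() if unit == "words" else list(text)

msg = "Buy NOW"
print(tokenize(msg, "lower", "words"))  # ['buy', 'now']
print(tokenize(msg, "upper", "words"))  # ['Buy', 'NOW']
print(tokenize(msg, "lower", "chars"))  # ['b', 'u', 'y', ' ', 'n', 'o', 'w']
```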
- Run the build_ngram_models script. This will take all of the N-Gram data sets created from the previous step and generate .arpa and .binary model files in the Models/Split_60_30_10 directory. These files are used for evaluating test data against the N-Gram models.
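Conceptually, the evaluation these models support works like the following toy sketch (an add-one-smoothed unigram model, not the Berkeley LM API; all training strings are made up): a message is scored under the ham model and the spam model, and the higher log-probability wins.

```python
# Toy language-model classification: train one model per class on that
# class's training messages, then label a test message with whichever
# model assigns it the higher log-probability.
import math
from collections import Counter

def train_unigram(tokens):
    counts = Counter(tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 slot for unseen tokens (add-one smoothing)
    return lambda tok: math.log((counts[tok] + 1) / (total + vocab))

def score(model, tokens):
    return sum(model(t) for t in tokens)

ham_lm = train_unigram("meeting tomorrow at noon please confirm".split())
spam_lm = train_unigram("free money click now buy now free".split())

msg = "free money now".split()
label = "spam" if score(spam_lm, msg) > score(ham_lm, msg) else "ham"
print(label)  # spam
```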
By default, N (for the N-Gram parameter) is set to 3. You can pass a numerical argument to the script to change the value of N. You may want to modify the code by setting the types list to only include the model types you want to generate.
This step can also be done using the CondorJobFiles/build_ngram_models submit file. Changing N when using Condor requires modifying the parameters in the submit file.
NOTE: You will need to modify the Java 8 path in the build_ngram_models script.
- Now it's time to run the evaluation on the training data to set up an .arff file for the Weka classifier. Follow these steps:
- Edit the ngram_to_weka.config file to add the model types and N-values you wish to use for classification. The existing config file is documented, so follow those instructions.
- Run the ngram_to_weka.py script to evaluate all of the messages for each model and N-value:
python ngram_to_weka.py trec07p Data/NGramTrain/Split_60_30_10 1 45252 Models/Split_60_30_10 config Data/Split_60_30_10/ngram_train.arff
Note that this process will take a very long time to run.
NOTE: You will need to modify the Java 8 path at the top of the ngram_to_weka.py file.
- Edit the
This is a list of sources of data and tools.
DATASET: trec07p
http://plg.uwaterloo.ca/~gvcormac/treccorpus07/
TOOL: Weka 3.6.12
http://www.cs.waikato.ac.nz/ml/weka/downloading.html
TOOL: Berkeley Language Model 1.1.6
https://code.google.com/p/berkeleylm/
Download: svn checkout http://berkeleylm.googlecode.com/svn/trunk/ berkeleylm-1.1.6