datamining-project
This is a project for FGCU's knowledge discovery and data mining course. It focuses on spam email classification.
Part 1 of the project is an implementation of a spam email classifier from scratch in Java 8, using a Naive Bayes classifier, and also a K-nearest-neighbors classifier.
Part 2 of the project is a pair of experiments in Scikit learn with various other algorithms.
Run instructions:
Ensure that the dataset meets the following criteria:
- All files must be .txt files
- All files must be stored alone in their directory with no other files
- All files be divided between two subdirectories,
test
andtraining
- All spam emails are expected to have a filename beginning with "sp"
Usage:
Part 1:
usage: Classify
data_path
algorithm_name (Either "knn" or "naivebayes")
[k] The K value to use - only used if algorithm is "knn"
Part 2:
usage: python experiments.py
There are no arguments - the program is merely a script which conducts experiments with various values.
Run instuctions:
Part 1:
- Ensure that you have Java JRE 8 installed.
- Compile the program using Gradle:
(From the project's part1 subdirectory)
./gradlew build
The generated Java class files will then be in the /build/classes/java/main directory. (The program can also be compiled like a standard java program if this has problems). - Run the program as
java Classify arguments
The arugments are described in the above section
Part 2:
- Ensure that you have Python 3.7 or greater installed. If you have both Python 2 and Python 3 installed, you will need to ensure you run the program using
python3
. - Ensure that you have pip installed. If you have both Python 2 and Python 3 installed, you will need to run pip commands using
pip3
. - Install scikit-learn:
pip install sklearn
- Install numpy:
pip install numpy
- Run the program as:
python experiments.py path_to_data