This is a sample application built on top of Judgr, a naïve Bayes classifier library written in Clojure.
First, clone this repository and start a REPL by running lein repl in its root directory.
Then, load the core namespace:
user=> (use 'judgr-spam-demo.core)
nil
This repository comes with a few thousand messages for training and testing the classifier. See the License section below for further information.
The following command will train the classifier using the messages stored in data/training:
user=> (time (train!))
"Elapsed time: 20983.71 msecs"
This operation might take several seconds to finish.
Choose a few messages from data/testing and see if the classifier got them right:
user=> (.classify classifier (slurp "data/testing/TEST_XXXXX.eml"))
:spam
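To check more than one message at a time, the predictions can be tallied per class. The sketch below is a minimal illustration, not part of Judgr or this demo: `tally-predictions` is a hypothetical helper that takes any function from message text to a class keyword, so it can be tried with a stub before training.

```clojure
;; Hypothetical helper (not part of Judgr): applies classify-fn to each
;; message text and counts how many messages landed in each class.
(defn tally-predictions
  [classify-fn messages]
  (frequencies (map classify-fn messages)))

;; Stub classifier, used only to exercise the helper without training.
(defn stub-classify
  [text]
  (if (re-find #"viagra" text) :spam :ham))

(tally-predictions stub-classify ["buy viagra" "hello" "viagra now"])
;; => {:spam 2, :ham 1}
```

With the demo loaded and trained, the same helper should accept `#(.classify classifier %)` as `classify-fn` and the slurped contents of the files in data/testing as `messages`.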
If you are curious about how a specific feature is distributed between spam and ham messages:
user=> (.get-feature (.db classifier) "viagra")
{:feature "viagra", :total 42, :classes {:spam 41, :ham 1}}
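The raw counts in that map can be turned into per-class shares with plain Clojure. This is only a sketch: `class-shares` is a hypothetical helper, not a Judgr function, and the shares it returns are simple count ratios, not the smoothed probabilities the classifier computes internally.

```clojure
;; Hypothetical helper (not part of Judgr): each class's share of the
;; occurrences of a feature. Assumes :total is positive.
(defn class-shares
  [{:keys [total classes]}]
  (into {}
        (for [[cls n] classes]
          [cls (/ n total)])))

;; The map returned by the .get-feature call above.
(def viagra
  {:feature "viagra", :total 42, :classes {:spam 41, :ham 1}})

(class-shares viagra)
;; => {:spam 41/42, :ham 1/42}
```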
Use the judgr.cross-validation namespace to generate a confusion matrix and analyze the results.
The following example shows how to perform 10-fold cross-validation:
user=> (use 'judgr.cross-validation)
nil
user=> (def conf-matrix (k-fold-crossval 10 classifier))
#'user/conf-matrix
user=> (float (accuracy conf-matrix))
0.9310345
This operation might take several minutes to finish.
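For intuition, accuracy is the number of correctly classified messages divided by the total number of messages. The sketch below computes it by hand under an assumed matrix shape, a nested map of actual class to predicted class to count. Judgr's own representation may differ, and both `matrix-accuracy` and the counts in `example-matrix` are made up for illustration.

```clojure
;; Hypothetical helper (not Judgr's accuracy function).
;; Assumed shape: {actual-class {predicted-class count}}.
(defn matrix-accuracy
  [matrix]
  (let [total   (reduce + (mapcat vals (vals matrix)))
        ;; Correct predictions sit on the diagonal: actual = predicted.
        correct (reduce + (for [[actual row] matrix]
                            (get row actual 0)))]
    (when (pos? total)
      (/ correct total))))

;; Made-up counts, not results from this repository's dataset.
(def example-matrix
  {:spam {:spam 45, :ham 5}
   :ham  {:spam 3,  :ham 47}})

(matrix-accuracy example-matrix)
;; => 23/25
```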
The repository includes a subset of the CSDMC2010 SPAM corpus, which is one of the datasets for the data mining competition associated with ICONIP 2010.
Copyright for the text in the messages remains with the original senders.
The complete dataset can be downloaded here.
Copyright (C) Daniel Fernandes Martins
Distributed under the New BSD License. See COPYING for further details.