This is the GitHub repository for the corona psychopathology project.

To download it, either download it via GitHub or go to the terminal and run the following, which will clone it to the current location:

git clone https://github.com/centre-for-humanities-computing/corona-psychopathology
Before you start, you might need to create a virtual environment in Anaconda or similar. If you have multiple versions of Python installed, please make sure to use `python3` instead of `python` in the following commands.
To create the test and train set, run the script create_test_train.py as follows:
$ python create_test_train.py --data test_data.csv --text_column text --label_column labels --perc_test 0.3 --resample over
This should create two files called `train.csv` and `test.csv`, with 30% of the data in the test set and the data resampled according to the `--resample` argument, which can be one of the following (see the sketch after this list):

- `over`: random oversampling to match the majority category
- `under`: random undersampling to match the majority distribution
- `smote`: (NEW) Synthetic Minority Oversampling Technique (SMOTE)

Leave the argument out if you don't want to resample at all.
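To make the options concrete, here is a minimal sketch of the three resampling methods using the imbalanced-learn package. This is an illustration of what each option does, not the repository's actual code (which may implement resampling differently):

```python
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Toy data: class 1 is the majority (3 samples), class 0 the minority (2).
X = [[0.2], [0.3], [0.9], [1.1], [1.4]]
y = [0, 0, 1, 1, 1]

X_over, y_over = RandomOverSampler().fit_resample(X, y)     # 'over'
X_under, y_under = RandomUnderSampler().fit_resample(X, y)  # 'under'
# 'smote' synthesizes new minority points by interpolating between
# neighbours; k_neighbors must be smaller than the minority class size,
# and SMOTE needs numeric features (here, the vectorized text).
X_smote, y_smote = SMOTE(k_neighbors=1).fit_resample(X, y)
```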
The remainder of the arguments are listed below; a sketch of what the split amounts to follows the list.

- `--data`: the data to split into a train and test dataset
- `--text_column`: the column indicating the text. Default is `text`
- `--label_column`: the column indicating the labels. Default is `labels`
- `--perc_test`: the percentage of the data which should be in the test set
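For reference, a stratified split along these lines can be done with pandas and scikit-learn. This is a sketch of the general idea, not the actual contents of create_test_train.py (which additionally handles the resampling):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("test_data.csv")
train, test = train_test_split(
    df,
    test_size=0.3,           # corresponds to --perc_test 0.3
    stratify=df["labels"],   # keep the label proportions in both splits
    random_state=42,
)
train.to_csv("train.csv", index=False)
test.to_csv("test.csv", index=False)
```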
In general I recommend using the grid search, but one might want to test out specific classifiers, so I made this script for that case. It is called from the terminal in the following way, but I do encourage the reader to check out the script as well:
$ python classify.py --classifier nb
which should print out:
Performance using the classifier nb, was found to be:
acc_train: 0.9594
acc_test: 0.9312
Naturally you can choose many more options. These include:

- `--classifier`: the classifier to be used. Options include:
  - `nb`: naive Bayes
  - `rf`: random forest
  - `ab`: AdaBoost
  - `xg`: XGBoost. Note that this requires an external package which needs to be installed
  - `en`: elastic net
- `--use_tfidf`: should the preprocessing use tf-idf (the alternative is bag-of-words). Default is `True`
- `--lowercase`: should the text be lowercased. Default is `True`
- `--binary`: should the text be binarized, i.e. only detect whether a word (or n-gram) appears rather than using counts. Default is `False`
- `--ngrams`: the n-grams used. Options include `unigram`, `bigram`, `trigram` and `4gram`. Default is `bigram`. Note that it also uses lower-level n-grams, e.g. bigram also uses unigrams.
- `--min_wordcount`: the minimum word count required for a word to be included in the model. Default is `2`
- `--max_vocab`: the maximum desired vocabulary size, chosen among the most frequent words which aren't removed by other conditions. Default is `None`, i.e. no maximum size is set
- `--max_perc_occurance`: a word is removed if it appears in X percent of the documents. Can be considered a corpus-specific stopword list. Default is `0.5`
- `--clf_args`: arguments to be passed to the classifier, given as a dictionary. Default is `{}`, an empty dictionary
- `--scoring`: the scoring method of the classifier. Default is `"accuracy"`; for a full list see this site under 'classification'. Recommended options include `'roc_auc'`, `'f1'` and `'balanced_accuracy'`.

The sketch below shows roughly how these options map onto a preprocessing pipeline.
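This is a hypothetical sketch of how the options could map onto a scikit-learn pipeline; the parameter mapping mirrors the argument descriptions above, not the actual script:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("vect", CountVectorizer(
        lowercase=True,       # --lowercase
        binary=False,         # --binary
        ngram_range=(1, 2),   # --ngrams bigram (includes unigrams)
        min_df=2,             # --min_wordcount
        max_features=None,    # --max_vocab
        max_df=0.5,           # --max_perc_occurance
    )),
    ("tfidf", TfidfTransformer()),  # dropped when --use_tfidf is False
    ("clf", MultinomialNB()),       # --classifier nb; --clf_args unpacks here
])
```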
This is the meat of the scripts. Beware of overdoing it: if you fit all the models at once, this will take a long time.
The simplest use case is:
$ python grid_search.py --clfs nb ab
Which will print:
Calling grid search with the arguments:
data: train.csv
clfs: ['nb', 'ab']
resampling: [None]
grid_search_clf: True
grid_seach_vectorization: True
cv: 5
text_column: text
label_column: labels
Fitting 5 folds for each of 24 candidates, totalling 120 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done 26 tasks | elapsed: 38.1s
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed: 2.8min finished
Fitting 5 folds for each of 48 candidates, totalling 240 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done 26 tasks | elapsed: 2.0min
[Parallel(n_jobs=-1)]: Done 176 tasks | elapsed: 14.3min
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed: 21.5min finished
The grid search is completed the results were:
Using the resampling method: None
The best fit of the clf: nb, obtained a score of 0.9082, with the parameters:
tfidf__use_idf = True
vect__binary = True
vect__lowercase = True
vect__max_df = 0.5
vect__min_df = 2
vect__ngram_range = (1, 2)
The best fit of the clf: ab, obtained a score of 0.9127, with the parameters:
clf__base_estimator = DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
max_depth=2, max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=None, splitter='best')
tfidf__use_idf = True
vect__binary = True
vect__lowercase = True
vect__max_df = 0.5
vect__min_df = 2
vect__ngram_range = (1, 4)
This is a bit much, so let's break it down. The first part simply shows which arguments have been given and how the function is being called. Note that if the number of fits is extremely high (the number of fits is the number of candidate parameter combinations times the number of folds, e.g. 24 candidates × 5 folds = 120 fits above), it is probably better to reduce the amount of testing, e.g. by only using the grid search for the vectorization or for the classifiers, or by only using 3 cross-validation folds. The last part shows the results of the grid search, i.e. which parameters were best for each model. We will look into this in the next section.
You will see that this doesn't use `test.csv`, but you can test the best configuration using `classify.py` as described above (see the example in the next section).
The possible arguments are:

- `--data`: the desired data. Default is `train.csv`, i.e. it can typically be left out if the data has already been split using `create_test_train.py`
- `--clfs`: the classifiers to use, same options as in `classify.py`. Can be multiple.
- `--grid_search_clf`: should the classifiers' hyperparameters be grid searched. Default is `True`
- `--grid_seach_vectorization`: should the vectorization hyperparameters be grid searched. Vectorization is the method of turning the text into a bag-of-words and/or tf-idf representation. Default is `True`.
- `--cv`: the number of cross-validation folds. Default is `5`
- `--resampling`: same as in `create_test_train.py`, but can be a list. See the following code for an example. Note that this is only relevant if the train/test data isn't already resampled
- `--text_column`: same as in `create_test_train.py`
- `--label_column`: same as in `create_test_train.py`
- `--scoring`: the scoring method of the classifier. Default is `"accuracy"`; for a full list see this site under 'classification'. Recommended options include `'roc_auc'`, `'f1'` and `'balanced_accuracy'`.
- `--sampling_strategy`: (NEW) the degree of resampling, specified as the desired proportion of the minority to the majority class. I.e. 1 means that you want an equal amount of minority and majority data, while 0.5 means that you want 0.5 minority datapoints for each majority datapoint (a 2:1 ratio); see the sketch below.
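Assuming the resampling is built on imbalanced-learn (as sketched earlier), `--sampling_strategy` plausibly corresponds to the package's `sampling_strategy` parameter:

```python
from imblearn.over_sampling import RandomOverSampler

# A float sampling_strategy is the desired minority/majority ratio:
# 0.5 oversamples the minority class until there is one minority
# datapoint for every two majority datapoints (a 2:1 ratio).
ros = RandomOverSampler(sampling_strategy=0.5)
```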
The advanced use case is:

python grid_search.py --clfs nb en rf ab xg --grid_search_clf True --grid_seach_vectorization True --cv 5

Note that this uses the train data; you can additionally pass `--resampling` to compare resampling methods, which naturally assumes that the train data is not already resampled. Basically, this is a way to test the influence of the resampling and which approach works best. Warning: this will take a long time to run, as it runs through all the variations I have specified for each model combined with each method of vectorization. This is maybe a bit too exhaustive, as the best method of vectorization is probably going to be similar for similar models.
Lastly, the variables used for the grid search are hidden within the script and are defined as follows (see the sketch after this list for how they are used):

- `search_params_vect`: the search parameters for the vectorization
- `search_params_tfidf`: the search parameters for the tf-idf (e.g. whether or not it should use idf)
- `search_params_clf`: the search parameters for the classifiers

You are free to change these in the script; however, note that the size of the grid search increases exponentially, so I recommend adding things slowly and removing things if runs take too long. There is a lot of experimentation to do here, which is likely to influence results.
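As an illustration of how such parameter dictionaries feed into a grid search (the actual grids in grid_search.py may well differ; check the script for the real values):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipe = Pipeline([("vect", CountVectorizer()),
                 ("tfidf", TfidfTransformer()),
                 ("clf", MultinomialNB())])

# The step name before the double underscore routes each parameter to the
# corresponding pipeline step, hence the vect__/tfidf__/clf__ prefixes.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],  # cf. search_params_vect
    "tfidf__use_idf": [True, False],        # cf. search_params_tfidf
    "clf__alpha": [0.1, 1.0],               # cf. search_params_clf
}
# 2 * 2 * 2 = 8 candidates; with cv=5 that is already 40 fits.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
```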
This uses `classify.py` as mentioned above, together with the output from the first chunk of code for the grid search. We tested two models in the code above; let's check how they perform on the test data. Starting with `nb`, the output was:
The best fit of the clf: nb, obtained a score of 0.9082, with the parameters:
tfidf__use_idf = True
vect__binary = True
vect__lowercase = True
vect__max_df = 0.5
vect__min_df = 2
vect__ngram_range = (1, 2)
All of these are measures of the vectorization, as the naive Bayes classifier has no hyperparameters to optimize using grid search. This can also be seen from the prefixes (`vect__*` and `tfidf__*`). These results mean that the best results were found with the following setup:
- using tf-idf (`tfidf__use_idf = True`)
- using binary word counts (`vect__binary = True`)
- using lowercase (`vect__lowercase = True`)
- removing words which appeared in 50% of the documents (`vect__max_df = 0.5`), which corresponds to `max_perc_occurance` in the `classify.py` arguments
- removing words which only appear once in the training data (`vect__min_df`) (note that this is the training data in the cross-validated dataset, i.e. a subset of `train.csv`)
- the best performance was seen using bigrams (`vect__ngram_range = (1, 2)`)
Note that not all options are checked. To see exactly which ones are checked you will have to examine the script, and it might be ideal to change these to explore different things.
The way to check this using `classify.py` would then be:
python classify.py --classifier nb --max_perc_occurance 0.5 --binary True --lowercase True --use_tfidf True --ngrams bigram --min_wordcount 2
Let's also examine the case of the AdaBoost:
The best fit of the clf: ab, obtained a score of 0.9127, with the parameters:
clf__base_estimator = DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
max_depth=2, max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=None, splitter='best')
tfidf__use_idf = True
vect__binary = True
vect__lowercase = True
vect__max_df = 0.5
vect__min_df = 2
vect__ngram_range = (1, 4)
You can see that the only argument tested for the classifier was the base estimator (the only one with the `clf__*` prefix). The printout is a bit long, but it essentially says that the best base estimator tested was a decision tree with a depth of 2 (the ones tested were decision trees with a depth of 1 and 2).

For the remainder we see that it performed best with 4-grams and otherwise the same as above. This could be written as:
python classify.py --classifier ab --max_perc_occurance 0.5 --binary True --lowercase True --use_tfidf True --ngrams 4gram --min_wordcount 2 --clf_args '{"base_estimator":"DecisionTreeClassifier(max_depth=2)"}'
The syntax for the `--clf_args` argument is a bit unwieldy, but it is similar to JSON and Python dictionaries. Feel free to let me know if there are any issues. For comparison, the sketch below shows the equivalent configuration in plain scikit-learn.
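Here is the same configuration written directly in scikit-learn. Note that `base_estimator` is the parameter name in the scikit-learn version used here (as in the output above); newer releases have renamed it to `estimator`:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# The best grid search result above: AdaBoost with depth-2 decision trees.
clf = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2))
```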
Final note

If you want to adjust the scripts further, you can create your own script and import the functions. Such an example could e.g. look like:
from create_test_train import split_to_train_test
# now it is possible to call the imported function just as you would with any package:
split_to_train_test(...)
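Assuming the function mirrors the command-line arguments (an assumption; check the actual signature in `create_test_train.py`), such a call might look like:

```python
from create_test_train import split_to_train_test

# Hypothetical call: the parameter names below mirror the CLI flags and are
# not verified against the actual function signature.
split_to_train_test(
    data="test_data.csv",
    text_column="text",
    label_column="labels",
    perc_test=0.3,
    resample="over",
)
```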