The purpose of this repository is to explore text classification methods for classifying educational forum posts. This repo accompanies the paper: Which Hammer should I Use? A Systematic Evaluation of Approaches for Classifying Educational Forum Posts. We kindly ask you to cite this paper when using this repo.
We also include the Stanford forum post dataset: stanfordMOOCForumPostsSet.tar.gz. This dataset was originally created and labelled at Stanford, with funding from the National Science Foundation, to support educational research. The dataset can be used for research purposes.
Forum post classification is a long-standing task in the field of educational research. To ease the effort and aid future research, we provide implementations of commonly used ML (Machine Learning) and DL (Deep Learning) models. (Note: the DL code is modified from https://github.com/zackhy/TextClassification, and the text preprocessing partially uses code from https://medium.com/@bedigunjit/.)
- Naive Bayes: ml_classifiers.nb_clf
- Logistic regression: ml_classifiers.lr_clf
- Random forest: ml_classifiers.rf_clf
- Support vector machine: ml_classifiers.svm_clf
- CLSTM: clstm_classifier
- BLSTM: rnn_classifier
- Python 3.x
- Tensorflow > 1.5
- Sklearn > 0.19.0
ML code is contained in Traditional_Machine_Learning.ml_classifiers
- Create the configuration: config = dict(); config['testSize'] = 0.2; config['file'] = 'xxx.csv'
- Initialise the base classifier: classifier = ml_clf(config)
- Create a Naive Bayes classifier: classifier.nb_clf()
- Create an SVM classifier: classifier.svm_clf()
- Create a Logistic regression classifier: classifier.lr_clf()
- Create a Random forest classifier: classifier.rf_clf()
- To perform a simple grid search with pre-defined parameters: grid = GridSearchCV(YOUR_MODEL, YOUR_SEARCH_PARAMS, refit=True, verbose=2); grid.fit(self.Train_X, self.Train_Y); print(grid.best_estimator_)
- To run a model with specific hyperparameters, replace the model constructor call and add the parameters. E.g., replace rfc=RandomForestClassifier() with: rfc=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features=0.5, max_leaf_nodes=None, min_impurity_split=1e-07, min_samples_leaf=1, min_samples_split=4, min_weight_fraction_leaf=0.0, n_estimators=250, n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False). Note that min_impurity_split was removed in scikit-learn 1.0, so omit it on newer versions.
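The grid search step above can be sketched as a standalone script. This is only an illustration under assumptions: the toy posts, the labels, the TF-IDF features, and the parameter grid are all made up for the example and are not taken from this repo, whose own feature extraction lives inside ml_clf.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical toy data standing in for the forum post CSV (1 = question, 0 = other).
posts = [
    "I am confused about question 3, can anyone help?",
    "Why is my quiz answer marked wrong?",
    "How do I submit the assignment?",
    "Can someone explain gradient descent?",
    "What is the deadline for homework 2?",
    "Where can I find the lecture slides?",
    "Is the final exam open book?",
    "Does anyone know how grading works?",
    "Great lecture this week, thanks to the staff.",
    "I really enjoyed the guest speaker.",
    "Hello everyone, I am joining from Brazil.",
    "The course has been fantastic so far.",
    "Thanks for the detailed feedback.",
    "Sharing my notes from week 4 here.",
    "Happy to be part of this community.",
    "The new video format works well for me.",
]
labels = [1] * 8 + [0] * 8

# Hold out a test split, mirroring config['testSize'] in spirit.
train_posts, test_posts, train_y, test_y = train_test_split(
    posts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Assumed feature extraction: TF-IDF fitted on the training posts only.
vectorizer = TfidfVectorizer()
train_X = vectorizer.fit_transform(train_posts)
test_X = vectorizer.transform(test_posts)

# Small illustrative grid; refit=True retrains the best setting on all training data.
search_params = {"n_estimators": [10, 50], "min_samples_split": [2, 4]}
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    search_params,
    refit=True,
    cv=3,
    verbose=0,
)
grid.fit(train_X, train_y)

print(grid.best_estimator_)
predictions = grid.predict(test_X)
```

The parameters reported in grid.best_params_ can then be copied into the model constructor, as described in the last bullet above.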
We used bert-as-service (https://github.com/hanxiao/bert-as-service) to generate BERT embeddings of the forum posts. The embeddings are then used as input to the DL models.
The DL code is modified from this repo: https://github.com/zackhy/TextClassification.
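As a rough sketch of the embedding step, the helper below encodes posts in batches through any client object that exposes an encode(list_of_str) method, which is the shape of the bert-as-service client call. The helper name embed_posts, the batch size, and the StubClient class are hypothetical and exist only so the example runs without a BERT server.

```python
from typing import List, Sequence


def embed_posts(posts: Sequence[str], client, batch_size: int = 32) -> List[List[float]]:
    """Encode posts in batches with any client exposing encode(list_of_str)."""
    vectors: List[List[float]] = []
    for start in range(0, len(posts), batch_size):
        batch = list(posts[start:start + batch_size])
        # The client returns one vector (row) per post in the batch.
        for row in client.encode(batch):
            vectors.append([float(x) for x in row])
    return vectors


class StubClient:
    """Hypothetical stand-in for a real embedding client (not BERT)."""

    def encode(self, texts):
        # Two toy features per post: character count and word count.
        return [[float(len(t)), float(len(t.split()))] for t in texts]


posts = [
    "How do I submit homework 2?",
    "Great lecture, thanks!",
    "Is the exam open book?",
]
embeddings = embed_posts(posts, StubClient(), batch_size=2)
print(len(embeddings), len(embeddings[0]))
```

With a bert-as-service server running, the stub would be swapped for the real client, i.e. BertClient from bert_serving.client, and the resulting vectors passed to the DL models.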
- The model is in xxx_classifier.py
- Run python train.py to train the DL model
- Run python test.py to test the DL model