A Python repository for educational dataset balancing, as applied in the paper submitted to @todo.
Download this repository with `git clone` or equivalent:

```shell
git clone @repo
```
- Python 3.8
- TensorFlow > 1.5
- tensorflow-estimator 2.7.0
- tensorflow-macos 2.7.0
- tensorflow-metal 0.3.0
- scikit-learn > 0.19.0
We detail below how to implement dataset balancing, hard bias (H-bias) evaluation, and fairness evaluation. See the example code in data_balancing/DbtExample.py.
Dataset balancing is implemented in DbtExample.py by the function cbt(self, X, Y, G), where X is the input features, Y the input labels, and G the input demographics. To re-sample with the traditional class-balancing strategy, simply apply SMOTE from the imblearn.over_sampling package: X, Y = SMOTE().fit_resample(X, Y), or any other class-balancing technique such as BorderlineSMOTE or NearMiss. To re-sample with demographics, add the demographic attribute to the class parameter and treat the balancing as a multi-class balancing problem, i.e., GY = G.astype(str) + Y.astype(str) and then X, GY = SMOTE().fit_resample(X, GY).
After generating samples, we evaluate the kDN distribution with distance.jensenshannon. The H-bias can be calculated with the calKDN function in DbtExample.py.
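A minimal sketch of this evaluation, assuming the standard k-Disagreeing Neighbors (kDN) definition (the repository's calKDN may differ in details) and using scipy's Jensen-Shannon distance over shared histogram bins; the data and bin choice are illustrative:

```python
import numpy as np
from scipy.spatial import distance
from sklearn.neighbors import NearestNeighbors

def kdn(X, Y, k=5):
    """kDN: for each instance, the fraction of its k nearest neighbors
    carrying a different label (a standard instance-hardness measure)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    neigh = idx[:, 1:]                       # drop the point itself
    return (Y[neigh] != Y[:, None]).mean(axis=1)

# Illustrative original vs. re-sampled data (stand-ins, not the paper's data).
rng = np.random.default_rng(0)
X_orig = rng.normal(size=(200, 4))
Y_orig = (X_orig[:, 0] > 0).astype(int)
X_res = X_orig + rng.normal(scale=0.1, size=X_orig.shape)
Y_res = Y_orig

# Compare the two kDN distributions via Jensen-Shannon distance.
bins = np.linspace(0, 1, 11)
p, _ = np.histogram(kdn(X_orig, Y_orig), bins=bins, density=True)
q, _ = np.histogram(kdn(X_res, Y_res), bins=bins, density=True)
js = distance.jensenshannon(p, q)
print(js)
```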
We used the abroca package for ABROCA. A sample ABROCA calculation:

```python
slice = compute_abroca(abrocaDf,
                       pred_col='prob_1',
                       label_col='label',
                       protected_attr_col='gender',
                       majority_protected_attr_val='2',
                       compare_type='binary',  # binary, overall, etc.
                       n_grid=10000,
                       plot_slices=False)
```
We describe below the detailed dataset and model implementation.
The forum dataset is included as Moodle Forum de-identified Embeddings. The CNN-LSTM model was implemented using a TensorFlow repo modified from here, and the BERT embeddings of the input text were generated using bert-as-service. We set the input layer dimension to 768 with a sigmoid output layer, and the L2 regularizer lambda to 0.001. For the CNN network, 128 convolution filters with a width of 5 were used. For the LSTM network, 128 hidden states and 128 cell states were used. During training, the one-cycle policy was used with a batch size of 32, 50 training epochs, a maximum learning rate of 2e-05, and a dropout probability of 0.5. Shuffling was performed at the end of each epoch, with an early-stopping mechanism. After each epoch, 10% of the training data was chosen at random as validation data, and the best model was selected based on the validation error.
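The hyperparameters above can be sketched as a Keras model. This is our own reconstruction, not the paper's code: the sequence length and the exact CNN-to-LSTM wiring are assumptions, and the one-cycle schedule and early stopping are omitted for brevity.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

SEQ_LEN = 128                     # assumed maximum sequence length (not stated above)
l2 = regularizers.l2(0.001)       # L2 regularizer lambda = 0.001

model = models.Sequential([
    # Input: sequences of 768-dim BERT embeddings.
    layers.Input(shape=(SEQ_LEN, 768)),
    # CNN: 128 convolution filters with a width of 5.
    layers.Conv1D(128, 5, activation='relu', kernel_regularizer=l2),
    # LSTM with 128 units (128 hidden states and 128 cell states).
    layers.LSTM(128, kernel_regularizer=l2),
    # Dropout probability of 0.5.
    layers.Dropout(0.5),
    # Sigmoid output layer for binary classification.
    layers.Dense(1, activation='sigmoid', kernel_regularizer=l2),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-05),
              loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
```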
The KDDCUP dataset is linked in KDDCUP. We implemented this model following Feng. In line with the original implementation, the model had 32*32 deep layers with the dropout ratio set to 0.5, and 32 convolution filters with a width of 8. During training, the learning rate was set to 0.001 with the Adam optimizer, 10 epochs, a batch size of 256, and a log-loss objective. After each epoch, 10% of the training data was chosen at random to serve as validation data, and the best model was selected based on the validation error.
The Open University dataset is linked in Open University. The source code is provided in OUA. For the traditional ML model, we applied the GridSearchCV package to find the best model hyperparameters to optimise the performance.
A sample grid-searched model:

```python
rfc = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                             max_depth=None, max_features=0.5, max_leaf_nodes=None,
                             min_impurity_split=1e-07, min_samples_leaf=1,
                             min_samples_split=4, min_weight_fraction_leaf=0.0,
                             n_estimators=250, n_jobs=1, oob_score=False, random_state=None,
                             verbose=0, warm_start=False)
```
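A model like the one above would be produced by a search along these lines. The data and the parameter grid here are purely illustrative; the actual search space used in the paper is not stated above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy data standing in for the OULAD features (shapes are illustrative).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical grid over a few RandomForest hyperparameters.
param_grid = {
    'n_estimators': [100, 250],
    'max_features': [0.5, 'sqrt'],
    'min_samples_split': [2, 4],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, n_jobs=-1)
search.fit(X, y)

print(search.best_params_)       # best hyperparameter combination found
rfc = search.best_estimator_     # refit model, analogous to the one above
```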