A Python repository for educational dataset balancing, as applied in the paper submitted to @todo.
Download this repository with `git clone` or equivalent:

```shell
git clone @repo
```
- Python 3.8
- TensorFlow > 1.5
- tensorflow-estimator 2.7.0
- tensorflow-macos 2.7.0
- tensorflow-metal 0.3.0
- scikit-learn > 0.19.0
We detail below how to implement dataset balancing, hard bias (H-bias) evaluation, and fairness evaluation. See the example code in data_balancing/DbtExample.py.
Dataset balancing is implemented in DbtExample.py by the function cbt(self, X, Y, G), where X is the input features, Y the input labels, and G the input demographics. To re-sample with the traditional class-balancing strategy, simply apply SMOTE from the imblearn.over_sampling package: X, Y = SMOTE().fit_resample(X, Y), or any other class-balancing technique such as BorderlineSMOTE or NearMiss. To re-sample with demographics, add the demographic attribute to the class parameter and treat the balancing as a multi-class balancing problem, i.e., GY = G.astype(str) + Y.astype(str) and then X, GY = SMOTE().fit_resample(X, GY).
After generating samples, we evaluate the kDN distribution with distance.jensenshannon. The H-bias can be calculated with the calKDN function in DbtExample.py.
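A minimal sketch of this evaluation, assuming the standard k-Disagreeing Neighbors (kDN) definition (the repository's calKDN may differ in details) and using scipy's Jensen-Shannon distance over shared histogram bins; the data and bin choice are illustrative:

```python
import numpy as np
from scipy.spatial import distance
from sklearn.neighbors import NearestNeighbors

def kdn(X, Y, k=5):
    """kDN: for each instance, the fraction of its k nearest neighbors
    carrying a different label (a standard instance-hardness measure)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    neigh = idx[:, 1:]                       # drop the point itself
    return (Y[neigh] != Y[:, None]).mean(axis=1)

# Illustrative original vs. re-sampled data (stand-ins, not the paper's data).
rng = np.random.default_rng(0)
X_orig = rng.normal(size=(200, 4))
Y_orig = (X_orig[:, 0] > 0).astype(int)
X_res = X_orig + rng.normal(scale=0.1, size=X_orig.shape)
Y_res = Y_orig

# Compare the two kDN distributions via Jensen-Shannon distance.
bins = np.linspace(0, 1, 11)
p, _ = np.histogram(kdn(X_orig, Y_orig), bins=bins, density=True)
q, _ = np.histogram(kdn(X_res, Y_res), bins=bins, density=True)
js = distance.jensenshannon(p, q)
print(js)
```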
We used the abroca package for ABROCA. A sample ABROCA calculation:

```python
slice = compute_abroca(abrocaDf,
                       pred_col='prob_1',
                       label_col='label',
                       protected_attr_col='gender',
                       majority_protected_attr_val='2',
                       compare_type='binary',  # binary, overall, etc.
                       n_grid=10000,
                       plot_slices=False)
```
We describe below the detailed dataset and model implementation.
The forum dataset is included as Moodle Forum de-identified Embeddings. The CNN-LSTM model was implemented using a TensorFlow repo modified from here, and the BERT embeddings of the input text were generated using bert-as-service. We set the input layer dimension to 768 with a sigmoid output layer, and the L2 regularizer lambda to 0.001. For the CNN network, 128 convolution filters with a width of 5 were used. For the LSTM network, 128 hidden states and 128 cell states were used. During training, the one-cycle policy was used with a batch size of 32, 50 training epochs, a maximum learning rate of 2e-05, and a dropout probability of 0.5. Shuffling was performed at the end of each epoch, with an early-stopping mechanism. After each epoch, 10% of the training data was chosen at random as validation data, and the best model was selected based on the validation error.
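The hyperparameters above can be sketched as a Keras model. This is our own reconstruction, not the paper's code: the sequence length and the exact CNN-to-LSTM wiring are assumptions, and the one-cycle schedule and early stopping are omitted for brevity.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

SEQ_LEN = 128                     # assumed maximum sequence length (not stated above)
l2 = regularizers.l2(0.001)       # L2 regularizer lambda = 0.001

model = models.Sequential([
    # Input: sequences of 768-dim BERT embeddings.
    layers.Input(shape=(SEQ_LEN, 768)),
    # CNN: 128 convolution filters with a width of 5.
    layers.Conv1D(128, 5, activation='relu', kernel_regularizer=l2),
    # LSTM with 128 units (128 hidden states and 128 cell states).
    layers.LSTM(128, kernel_regularizer=l2),
    # Dropout probability of 0.5.
    layers.Dropout(0.5),
    # Sigmoid output layer for binary classification.
    layers.Dense(1, activation='sigmoid', kernel_regularizer=l2),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-05),
              loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
```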
The KDDCUP dataset is linked in KDDCUP. We implemented this model following Feng. In line with the original implementation, the model had 32*32 deep layers with the dropout ratio set to 0.5, and 32 convolution filters with a width of 8. During training, the learning rate was set to 0.001 with the Adam optimizer, 10 epochs, a batch size of 256, and a log-loss objective. After each epoch, 10% of the training data was chosen at random to serve as validation data, and the best model was selected based on the validation error.
The Open University dataset is linked in Open University. The source code is provided in OUA. For the traditional ML model, we applied the GridSearchCV package to find the best model hyperparameters to optimise the performance.
A sample grid-searched model:

```python
rfc = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                             max_depth=None, max_features=0.5, max_leaf_nodes=None,
                             min_impurity_split=1e-07, min_samples_leaf=1,
                             min_samples_split=4, min_weight_fraction_leaf=0.0,
                             n_estimators=250, n_jobs=1, oob_score=False, random_state=None,
                             verbose=0, warm_start=False)
```
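A model like the one above would be produced by a search along these lines. The data and the parameter grid here are purely illustrative; the actual search space used in the paper is not stated above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy data standing in for the OULAD features (shapes are illustrative).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical grid over a few RandomForest hyperparameters.
param_grid = {
    'n_estimators': [100, 250],
    'max_features': [0.5, 'sqrt'],
    'min_samples_split': [2, 4],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, n_jobs=-1)
search.fit(X, y)

print(search.best_params_)       # best hyperparameter combination found
rfc = search.best_estimator_     # refit model, analogous to the one above
```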