bobowang2333 / juliaCode


README

We implemented overhead-aware learning for fair and approximately optimal decision tree classifiers, written in Julia 1.5.1 together with the Gurobi solver 9.0.3.

Installation

First, download and install Julia and start a Julia session. From Julia, JuMP is installed via the built-in package manager:

import Pkg
Pkg.add("JuMP")

We also use the Gurobi solver with a JuMP model, as follows:

import Pkg
Pkg.add("Gurobi")
using JuMP
using Gurobi
model = Model(Gurobi.Optimizer)

Gurobi offers a free academic license; you only need to register with your educational account.
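
To check that JuMP and Gurobi are wired up correctly, you can solve a minimal sanity-check model (not part of this repository) from the Julia prompt:

using JuMP, Gurobi

# Minimal sanity check: maximize x subject to 0 <= x <= 1.
model = Model(Gurobi.Optimizer)
@variable(model, 0 <= x <= 1)
@objective(model, Max, x)
optimize!(model)
println(value(x))   # prints 1.0 if the toolchain works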

Running Instruction

Here we document the instructions for learning a fair decision tree by iteratively invoking the solver. We also compare against the greedy algorithm (CART) for learning decision trees.

Dataset Preprocessing

The input datasets are fairly arbitrary: values may range over boolean, real, and categorical domains. We transform the original dataset into a boolean dataset using two processing steps. First, for categorical values, we adopt one-hot encoding to replace the column with multiple boolean columns. Second, for real-valued features, we require the user to specify a configuration file giving the split thresholds for each continuous-valued feature.
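
As intuition for the first step (the transDataset.py command below performs the real work), here is a minimal Julia sketch of one-hot encoding; the function name is hypothetical:

# Sketch only: expand a categorical column into one boolean column per category.
function one_hot(col::Vector{String})
    cats = sort(unique(col))
    return [v == c ? 1 : 0 for v in col, c in cats]   # n x |cats| boolean matrix
end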

python ./Dataset/transDataset.py original_dataset configuration_file sensitive_index 0 thres

After executing the above command, it outputs the booleanized dataset under the name boolean.data. The following shows an example configuration file (for the German dataset):

1 12 24 36 48
4 3000 6000 9000
12 20 40 60

Configuration file format: each line corresponds to one feature. The first number denotes the index of that feature (e.g., feature 4), and the remaining values (e.g., 3000, 6000, 9000) are the thresholds used to split that feature's column.
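
For intuition, the thresholding step can be sketched in Julia as follows (the function name is hypothetical; the real logic lives in transDataset.py):

# Sketch only: given thresholds t1 < t2 < ..., replace a real-valued column
# with one boolean column per threshold (1 iff the value exceeds the threshold).
function booleanize(col::Vector{Float64}, thresholds::Vector{Float64})
    return [v > t ? 1 : 0 for v in col, t in thresholds]
end

# e.g. the config line "4 3000 6000 9000" corresponds to
# booleanize(col4, [3000.0, 6000.0, 9000.0])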

Next, we generate a CSV file (with header) and sample part of the dataset as our training set, taking boolean.data as input.

python ./Dataset/data2CSV.py boolean.data output_CSV sample_para test_CSV

In the above command, if sample_para is 0, the data is not sampled and the original dataset is written out in CSV form. If sample_para is 1, the data is sampled down to 1/50 of its size (50 is the current setting and can be modified).
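
The sampling behavior can be pictured as follows; this is a sketch assuming uniform random sampling, and data2CSV.py remains authoritative:

using Random

# Sketch only: keep a random 1/50 of the rows, mimicking sample_para = 1.
rows = readlines("boolean.data")
keep = max(1, length(rows) ÷ 50)
sampled = rows[randperm(length(rows))[1:keep]]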

MIP-Iter Running

This section describes how to encode decision tree learning in a mixed integer programming framework and run the encoding iteratively to select the splitting features. Here we name the training dataset test1.csv before running the script.

./runEncodingHighNew.sh -f 1 -r RunningPath -j juliaBinary -s senID -n nlevel -N featureNum

In the above, -f denotes the index of the dataset from which we start to build the tree. For instance, since we use test1.csv as the training dataset, we input -f 1. As the learner keeps splitting the tree, the bash script automatically generates test2.csv, test3.csv, etc. -j denotes the path to the Julia binary, -s the index of the sensitive feature, and -n the number of tree levels in each MIP iteration.

para  Description
-f    the index of the training dataset (the start)
-j    the path of the Julia binary
-s    the index of the sensitive feature
-n    the number of tree levels used in the encoding
-N    the number of features in the dataset
-r    the running path, usually the path of the dataset

-r indicates the running path, which should include the following files:
File Name          Description
countCSV.py        calculates dataset statistics used to generate the encoding
genJulia           binary that takes the statistics of the original dataset as input and outputs the MIP encoding as a Julia file named level.jl
readRes            binary that reads the output of the Gurobi solver and outputs the splitting-feature information
splitGroupHigh.py  splits the current dataset on the chosen splitting feature at each iteration
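
To make the per-iteration MIP concrete, here is a toy depth-one ("stump") version of the encoding in JuMP. This is a sketch under simplifying assumptions, not the model genJulia emits: X is a boolean feature matrix, y a boolean label vector, and sen the sensitive feature index.

using JuMP, Gurobi

# Toy sketch of one iteration: pick a single splitting feature that minimizes
# misclassifications while never splitting on the sensitive feature.
function stump_mip(X::Matrix{Int}, y::Vector{Int}, sen::Int)
    n, p = size(X)
    model = Model(Gurobi.Optimizer)
    @variable(model, f[1:p], Bin)      # f[j] = 1 iff feature j is chosen
    @variable(model, e[1:n], Bin)      # e[i] = 1 iff example i is misclassified
    @constraint(model, sum(f) == 1)    # choose exactly one splitting feature
    @constraint(model, f[sen] == 0)    # forbid splitting on the sensitive feature
    for i in 1:n
        pred = sum(f[j] * X[i, j] for j in 1:p)   # predicted label of example i
        @constraint(model, e[i] >= pred - y[i])
        @constraint(model, e[i] >= y[i] - pred)
    end
    @objective(model, Min, sum(e))
    optimize!(model)
    return findfirst(j -> value(f[j]) > 0.5, 1:p)  # index of the chosen feature
end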

The following command is an example invocation of the script.

./runEncodingHighNew.sh -f 1 -r ~/Downloads/juliaCode/Dataset/germany -j /Applications/Julia-1.5.app/Contents/Resources/julia/bin -s 13 -n 2 -N 20

The above command also outputs the result file treeRes.txt, which documents the mapping from each branch to its splitting feature. After the run finishes, we can analyze the learner's result (treeRes.txt) for accuracy and fairness.
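
The exact format of treeRes.txt is defined by the repository's scripts; assuming it parses into a map from branch path to splitting feature, classification is a root-to-leaf walk. The sketch below uses hypothetical conventions:

# Sketch only: branches maps a branch path such as "" (root), "0", "01"
# to the feature index tested at that node; x is a boolean feature vector.
function walk(branches::Dict{String,Int}, x::Vector{Int})
    path = ""
    while haskey(branches, path)
        feat = branches[path]
        path *= string(x[feat])   # append 0 or 1 depending on the feature value
    end
    return path                   # leaf identifier (label lookup not shown)
end

The readResQuan.py command below performs this prediction step for the whole test set: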

python readResQuan.py treeRes.txt trainSet testSet OutputCSV

The above command takes the learner's result, the training set, and the test set, and generates the predicted results for the test set in CSV form. Here, trainSet denotes the path of the training set, and likewise for testSet. The following command calculates the DTDI value to measure the discrimination level (the lower, the less discriminatory):

python getDTDI.py OutputCSV 0 1
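
The authoritative DTDI definition is in getDTDI.py; purely as an illustration of measuring a discrimination level, the following Julia sketch computes a statistical-parity-style gap between the two groups of a boolean sensitive column:

using CSV, DataFrames, Statistics

# Illustration only: statistical-parity gap, NOT necessarily the DTDI formula.
function parity_gap(csv_path::String, sen_col::Int, label_col::Int)
    df = CSV.read(csv_path, DataFrame)
    g0 = df[df[!, sen_col] .== 0, label_col]
    g1 = df[df[!, sen_col] .== 1, label_col]
    return abs(mean(g1) - mean(g0))   # lower means less discriminatory
end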

Compare with CART

We may also want to compare against the standard CART algorithm and compute its discrimination value as well, which can be achieved by running the following command. Here, we assume the input dataset is in the .data format (the same format as the German dataset), where each row is the feature vector of one example.

python getDTDI.py output_CSV senID 0

In the above command, senID represents the index of the sensitive feature. After running this command, we obtain the DTDI value of the CART-learned tree.

Quantitative Dataset

We accept not only boolean datasets but also quantitative datasets. Quantitative datasets are likewise processed in two ways. First, for categorical values, we again adopt one-hot encoding. Second, for continuous values, we normalize the feature values of each column to the range [0,1] as follows (assume we are handling feature F):

F_v = (F_v - F_min) / (F_max - F_min)

python ./Dataset/transDataset.py original_dataset 0 sensitive_index 1

Running the above command transforms the original dataset into a quantitative dataset where every feature value lies in the range [0,1].
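
The normalization formula is straightforward to express in Julia (a sketch; transDataset.py performs it for real):

# Min-max normalization of one feature column into [0, 1].
normalize01(F::Vector{<:Real}) = (F .- minimum(F)) ./ (maximum(F) - minimum(F))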

Note

Before executing the whole running script, delete any existing treeRes.txt in the running path.

You can run the following command to complete the MIP-iter run in one shot for a quantitative dataset:

./runWhole.sh -f /Users/jingbow/Downloads/juliaCode/Dataset/testScript/german.data -c /Users/jingbow/Downloads/juliaCode/Dataset/testScript/config -s 12 -r /Users/jingbow/Downloads/juliaCode/Dataset/testScript -j /Applications/Julia-1.5.app/Contents/Resources/julia/bin -N 20 -n 2 -Q 1

-f represents the path of the dataset, -c the path of the configuration file, -s the sensitive feature index, -r the directory of the dataset, -j the path of the Julia binary, -N the number of features, -n the number of tree levels in the encoding, and -Q whether to run the encoding on a quantitative (1) or boolean (0) dataset.

You can run the following command to complete the MIP-iter run in one shot for a boolean dataset:

./runWhole.sh -f /Users/jingbow/Downloads/juliaCode/Dataset/testScript/german.data -c /Users/jingbow/Downloads/juliaCode/Dataset/testScript/German.config -s 12 -r /Users/jingbow/Downloads/juliaCode/Dataset/testScript -j /Applications/Julia-1.5.app/Contents/Resources/julia/bin -N 20 -n 2 -Q 0
