RongchangZhao / CausalDA

Causal data augmentation for pretraining debiasing

Pulling Up by the Causal Bootstraps: Causal Data Augmentation for Pre-training Debiasing

Paper

If you use this code in your research, please cite the following publication: https://arxiv.org/abs/2108.12510

@article{gowda2021pulling,
  title={Pulling Up by the Causal Bootstraps: Causal Data Augmentation for Pre-training Debiasing},
  author={Sindhu C.M. Gowda and Shalmali Joshi and Haoran Zhang and Marzyeh Ghassemi},
  journal={arXiv preprint arXiv:2108.12510},
  year={2021}
}

To replicate the experiments in the paper:

Step 0: Environment and Prerequisites

Run the following commands to clone this repo and create the Conda environment:

git clone git@github.com:MLforHealth/CausalDA.git
cd CausalDA/
conda env create -f environment.yml
conda activate causalda

Step 1: Obtaining the Data

See DataSources.md for detailed instructions on setting up the WILDS and CXR datasets. The synthetic experiments do not require this step.

Step 2: Running Experiments

To train a single model, run, for example:

python train_synthetic.py \
    --type par_back_front \
    --corr-coff 0.75 \
    --test-corr 0.75 \
    --output_dir /path/to/output

or

python train.py \
    --type back \
    --data camelyon \
    --data_type Conf \
    --domains 2 3 \
    --corr-coff 0.95 \
    --seed 0 \
    --output_dir /path/to/output

To reproduce the experiments in the paper, which train grids of models, call sweep.py with an experiment name. Valid experiment names are the class names defined in experiments.py. For example:

python sweep.py launch \
    --experiment CXR \
    --output_dir /my/sweep/output/path \
    --command_launcher "local"

This command can also be run using launch_scripts/launch_exp.sh. You will likely need to update the launcher to fit your compute environment.
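To illustrate what "training a grid of models" amounts to, the following is a minimal sketch that expands a small grid into individual train.py command lines. It is not the logic of sweep.py. It uses only the flags shown in the single-model example above; the helper name build_commands, the chosen grid values, and the per-run output directory layout are assumptions made for illustration.

```python
import itertools
import shlex


def build_commands(types, corr_coffs, seeds, output_root):
    """Return one train.py command line per (type, corr-coff, seed) combination.

    Hypothetical helper: the flags mirror the single-model example in this
    README, but the grid itself and the directory naming are assumptions.
    """
    commands = []
    for run_type, corr, seed in itertools.product(types, corr_coffs, seeds):
        # Assumed layout: one output directory per configuration.
        out_dir = f"{output_root}/{run_type}_corr{corr}_seed{seed}"
        args = [
            "python", "train.py",
            "--type", run_type,
            "--data", "camelyon",
            "--data_type", "Conf",
            "--domains", "2", "3",
            "--corr-coff", str(corr),
            "--seed", str(seed),
            "--output_dir", out_dir,
        ]
        # shlex.join quotes arguments safely for a POSIX shell.
        commands.append(shlex.join(args))
    return commands


if __name__ == "__main__":
    # Illustrative grid: one type, one correlation value, three seeds.
    for cmd in build_commands(["back"], [0.95], [0, 1, 2], "/path/to/output"):
        print(cmd)
```

The printed lines can be pasted into a shell or handed to a job scheduler. For the full paper grids, sweep.py remains the intended entry point.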

Step 3: Aggregating Results

We provide sample code for creating aggregate results for an experiment in AggResults.ipynb.
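As a rough picture of what aggregation over a sweep involves, the sketch below groups per-run records by configuration and reports the mean and standard deviation across seeds. It does not reproduce AggResults.ipynb. The record fields (type, corr_coff, seed, accuracy) and the function name aggregate are assumptions, and the numbers in the usage example are placeholders, not results from the paper.

```python
from collections import defaultdict
from statistics import mean, stdev


def aggregate(records, metric="accuracy"):
    """Summarize a metric across seeds for each (type, corr_coff) configuration.

    Hypothetical record format: each record is a dict with the keys
    "type", "corr_coff", "seed", and the metric name.
    """
    groups = defaultdict(list)
    for rec in records:
        # Everything except the seed identifies a configuration.
        groups[(rec["type"], rec["corr_coff"])].append(rec[metric])

    summary = {}
    for key, values in groups.items():
        # Sample standard deviation needs at least two runs.
        spread = stdev(values) if len(values) > 1 else 0.0
        summary[key] = {"n": len(values), "mean": mean(values), "std": spread}
    return summary


if __name__ == "__main__":
    # Placeholder values only, chosen to make the arithmetic easy to check.
    runs = [
        {"type": "back", "corr_coff": 0.95, "seed": 0, "accuracy": 0.5},
        {"type": "back", "corr_coff": 0.95, "seed": 1, "accuracy": 0.7},
    ]
    for config, stats in aggregate(runs).items():
        print(config, stats)
```

How the per-run records are loaded depends on what the training scripts write to each output directory, which this sketch deliberately leaves out.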

Acknowledgements

We make use of code from the WILDS benchmark as well as from the DomainBed framework.

License

This source code is released under the MIT license, included here.
