DP-Rewrite consists of command-line tools (later APIs) for easily performing differentially private text rewriting on a given dataset. Implemented in PyTorch, our library provides models and out-of-the-box datasets for running experiments with differential privacy, which provides rigorous privacy guarantees. Concretely, we provide tools for pre-training a text rewriting model (most commonly an autoencoder), rewriting a dataset with differential privacy using this pre-trained model, and running downstream experiments on both original and rewritten datasets. Additionally, our library allows for the seamless incorporation of custom datasets and custom models into the framework.
Please use the following citation:

```bibtex
@InProceedings{Igamberdiev.2022.COLING,
  title     = {{DP-Rewrite: Towards Reproducibility and Transparency
                in Differentially Private Text Rewriting}},
  author    = {Igamberdiev, Timour and Arnold, Thomas and
               Habernal, Ivan},
  publisher = {International Committee on Computational
               Linguistics},
  booktitle = {Proceedings of the 29th International Conference
               on Computational Linguistics},
  pages     = {(to appear)},
  year      = {2022},
  address   = {Gyeongju, Republic of Korea}
}
```
```bash
$ sudo apt-get install python3-dev
$ pip install -r requirements.txt
```
Steps to running an experiment:

- Choose a mode (`pretrain`, `rewrite`, or `downstream`; see Modes below for more details).
- Choose a model. Currently only `adept` is available for pre-training and rewriting, and `bert_downstream` for downstream classification.
- Choose a desired dataset (see Custom Datasets below if you want to use your own dataset).
- Specify a name for your experiment with `--name`. This will be the name of the folder associated with your experiment, where checkpoints, configuration info, learning curves, various statistics, and rewritten data will be saved.
- Specify two main directories: (1) `--output_dir`, where the above folder will be saved, and (2) `--asset_dir`, where datasets will be loaded from. See Directories below for more information.
- Finally, you can configure additional hyperparameters for the experiment, such as `--batch_size` and `--learning_rate`. For a complete list of available arguments, run `python main.py -h`.
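For illustration, a pretraining run might look like the following sketch. The flag names for mode and model selection (`--mode`, `--model`) and all values shown are assumptions, not verified against the code; check `python main.py -h` for the exact argument names:

```bash
# Hypothetical pretraining run; flag names and values are illustrative
$ python main.py --mode pretrain --model adept --dataset atis \
    --name my_adept_pretrain \
    --output_dir /path/to/results --asset_dir /path/to/assets \
    --batch_size 32 --learning_rate 0.001
```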
For sample experiment runs, please look at the shell scripts provided in `sample_scripts`. These contain test commands covering the most important basic arguments.
Additionally, if you are on a machine with low resources, you can specify `--local True`, setting `--local_iter_size` to the desired number of iterations per epoch to run the model.
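For example (a hypothetical invocation, with the same caveats on flag names as above):

```bash
# Low-resource test run: limit each epoch to 5 iterations (values illustrative)
$ python main.py --mode pretrain --model adept --dataset atis \
    --name quick_test --local True --local_iter_size 5
```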
Please read the short introduction about the three available modes below.
There are 3 modes of experiments:

- `pretrain`: This is used for training/fine-tuning an autoencoder model on some data.
- `rewrite`: This is used for loading a checkpoint of an autoencoder model (provide the checkpoint with the param `last_checkpoint_path`) and running it over a whole dataset (= rewriting the dataset).
- `downstream`: This is used for training and/or evaluating a downstream model on some data. This can be done on the raw datasets we provide (see below) or on rewritten data (use the parameter `--dataset custom` and provide the associated path with `--custom_train_path`, as well as optionally `--custom_valid_path` and `--custom_test_path`). More information in Custom Datasets below.
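As a sketch, a rewriting run that loads a pre-trained checkpoint might look like this (the checkpoint path is a placeholder, and flag names are assumptions as noted above):

```bash
# Hypothetical rewriting run loading a pre-trained autoencoder checkpoint
$ python main.py --mode rewrite --model adept --dataset atis \
    --name my_adept_rewrite \
    --last_checkpoint_path /path/to/results/my_adept_pretrain/checkpoint.pt
```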
Optionally, it is possible to specify a pre-defined experiment with `--experiment`, which simplifies the above Quick Start process by pre-selecting the parameters associated with a given experiment (e.g. mode, model, etc.).
If a pre-defined experiment is used, any conflicting parameters specified on the command line will be overridden.
The specific list of parameters can be seen in `prepare_specific_experiment` in `utils.py`. The currently available experiments are outlined below:
| Experiment name | Description |
|---|---|
| `adept_l2norm_pretrain` | Our implementation of the original pre-training procedure used in ADePT |
| `adept_l1norm_pretrain` | Modified ADePT pre-training procedure based on https://arxiv.org/abs/2109.03175 |
| `adept_l2norm_rewrite` | Our implementation of the original rewriting procedure used in ADePT |
| `adept_l1norm_rewrite` | Modified ADePT rewriting procedure from https://arxiv.org/abs/2109.03175 |
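For instance, the first experiment in the table could be launched as follows (a minimal sketch; any additional required paths and arguments still apply):

```bash
# Pre-defined experiment: pre-selects mode, model, and related parameters
$ python main.py --experiment adept_l2norm_pretrain --name my_adept_run
```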
All available datasets will be downloaded and preprocessed automatically to fit the required CSV format.
| Dataset Name | Description | Reference |
|---|---|---|
| `imdb` | Dataset commonly used for binary sentiment classification. Split: 25,000 train, 25,000 test examples. | Original source: https://ai.stanford.edu/~amaas/data/sentiment/ <br> Fetched from: https://huggingface.co/datasets/imdb <br> Paper citation: Maas et al., 2011 |
| `atis` | Airline Travel Information Systems (ATIS) dataset, originally consisting of audio recordings and manual transcripts related to flight reservations, meant for the tasks of intent detection and slot filling. Here we include the manual transcripts and intent labels of the dataset. Split: 4,478 train, 500 development and 893 test examples. | Original source: https://github.com/Microsoft/CNTK/tree/master/Examples/LanguageUnderstanding/ATIS/Data <br> Paper citation: Dahl et al., 1994 |
| `snips_2016` | Dataset collected for the Snips personal assistant, used for an NLU benchmark from 2016. No pre-defined dataset split, 328 examples in total. | Original source: https://github.com/sonos/nlu-benchmark/tree/master/2016-12-built-in-intents <br> Fetched from: https://huggingface.co/datasets/snips_built_in_intents |
| `snips_2017` | Dataset collected from the Snips personal voice assistant, used for an NLU benchmark from 2017. Split: 13,084 train, 700 development and 700 test examples. | Original source: https://github.com/sonos/nlu-benchmark/tree/master/2017-06-custom-intent-engines <br> Paper citation: Coucke et al., 2018 |
| `drugscom_reviews_rating` | Original dataset consists of patient reviews on specific drugs and patients' related conditions, along with a rating out of 10 stars on overall satisfaction. Processed here to only include the 'review' and 'rating' columns from the original dataset. Split: 161,297 train, 53,766 test examples. | Source: https://archive.ics.uci.edu/ml/datasets/Drug+Review+Dataset+%28Drugs.com%29 <br> Paper citation: Gräßer et al., 2018 |
| `drugscom_reviews_condition` | Original dataset consists of patient reviews on specific drugs and patients' related conditions, along with a rating out of 10 stars on overall satisfaction. Processed here to only include the 'review' and 'condition' columns from the original dataset. Split: 161,297 train, 53,766 test examples. | Source: https://archive.ics.uci.edu/ml/datasets/Drug+Review+Dataset+%28Drugs.com%29 <br> Paper citation: Gräßer et al., 2018 |
| `reddit_mental_health` | Dataset consisting of posts from 28 subreddits from 2018-2020, with 15 of the subreddits being from mental health support groups. Processed here to only include the 'subreddit' and 'post' columns from the original dataset. Split: 6,467,760 train. | Source: https://zenodo.org/record/3941387 <br> Paper citation: Low et al., 2020 |
| `amazon_reviews_books` | Large dataset consisting of customer reviews on products from Amazon.com from 1995 until 2015. Here the subset 'Books_v1_00' is used, consisting of reviews in the Books product category. Processed here to only include the 'star_rating' and 'review_body' columns. Split: 10,319,090 train. | Original source: https://s3.amazonaws.com/amazon-reviews-pds/readme.html <br> Fetched from: https://huggingface.co/datasets/amazon_us_reviews |
| `wikipedia` | Currently using the Wikipedia dump from May 1, 2020. | Fetched from: https://huggingface.co/datasets/wikipedia |
| `openwebtext` | Dataset created as an open-source version of WebText from OpenAI. | Original source: https://skylion007.github.io/OpenWebTextCorpus/ <br> Fetched from: https://huggingface.co/datasets/openwebtext |
If you want to use your own dataset:

- Please specify `--dataset custom`, along with `--custom_train_path` (also optionally `--custom_valid_path` and `--custom_test_path`).
- If `--custom_valid_path` is not provided, training and validation sets will be prepared from the training path, with the split ratio according to `--train_split_ratio`.
- Datasets should be in CSV format, with one column for 'text' and optionally a second column for 'labels'. If `--prepend_labels` is True, labels will be prepended to each row's text for the `pretrain` and `rewrite` modes.
- Test data is not used for the `pretrain` mode.
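As a minimal sketch, a custom dataset CSV with both columns could look like this (the header row and example rows are illustrative; check `dataload.py` for the exact expected layout):

```csv
text,labels
"i want a flight from boston to denver",flight
"what is the cheapest fare to atlanta",airfare
```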
| Model Name | Description | Author |
|---|---|---|
| `adept` | Our reference implementation of ADePT. As the ADePT paper lacks information about the exact network structure, some architecture decisions were made by us. | Timour |
| Model Name | Description | Author |
|---|---|---|
| `bert_downstream` | A BERT model with a feedforward head (output dimension based on the number of classes in the dataset). The specific type of transformer can be specified with `--transformer_type`. | Timour |
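Putting this together, a downstream run on rewritten data might look like the following sketch (the `--transformer_type` value and all paths are illustrative, and the `--mode`/`--model` flag names are assumptions as noted earlier):

```bash
# Hypothetical downstream classification run on a rewritten dataset
$ python main.py --mode downstream --model bert_downstream \
    --dataset custom \
    --custom_train_path /path/to/rewritten_train.csv \
    --custom_valid_path /path/to/rewritten_valid.csv \
    --transformer_type bert-base-uncased
```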
It is also possible to include a custom model for the experiments. The procedure is as follows (see the sketch after this list):

- Depending on whether the model is RNN- or Transformer-based, fill out the `CustomModel_RNN` or `CustomModel_Transformer` classes, respectively. These can be found in `models/autoencoders/custom.py`.
  - Alternatively, a user-made model class can inherit from these custom classes.
- The expected inputs/outputs of the `forward` method are specified in the docstrings of these custom model classes.
  - `CustomModel_RNN`:
    - Input: Dictionary of input values with key-value pairs as follows:
      - `input_ids`: `torch.LongTensor` (`batch_size` x `seq_len`)
      - `lengths`: `torch.LongTensor` (`batch_size`)
      - `teacher_forcing_ratio`: `float`
    - Return: `torch.DoubleTensor` (`batch_size` x `seq_len-1` x `vocab_size+4`)
  - `CustomModel_Transformer`:
    - Input: Dictionary of input values with key-value pairs as follows:
      - `input_ids`: `torch.LongTensor` (`batch_size` x `seq_len`)
      - `attention_mask`: `torch.LongTensor` (`batch_size` x `seq_len`)
    - Return: `torch.DoubleTensor` (`batch_size` x `seq_len-1` x `vocab_size`)
- Any new arguments required for the model that are not currently part of the framework can be added with the command-line argument `--custom_model_arguments`, which will be available as a list under the custom model class's `self.config.custom_config_list` variable.
  - E.g. `--custom_model_arguments alpha beta gamma` --> `self.config.custom_config_list == ['alpha', 'beta', 'gamma']`
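To make the expected `forward` contract concrete, below is a minimal, self-contained sketch of a Transformer-based custom model. It inherits from plain `nn.Module` rather than the actual `CustomModel_Transformer` base class, and all layer sizes and names are illustrative; only the input/output shapes follow the specification above.

```python
import torch
import torch.nn as nn


class MyTransformerAutoencoder(nn.Module):
    """Minimal sketch of a Transformer-based custom model.

    Only the forward() input/output contract comes from the framework
    docs above; the layer sizes and structure here are illustrative.
    """

    def __init__(self, vocab_size: int, hidden_size: int = 256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=4, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.lm_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, inputs: dict) -> torch.Tensor:
        # Expected keys: 'input_ids' and 'attention_mask',
        # both of shape (batch_size, seq_len).
        input_ids = inputs["input_ids"]
        attention_mask = inputs["attention_mask"]

        hidden = self.embedding(input_ids)
        # nn.TransformerEncoder masks positions where the mask is True,
        # so invert the (1 = keep) attention mask.
        hidden = self.encoder(
            hidden, src_key_padding_mask=(attention_mask == 0)
        )

        # Drop the last position so the output matches the required
        # (batch_size, seq_len-1, vocab_size) shape, and cast to double
        # to return a torch.DoubleTensor as specified.
        return self.lm_head(hidden[:, :-1, :]).double()
```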
- `assets`: sub-folders contain the (raw) data of training datasets and embeddings.
- `models`
  - `autoencoders` contains all (PyTorch) autoencoder models that are used to privatize the data, each in its own Python file.
  - `downstream_models` contains all (PyTorch) models that tackle a downstream task, each in its own Python file. They can be used with original or rewritten data.
- `sample_scripts` contains tests for different datasets and experiments.
- `download.py` can be used to download datasets and embeddings to the assets folder.
- `dataload.py` is the main data-loading script, which prepares existing and custom datasets.
- `experiments.py` contains the three types of "experiments" and the main training/evaluation loops for the framework.
- `main.py` can be executed from the command line with arguments to run experiments (see Quick Start above).
- `preprocessing.py` contains preprocessing pipelines that transform the data according to a model's needs. Separate pipelines are implemented for Transformer- and RNN-based models.
- `settings.py` specifies all the arguments that `main.py` can be run with.
- `utils.py` contains additional helper functions and classes (e.g. `EarlyStopping`).
If you want to use the default folders in the project structure for data, logs, stats, and so on, you don't need to specify any paths for using autoencoders; you only need to specify `embed_dir_unprocessed` if you want to train/use an RNN-based model with pre-trained embeddings (`--embed_type` as `glove` or `word2vec`).
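For example (a hypothetical command; the embedding file name is a placeholder, and `--mode`/`--model` are assumed flag names as noted earlier):

```bash
# RNN-based pretraining with GloVe embeddings; point directly at the
# embedding file, not its parent folder
$ python main.py --mode pretrain --model adept --dataset atis \
    --embed_type glove \
    --embed_dir_unprocessed /path/to/glove.6B.300d.txt
```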
If you want to load the project in an IDE, it's a good idea to set `output_dir` and `asset_dir` to paths outside the project folders, as some IDEs take a very long time to index the large amounts of data that may be written into these folders.
Output directories:

- `output_dir`: Stats & logs (defaults to `<project_root>/results`)
- `dump_dir`: Directory for things that might need a lot of storage, e.g. model checkpoints & rewritten data (defaults to `output_dir`)
- In each of these directories, data will be written to a folder named after the experiment (if no name is given, the current timestamp is used)
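As a rough illustration of the resulting layout (the inner folder contents are indicative only, not an exhaustive listing):

```
<output_dir>/                   # defaults to <project_root>/results
└── <experiment name>/          # from --name, or a timestamp if none given
    ├── ...                     # stats, logs, configuration info
    └── ...                     # checkpoints & rewritten data, if dump_dir
                                #   is left at its default (output_dir)
```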
Input directories:

- `asset_dir`: Where to look for assets like datasets and processed embeddings (defaults to `<project_root>/assets`)
- `embed_dir_unprocessed`: Where to look for pre-trained word embeddings when using RNN-based models.
  - Specify the actual path of the embeddings, not the parent folder.
  - 'Unprocessed' refers to the original word embedding models, not modified for specific experiments (e.g. vocabularies subsampled based on frequency in a given dataset). The 'processed' embedding directory is automatically created under `asset_dir` during preprocessing.
- `data_dir`: If you store your data somewhere other than the other assets, you can specify the path here (defaults to `asset_dir`)

All paths should be specified as absolute paths.