Implementation of paper Unsupervised Domain Adaptation on Reading Comprehension
- Python 3.6
- Pytorch >= 1.0.0
Maybe other libraries are needed, please check when you try to run.
Totally 6 datasets are included in this paper:
- SQuAD-1.1: Training set, Dev Set
- CNN: The whole dataset
- DailyMail: The whole dataset
- NewsQA: CSV file
- CoQA: Training set Dev set
- DROP: The whole dataset
If you use raw SQuAD dataset, please add an attribute 'len' into original json file which indicates the number of samples in current json file, which should be 87599 for train set and 10570 for dev set.
If you want to use our processed datasets, please refer to processed datasets in SQuAD format and skip step 2 and 3.
We need to obtain the json files for the training set and dev set of all datasets.
The default %YOUR_PATH% and %YOUR_OUT_PATH% are both the root path of current repository
1.CNN and DailyMail: extract the question tgz file into a path, then run
python transform_dataset.py --datatype cnn --path %YOUR_PATH --output_path %YOUR_OUT_PATH%
It will generate the training and dev json file in SQuAD format.
Optional: In our paper, since these two dataset are too big, we uniformly sample a subset from these two datasets. The sample interval for CNN training set is 4, while 8 for the both sets of DailyMail. However, we keep the orignal CNN dev set.
2.NewsQA: You should put newsqa-data-v1.csv and cnn_stories.tgz into %YOUR_PATH%, then run
python transform_dataset.py --datatype newsqa --path %YOUR_PATH% --output_path %YOUR_OUT_PATH%
You will get two files newsqa_train.json and newsqa_dev.json in the output path.
3.CoQA: You need to get the training and dev set of CoQA under a path %YOUR_PATH%, then run
python transform_dataset.py --datatype coqa --path %YOUR_PATH --output_path %YOUR_OUT_PATH%
You will get two files coqa_train.json and coqa_dev.json in the output path.
4.DROP: You need to extract the files into a path %YOUR_PATH%, then run
python transform_dataset.py --datatype drop --path %YOUR_PATH% --output_path %YOUR_OUT_PATH%
You will get two files drop_train.json and drop_dev.json in the directory.
Since the sizes of these two datasets are obviously larger than other datasets, we sample a subsets as the data used in the experiment in the original paper. We use uniform sampling and the new datasets can be obtained via running
python sample_dataset.py %FILE_NAME% %OUT_FILE_NAME% %SAMPLE_RATIO%
In the paper, the sample ratio for DailyMail training and dev set is 8, while 4 for CNN training set. CNN dev set remains unchanged due to its small size.
To make unsupervised domain adaptation on reading comprehension, you need to run the supervised training on the source domain at first. An example for running training
python run.py --bert_model bert-base-uncased --do_train --do_lower_case --train_file cnn_train.json --predict_file cnn_dev.json --output_dir cnn_models --output_model_file best_model.bin --logger_path train_cnn --use_BN
'--do_train' indicates running supervised training. You can modified the parameters for --train_file, --predict_file( the dev file under this mode), --output_dir and --output_model_file under your demand. '--use_BN' means using Batch Normalization in the output layer of BERT. We run this on 2 GTX1080Ti GPU with 22GB under the default setting.
After obtain the supervised training model, you can run adaptation given the source domain training data, the target domain training data and the target domain dev data. An example is given for adaptation from CNN to CoQA using entropy weighted CASe (CASe+E).
python run.py --bert_model bert-base-uncased --do_adaptation --do_lower_case --source_train_file cnn_train.json --target_train_file coqa_train.json --target_predict_file coqa_dev.json --input_dir cnn_models --input_model_file best_model.bin --output_dir cnn2coqa_models --output_model_file CASe_model.bin --logger_path cnn2coqa --CASe_method CASe+E --use_BN
Here, '--do_adaptation' means adaptation mode, you can modify the parameters for '--source_train_file'(training file in the source domain), '--target_train_file'(training file in the target domain), '--target_predict_file'(dev file in the target domain), '--input_dir'(The path of superivsed trained model on the source domain), '--input_model_file'(name of supervised trained model file), 'output_dir'(output path for the adapted model), 'output_model_file'(adapted model file name).
'CASe_method' can be 'CASe' or 'CASe+E', which are standard CASe and CASe with entropy-weighted loss in adversarial learning respectively.
It should be noted that the process will run evaluation on the target dev set with interval depended on parameter '--evaluation_interval', which is just used for watching the performance trend. However, the correct result should be the performance given by final model which will also be logged.
It is used to run prediction for a trained/adapted model. An example is given
python run.py --bert_model bert-base-uncased --do_prediction --do_lower_case --predict_file coqa_dev.json --output_dir cnn2coqa_models --output_model_file CASe_model.bin --logger_path cnn2coqa --use_BN --output_prediction
Here, parameters of '--output_dir' and '--output_model_file' are the model path and model file name for prediction respectively. '--output_prediction' means a json file with predictions will be written to the output path.