allennlp-distributed-training
This repo holds code and configs for running example AllenNLP experiments with DistributedDataParallel support.
The training_config directory has two versions of the same set of experiments. The ones in the distributed_data_parallel directory differ mainly in their dataset readers: these are replicas of the original AllenNLP readers, with a minor modification to support distributed sampling.
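The exact change is not spelled out here, but the usual approach is round-robin sharding inside `_read`, so each worker yields a disjoint slice of the data. A minimal sketch, assuming the stock AllenNLP `DatasetReader` API (`ShardedSnliReader`, the line-per-example file format, and the sharding scheme are illustrative assumptions, not this repo's actual readers):

```python
import torch.distributed as dist
from allennlp.data import Instance
from allennlp.data.dataset_readers import DatasetReader
from allennlp.data.fields import MetadataField

class ShardedSnliReader(DatasetReader):
    """Hypothetical reader: each worker keeps a disjoint slice of the file."""

    def _read(self, file_path: str):
        rank = dist.get_rank() if dist.is_initialized() else 0
        world_size = dist.get_world_size() if dist.is_initialized() else 1
        with open(file_path) as data_file:
            for i, line in enumerate(data_file):
                # Round-robin sharding: rank r keeps lines where i % world_size == r.
                if i % world_size != rank:
                    continue
                yield self.text_to_instance(line.strip())

    def text_to_instance(self, text: str) -> Instance:
        # Trivial field just to keep the sketch self-contained.
        return Instance({"source": MetadataField(text)})
```

Without sharding, every worker would train on the full dataset each epoch, duplicating work and inflating the effective epoch size.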
To run the distributed experiments, install AllenNLP from this fork: https://github.com/scarecrow1123/allennlp/tree/torch-distributed.
```
conda create -n allennlp_distributed python=3.7
conda activate allennlp_distributed
git clone https://github.com/scarecrow1123/allennlp
cd allennlp
git checkout torch-distributed
pip install .
```
Then train one of the distributed experiments:

```
allennlp train training_config/distributed_data_parallel/esim.jsonnet --include-package distributed_training -s output/
```
To run without the distributed setup, do the usual AllenNLP installation and use the experiments in training_config/data_parallel/.
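For context, the two config directories map onto PyTorch's two multi-GPU wrappers. A minimal illustrative contrast (the stand-in `model` and the process-group setup below are assumptions, not this repo's trainer code):

```python
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DataParallel, DistributedDataParallel

model = nn.Linear(10, 2)  # stand-in for ESIM/BiDAF

# data_parallel/: a single process replicates the model across GPUs on every
# forward pass and gathers outputs on GPU 0, which often becomes a bottleneck.
dp_model = DataParallel(model.cuda(), device_ids=[0, 1, 2, 3])

# distributed_data_parallel/: one process per GPU; gradients are all-reduced
# during backward, so no single device does the gathering.
dist.init_process_group(backend="nccl")  # assumes rank/world-size env vars are set
local_rank = dist.get_rank()
ddp_model = DistributedDataParallel(model.cuda(local_rank), device_ids=[local_rank])
```

This difference is what the speed comparison below is measuring.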
Speed Comparison: Time taken to train one epoch (averaged over 3 epochs)
GPU: NVIDIA RTX 2080 Ti
NOTE: The time reported does not correspond to AllenNLP's training_duration metric; it is the wall-clock time taken by the Trainer._train_epoch method.
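The numbers were gathered roughly like the following sketch (illustrative only; `timed_epochs` is not a function in this repo):

```python
import time

def timed_epochs(trainer, num_epochs: int = 3) -> float:
    """Average wall-clock seconds per call to Trainer._train_epoch."""
    durations = []
    for epoch in range(num_epochs):
        start = time.perf_counter()
        trainer._train_epoch(epoch)  # the method actually being timed
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)
```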
Experiment | Single GPU | 2x Data Parallel | 2x Distributed | 4x Data Parallel | 4x Distributed |
---|---|---|---|---|---|
esim.jsonnet (400K SNLI samples) | 4m 15s | NA | NA | 4m 30s | 2m 13s |
bidaf.jsonnet | 5m 44s | NA | NA | 4m 10s | 2m 5s |