Distant Supervision for DIORA

This is the official repo for our EMNLP 2021 paper: Zhiyang Xu, Andrew Drozdov, Jay Yoon Lee, Tim O'Gorman, Subendhu Rongali, Dylan Finkbeiner, Shilpa Suresh, Mohit Iyyer and Andrew McCallum, "Improved Latent Tree Induction with Distant Supervision via Span Constraints".

Setup
Preparation
Training
Evaluation
Related Works
Citation

Setup

Create environment

conda create -n s-diora python=3.6
conda activate s-diora

Install PyTorch

conda install pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 cudatoolkit=10.1 -c pytorch

Install requirements

pip install -r requirements.txt
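
After installation, it can help to confirm that the environment picked up the pinned versions. The following is a minimal sketch, not part of this repo: the helper names `installed_version` and `check_versions` are hypothetical, and the expected versions are simply copied from the conda command above.

```python
import importlib

# Versions taken from the install command above.
EXPECTED = {"torch": "1.7.0", "torchvision": "0.8.0", "torchaudio": "0.7.0"}


def installed_version(module_name):
    """Return the module's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def check_versions(expected):
    """Map each module name to a (found_version, matches_expected) pair."""
    report = {}
    for name, wanted in expected.items():
        found = installed_version(name)
        # Some builds append a suffix such as "+cu101", so compare by prefix.
        report[name] = (found, found is not None and found.startswith(wanted))
    return report


if __name__ == "__main__":
    for name, (found, ok) in check_versions(EXPECTED).items():
        print("{:12s} found={} ok={}".format(name, found, ok))
```

Run it inside the activated `s-diora` environment; any line with `ok=False` points to a package that is missing or installed at a different version.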

Preparation

Prepare CoNLL2012 dataset
Prepare WSJ Penn Treebank dataset
Prepare MedMentions dataset
Prepare CRAFT dataset

Training

Download the pre-trained DIORA model

mkdir ./download
cd ./download
wget http://diora-naacl-2019.s3.amazonaws.com/diora-checkpoints.zip
unzip diora-checkpoints.zip

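Sample command to train a model on the PTB dataset. Set `EXP_DIR` and `DATA_DIR` before running it, and replace the absolute `/mnt/nfs/...` paths with the locations of your own copies of the corresponding files: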

python main.py \
    --experiment_name pmi_seed3 \
    --default_experiment_directory ${EXP_DIR}/final_model/avg_pmi \
    --batch_size 1 \
    --accum_steps 1 \
    --validation_batch_size 128 \
    --lr 0.001 \
    --train_data_type wsj_emnlp \
    --train_filter_length 0 \
    --train_path ${DATA_DIR}/ptb/ptb-test-diora.parse \
    --validation_data_type wsj_emnlp \
    --validation_path /mnt/nfs/scratch1/zhiyangxu/co-diora-emnlp2021data/data/ptb-test.jsonl \
    --validation_filter_length 0 \
    --elmo_cache_dir ${DATA_DIR}/elmo \
    --emb elmo \
    --eval_after 1 \
    --eval_every_batch -1 \
    --eval_every_epoch 1 \
    --log_every_batch 100 \
    --max_step -1 \
    --max_epoch 40 \
    --opt adam \
    --save_after 0 \
    --num_warmup_steps 3000 \
    --load_model_path /mnt/nfs/scratch1/zhiyangxu/co-diora/experiment/real-world/log/avg_performance_v2_word2vec_390481271/model.best__parsing__f1.pt \
    --model_config '{"diora": {"normalize": "unit", "outside": true, "size": 400}}' \
    --eval_config '{"parsing": {"name": "eval-k1", "cky_mode": "cky", "enabled": true, "outside": false, "ground_truth": "/mnt/nfs/scratch1/zhiyangxu/co-diora-emnlp2021data/data/ptb-test.jsonl", "write":true, "scalars_key": "inside_s_components"}}' \
    --loss_config '{"reconstruct": {"path": "./resource/ptb_top_10k.txt", "weight": 1.0}}'
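
The `--model_config`, `--eval_config` and `--loss_config` flags take JSON strings, which are easy to mis-quote by hand. As a sketch, the strings can be generated from Python dictionaries instead. The helper `json_flag` is hypothetical and not part of this repo; the dictionary contents are copied from the sample command.

```python
import json
import shlex

# Values copied from the sample command above.
model_config = {"diora": {"normalize": "unit", "outside": True, "size": 400}}
loss_config = {"reconstruct": {"path": "./resource/ptb_top_10k.txt", "weight": 1.0}}


def json_flag(name, config):
    """Return '--name <shell-quoted JSON>' ready to paste into a shell command."""
    return "--{} {}".format(name, shlex.quote(json.dumps(config)))


print(json_flag("model_config", model_config))
# --model_config '{"diora": {"normalize": "unit", "outside": true, "size": 400}}'
print(json_flag("loss_config", loss_config))
# --loss_config '{"reconstruct": {"path": "./resource/ptb_top_10k.txt", "weight": 1.0}}'
```

Note that Python's `True` is serialized as JSON `true`, matching the form used in the sample command.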

Training arguments

Argument	Type	Description
`--experiment_name`	`str`	Name of the current experiment
`--default_experiment_directory`	`str`	Where to save the experiment
`--batch_size`	`int`	Size of the batch
`--accum_steps`	`int`	Accumulation steps before the optimizer takes a step
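
To illustrate how `--batch_size` and `--accum_steps` interact, here is a small sketch that assumes the usual gradient-accumulation convention: gradients from `accum_steps` consecutive batches are summed before one optimizer update. Whether this repo handles a trailing partial group differently is not stated here, so treat the helper functions below (both hypothetical) as an approximation.

```python
def effective_batch_size(batch_size, accum_steps):
    """Number of examples contributing to each optimizer update."""
    return batch_size * accum_steps


def optimizer_step_indices(num_batches, accum_steps):
    """0-based indices of the batches after which the optimizer would step."""
    return [i for i in range(num_batches) if (i + 1) % accum_steps == 0]


# With the sample command's settings (batch_size=1, accum_steps=1),
# every batch triggers an update on a single example.
print(effective_batch_size(1, 1))      # 1
print(optimizer_step_indices(6, 2))    # [1, 3, 5]
```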

Evaluation

Download the best models reported in the paper

Model Type	Parsing F1	Constraints	Dataset
NCBL	60.4	`NER`	WSJ Penn Treebank
MINDIFF	59.0	`NER`	WSJ Penn Treebank
RESCALE	61.9	`NER`	WSJ Penn Treebank
STRUCTURE RAMP	59.9	`NER`	WSJ Penn Treebank
NCBL	58.8	`Gazetteer`	WSJ Penn Treebank
NCBL	57.8	`PMI`	WSJ Penn Treebank
NCBL	56.8	`NER`	CRAFT

Related Works

Citation

@inproceedings{diora2021emnlp,
  title={Improved Latent Tree Induction with Distant Supervision via Span Constraints},
  author={Zhiyang Xu and Andrew Drozdov and Jay Yoon Lee and Tim O'Gorman and Subendhu Rongali and Dylan Finkbeiner and Shilpa Suresh and Mohit Iyyer and Andrew McCallum},
  booktitle={EMNLP},
  year={2021},
}
