xinyadu / doc_event_role

Reproducing score

spookyQubit opened this issue · comments

Hi @xinyadu, thanks a lot for sharing the code. I am trying to reproduce the scores from Table 1 of the paper. Thanks to your documentation, I was able to follow the steps outlined in this repo closely and run the experiment all the way through. However, the scores I get at the end do not match those reported in Table 1.

The steps I took were:

1. Environment setup

$ git clone https://github.com/xinyadu/doc_event_role.git
$ cd doc_event_role
$ touch environment.yml  # add the requirements in this file
$ cat environment.yml 
name: muc_seq_acl
channels:
  - defaults
dependencies:
  - python==3.5.6
  - pip==20.1.1
  - spacy==2.0.12
  - cudatoolkit==9.2
  - pip:
    - torch==0.4.1
    - pytorch-pretrained-bert==0.6.2
    - typing==3.7.4.3
$ conda env create -f environment.yml
$ conda activate muc_seq_acl
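After activation, a quick sanity check that the pins resolved as intended (a minimal sketch; the "expect" comments just restate the environment.yml above):

# check_env.py: confirm the pinned packages resolved as expected
import sys
import spacy
import torch

print(sys.version.split()[0])     # expect 3.5.6 per environment.yml
print(spacy.__version__)          # expect 2.0.12
print(torch.__version__)          # expect 0.4.1
print(torch.cuda.is_available())  # True if cudatoolkit 9.2 matches the driver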

2. Download the spaCy model

$ python3 -m spacy download en_core_web_sm

3. Download GloVe

Download glove.6B.100d.txt and place it at doc_event_role/model/code/utils/glove.6B.100d.txt.
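If it helps, a minimal download sketch in Python (assuming the standard Stanford mirror; the repo does not pin a source, so any copy of glove.6B.100d.txt works):

# get_glove.py: fetch glove.6B.zip and extract the 100d vectors
# (assumes the standard Stanford mirror; any glove.6B.100d.txt copy works)
import io
import urllib.request
import zipfile

URL = "http://nlp.stanford.edu/data/glove.6B.zip"  # ~860 MB archive
OUT = "model/code/utils/glove.6B.100d.txt"         # path the config expects

with urllib.request.urlopen(URL) as resp:
    archive = zipfile.ZipFile(io.BytesIO(resp.read()))
with archive.open("glove.6B.100d.txt") as src, open(OUT, "wb") as dst:
    dst.write(src.read())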

4. Train

$ cd model/code
$ mkdir data_seq_tag_pairs
$ python gen_seq_tag_pairs.py --div train
$ mkdir model_save
$ mkdir model_out
$ python main.py --config config/example.config
$ python seq_to_extracts.py --seqfile model_out/multi_bert.out  # generates pred.json for eval.py

The Dev scores I see during training are (a small script for extracting the best F-1 follows the log):

Dev: time: 193.15s, speed: 15.20st/s; p: 28.1324, r: 70.0204, f: 40.1383
Dev: time: 124.92s, speed: 15.35st/s; p: 62.9229, r: 39.4083, f: 48.4639
Dev: time: 124.47s, speed: 15.36st/s; p: 66.1268, r: 43.9025, f: 52.7702
Dev: time: 133.18s, speed: 15.39st/s; p: 52.8802, r: 66.1970, f: 58.7939
Dev: time: 126.86s, speed: 15.37st/s; p: 63.4102, r: 48.3327, f: 54.8542
Dev: time: 124.93s, speed: 15.33st/s; p: 65.0108, r: 44.1431, f: 52.5823
Dev: time: 123.39s, speed: 15.36st/s; p: 74.8237, r: 37.4043, f: 49.8758
Dev: time: 125.21s, speed: 15.34st/s; p: 72.0788, r: 46.6892, f: 56.6702
Dev: time: 131.71s, speed: 15.31st/s; p: 57.4742, r: 69.6806, f: 62.9915
Dev: time: 129.72s, speed: 15.36st/s; p: 60.7873, r: 62.0204, f: 61.3977
Dev: time: 136.96s, speed: 15.33st/s; p: 47.4962, r: 73.2321, f: 57.6211
Dev: time: 128.23s, speed: 15.33st/s; p: 63.9869, r: 58.7949, f: 61.2811
Dev: time: 131.07s, speed: 15.36st/s; p: 58.6679, r: 62.1833, f: 60.3745
Dev: time: 126.07s, speed: 15.35st/s; p: 65.7156, r: 50.6334, f: 57.1970
Dev: time: 123.92s, speed: 15.33st/s; p: 72.3295, r: 41.9134, f: 53.0724
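For reference, a quick way to pull the best dev F-1 out of such a log; a minimal sketch assuming the "Dev: ...; p: X, r: Y, f: Z" line format above and a hypothetical saved log file train.log:

# best_dev_f1.py: report the best dev F-1 seen during training
# (assumes the "Dev: ...; p: X, r: Y, f: Z" log format; train.log is hypothetical)
import re

with open("train.log") as fh:
    f1s = [float(m.group(1)) for m in re.finditer(r"Dev:.*f: ([\d.]+)", fh.read())]
print(max(f1s))  # 62.9915 for the run above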

5. Test scores

$ python eval.py --goldfile ./data/processed/test.json --predfile ./model/code/pred.json

================Exact Match=================
Prec, Recall, F-1
PerpInd
55.1020 39.8649 46.2611
PerpOrg
40.9091 57.1429 47.6821
Target
45.3744 63.4483 52.9105
Victim
46.5753 71.5789 56.4315
Weapon
44.8980 72.1311 55.3459
MACRO average:
46.5718 60.8332 52.7557
===============Head Noun Match===============
Prec, Recall, F-1
PerpInd
59.5745 42.5676 49.6552
PerpOrg
47.7612 65.4762 55.2330
Target
60.2804 78.6207 68.2397
Victim
49.6503 72.6316 58.9815
Weapon
48.9362 75.4098 59.3548
MACRO average:
53.2405 66.9412 59.309
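A side note on aggregation: the MACRO average P and R printed by eval.py are plain means over the five roles, and the macro F-1 matches the harmonic mean of those two means rather than the mean of the per-role F-1s. A minimal check against the exact-match numbers above:

# macro_check.py: reproduce the exact-match "MACRO average" line
p = [55.1020, 40.9091, 45.3744, 46.5753, 44.8980]  # per-role precision
r = [39.8649, 57.1429, 63.4483, 71.5789, 72.1311]  # per-role recall
macro_p = sum(p) / len(p)
macro_r = sum(r) / len(r)
macro_f = 2 * macro_p * macro_r / (macro_p + macro_r)
print(round(macro_p, 4), round(macro_r, 4), round(macro_f, 4))
# 46.5718 60.8332 52.7557, matching the MACRO average line above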

What I am not sure about is whether this score corresponds to the Multi-Granularity Reader row in Table 1 of the paper, or to some other row. Also, was the config file used for the Multi-Granularity Reader experiment different from example.config?

I have the same question. Is your problem solved?

The score should correspond to the model proposed in the paper.

Why are the F1 score, precision, and recall coming out as -1? I ran the code for just one iteration.

Seed num: 42
MODEL: train
Load pretrained word embedding, norm: False, dir: utils/glove.6B.100d.txt
Embedding:
     pretrain word:400000, prefect match:9915, case_match:0, oov:2271, oov%:0.18634610650693362
Training model...
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
DATA SUMMARY START:
 I/O:
     Start   Sequence   Laebling   task...
     Tag          scheme: BIO
     Split         token:  ||| 
     MAX SENTENCE LENGTH: 200
     MAX   WORD   LENGTH: -1
     Number   normalized: False
     Word  alphabet size: 12187
     Char  alphabet size: 53
     Label alphabet size: 12
     Word embedding  dir: utils/glove.6B.100d.txt
     Char embedding  dir: None
     Word embedding size: 100
     Char embedding size: 30
     Norm   word     emb: False
     Norm   char     emb: False
     Train  file directory: ../data_seq_tag_pairs/train_full
     Dev    file directory: ../data_seq_tag_pairs/dev_full
     Test   file directory: ../data_seq_tag_pairs/test
     Raw    file directory: None
     Dset   file directory: model_save/multi_bert.dset
     Model  file directory: model_save/multi_bert
     Loadmodel   directory: model_save/multi_bert.best.model
     Decode file directory: model_out/multi_bert.out
     Train instance number: 9409
     Dev   instance number: 0
     Test  instance number: 1112
     Raw   instance number: 0
     FEATURE num: 0
 ++++++++++++++++++++++++++++++++++++++++
 Model Network:
     Model        use_crf: True
     Model word extractor: LSTM
     Model       use_char: False
 ++++++++++++++++++++++++++++++++++++++++
 Training:
     Optimizer: SGD
     Iteration: 1
     BatchSize: 5
     Average  batch   loss: False
 ++++++++++++++++++++++++++++++++++++++++
 Hyperparameters:
     Hyper              lr: 0.015
     Hyper        lr_decay: 0.05
     Hyper         HP_clip: None
     Hyper        momentum: 0.0
     Hyper              l2: 1e-08
     Hyper      hidden_dim: 200
     Hyper         dropout: 0.4
     Hyper      lstm_layer: 1
     Hyper          bilstm: True
     Hyper             GPU: True
DATA SUMMARY END.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
build sequence labeling network...
word feature extractor:  LSTM
use crf:  True
build word sequence feature extractor: LSTM...
build word representation...
100% 407873900/407873900 [00:07<00:00, 56807815.17B/s]
build CRF...

Epoch: 0/1
 Learning rate is set as: 0.015
Shuffle: first input word list: [2, 379, 761, 9, 607, 920, 542, 830, 2393, 7616, 470, 38, 410, 165, 101, 161, 47, 1919, 30, 751, 49, 33, 1351, 103, 185, 34, 119, 120, 42, 547, 2, 800, 101, 542, 8, 473, 378, 38, 7616, 136, 137, 1491, 538, 9, 257, 631, 412, 10751, 47, 410, 3866, 42, 3395, 3107, 5459, 47, 410, 741, 2464, 82, 10752, 42, 82, 103, 9774, 3549, 38, 412]
/content/drive/MyDrive/LDP/Granularity/doc_event_role/model/code/model/crf.py:92: UserWarning: indexing with dtype torch.uint8 is now deprecated, please use a dtype torch.bool instead. (Triggered internally at  ../aten/src/ATen/native/IndexingUtils.h:28.)
  masked_cur_partition = cur_partition.masked_select(mask_idx)
/content/drive/MyDrive/LDP/Granularity/doc_event_role/model/code/model/crf.py:97: UserWarning: masked_scatter_ received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead. (Triggered internally at  ../aten/src/ATen/native/cuda/IndexKernel.cpp:62.)
  partition.masked_scatter_(mask_idx, masked_cur_partition)
/content/drive/MyDrive/LDP/Granularity/doc_event_role/model/code/model/crf.py:246: UserWarning: indexing with dtype torch.uint8 is now deprecated, please use a dtype torch.bool instead. (Triggered internally at  ../aten/src/ATen/native/IndexingUtils.h:28.)
  tg_energy = tg_energy.masked_select(mask.transpose(1,0))
/content/drive/MyDrive/LDP/Granularity/doc_event_role/model/code/model/crf.py:159: UserWarning: masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead. (Triggered internally at  ../aten/src/ATen/native/cuda/Indexing.cu:967.)
  cur_bp.masked_fill_(mask[idx].view(batch_size, 1).expand(batch_size, tag_size), 0)
/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py:175: UserWarning: masked_scatter_ received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead. (Triggered internally at  ../aten/src/ATen/native/cuda/IndexKernel.cpp:62.)
  allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py:175: UserWarning: masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated,please use a mask with dtype torch.bool instead. (Triggered internally at  ../aten/src/ATen/native/cuda/Indexing.cu:967.)
  allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py:175: UserWarning: indexing with dtype torch.uint8 is now deprecated, please use a dtype torch.bool instead. (Triggered internally at  ../aten/src/ATen/native/IndexingUtils.h:28.)
  allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
     Instance: 500; Time: 41.48s; loss: 33301.9307; acc: 32481.0/35907.0=0.9046
     Instance: 1000; Time: 43.34s; loss: 17721.0560; acc: 67413.0/73419.0=0.9182
     Instance: 1500; Time: 41.94s; loss: 9585.8511; acc: 101886.0/109830.0=0.9277
     Instance: 2000; Time: 41.54s; loss: 6581.4640; acc: 135592.0/145601.0=0.9313
     Instance: 2500; Time: 42.53s; loss: 4454.5928; acc: 170963.0/182673.0=0.9359
     Instance: 3000; Time: 43.69s; loss: 3632.2205; acc: 206277.0/219703.0=0.9389
     Instance: 3500; Time: 41.45s; loss: 3705.0391; acc: 240547.0/255728.0=0.9406
     Instance: 4000; Time: 42.23s; loss: 3311.3256; acc: 275859.0/292731.0=0.9424
     Instance: 4500; Time: 43.09s; loss: 3200.9517; acc: 309967.0/328569.0=0.9434
     Instance: 5000; Time: 41.54s; loss: 3095.8545; acc: 344268.0/364550.0=0.9444
     Instance: 5500; Time: 42.88s; loss: 3549.6006; acc: 379845.0/402111.0=0.9446
     Instance: 6000; Time: 43.32s; loss: 2930.5935; acc: 414651.0/438744.0=0.9451
     Instance: 6500; Time: 41.57s; loss: 2906.9250; acc: 448489.0/474379.0=0.9454
     Instance: 7000; Time: 42.23s; loss: 2729.6758; acc: 483452.0/511023.0=0.9460
     Instance: 7500; Time: 42.82s; loss: 3256.7528; acc: 518431.0/547845.0=0.9463
     Instance: 8000; Time: 42.00s; loss: 2533.6289; acc: 553023.0/584109.0=0.9468
     Instance: 8500; Time: 41.26s; loss: 2746.0062; acc: 586990.0/619886.0=0.9469
     Instance: 9000; Time: 41.76s; loss: 2765.5623; acc: 621129.0/655800.0=0.9471
     Instance: 9409; Time: 35.17s; loss: 2204.6501; acc: 649257.0/685466.0=0.9472
Epoch: 0 training finished. Time: 795.83s, speed: 11.82st/s,  total loss: 114213.68103027344
totalloss: 114213.68103027344
Dev: time: 4.87s, speed: 0.00st/s; p: -1.0000, r: -1.0000, f: -1.0000
!!!Exceed previous best f score: -10
Save current best model in file: model_save/multi_bert.0.model



MODEL: decode
Load Model from file:  model_save/multi_bert
build sequence labeling network...
word feature extractor:  LSTM
use crf:  True
build word sequence feature extractor: LSTM...
build word representation...
build CRF...
Predict test result has been written into file. model_out/multi_bert.out
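Regarding the -1 scores above: the data summary reports "Dev instance number: 0" (and the dev pass runs at 0.00st/s), so there are no dev instances to score, and NCRF++-style evaluation returns -1 as a sentinel in that case. Generating the dev sequence-tag pairs as well (the config reads them from ../data_seq_tag_pairs/dev_full; presumably gen_seq_tag_pairs.py accepts other --div values) should yield real dev scores. A minimal sketch of the sentinel behavior, a hypothetical simplification rather than the repo's exact code:

# prf_sentinel.py: why an empty dev set scores p/r/f = -1
# (hypothetical simplification of NCRF++-style F-measure code)
def prf(gold_num, predict_num, right_num):
    precision = right_num / predict_num if predict_num else -1
    recall = right_num / gold_num if gold_num else -1
    if precision == -1 or recall == -1 or precision + recall <= 0:
        return precision, recall, -1
    return precision, recall, 2 * precision * recall / (precision + recall)

print(prf(0, 0, 0))  # (-1, -1, -1), matching the "Dev: ... p: -1.0000" line above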