thu-spmi / damd-multiwoz

Task-Oriented Dialog Systems that Consider Multiple Appropriate Responses under the Same Context, AAAI 2020.

DAMD + DST + multi-action data augmentation result

zlinao opened this issue

Hi, thanks for releasing the source code.
I tried to reproduce the result of DAMD + DST + multi-action data augmentation (the last row of Table 2 in the paper). However, I got a much higher result, so I am wondering whether I used the gold action during evaluation, as mentioned in the repo:

Generation results are saved in result.csv under the experiment path. Note that only generated system actions are saved, and the system action that overlaps most with the ground truth system action is selected for response generation. Automatic metrics of this response are reported, which can be used to show whether our model captures the reference action in the dataset among the multiple actions generated. However, since this response is generated with access to the ground truth system action, the scores are not meaningful on their own.
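In code, the selection described above amounts to picking the candidate action with the highest token overlap against the gold action. A minimal sketch, with illustrative names (token_f1, select_best_action) rather than the repo's actual functions:

```python
from collections import Counter

def token_f1(pred_tokens, gold_tokens):
    """Token-level F1 between a candidate action span and the gold span."""
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def select_best_action(candidates, gold):
    """Pick the generated action that overlaps most with the ground truth."""
    return max(candidates, key=lambda c: token_f1(c.split(), gold.split()))
```

This corresponds to act_selection_scheme : high_test_act_f1 in the training log below.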

I ran the following command:

python model.py -mode train -cfg seed=775 cuda_device=0 exp_no=aug_sample3 batch_size=60 multi_acts_training=True multi_act_sampling_num=3 enable_dst=True bspn_mode=bspn

This is the result from the log:

vocab_path_train : ./data/multi-woz-processed/vocab
vocab_path_eval : experiments/all_aug_sample3_sd775_lr0.005_bs60_sp5_dc3/vocab
data_path : ./data/multi-woz-processed/
data_file : data_for_damd.json
dev_list : data/multi-woz/valListFile.json
test_list : data/multi-woz/testListFile.json
dbs : {'attraction': 'db/attraction_db_processed.json', 'hospital': 'db/hospital_db_processed.json', 'hotel': 'db/hotel_db_processed.json', 'police': 'db/police_db_processed.json', 'restaurant': 'db/restaurant_db_processed.json', 'taxi': 'db/taxi_db_processed.json', 'train': 'db/train_db_processed.json'}
glove_path : ./data/glove/glove.6B.50d.txt
domain_file_path : data/multi-woz-processed/domain_files.json
slot_value_set_path : db/value_set_processed.json
multi_acts_path : data/multi-woz-processed/multi_act_mapping_train.json
exp_path : experiments/all_aug_sample3_sd775_lr0.005_bs60_sp5_dc3/
log_time : 2020-03-28-10-26-32
mode : train
cuda : True
cuda_device : [0]
exp_no : aug_sample3
seed : 775
exp_domains : ['all']
save_log : True
report_interval : 5
max_nl_length : 60
max_span_length : 30
truncated : False
vocab_size : 3000
embed_size : 50
hidden_size : 100
pointer_dim : 6
enc_layer_num : 1
dec_layer_num : 1
dropout : 0
layer_norm : False
skip_connect : False
encoder_share : False
attn_param_share : False
copy_param_share : False
enable_aspn : True
use_pvaspn : False
enable_bspn : True
bspn_mode : bspn
enable_dspn : False
enable_dst : True
lr : 0.005
label_smoothing : 0.0
lr_decay : 0.5
batch_size : 60
epoch_num : 100
early_stop_count : 5
weight_decay_count : 3
teacher_force : 100
multi_acts_training : True
multi_act_sampling_num : 3
valid_loss : score
eval_load_path : experiments/all_aug_sample3_sd775_lr0.005_bs60_sp5_dc3/
eval_per_domain : False
use_true_pv_resp : True
use_true_prev_bspn : False
use_true_prev_aspn : False
use_true_prev_dspn : False
use_true_curr_bspn : False
use_true_curr_aspn : False
use_true_bspn_for_ctr_eval : False
use_true_domain_for_ctr_eval : False
use_true_db_pointer : False
limit_bspn_vocab : False
limit_aspn_vocab : False
same_eval_as_cambridge : True
same_eval_act_f1_as_hdsa : False
aspn_decode_mode : greedy
beam_width : 5
nbest : 5
beam_diverse_param : 0.2
act_selection_scheme : high_test_act_f1
topk_num : 1
nucleur_p : 0.0
record_mode : False
model_path : experiments/all_aug_sample3_sd775_lr0.005_bs60_sp5_dc3/model.pkl
result_path : experiments/all_aug_sample3_sd775_lr0.005_bs60_sp5_dc3/result.csv
multi_gpu : False
model_parameters : 1985700

-------------------------- All DOMAINS --------------------------
[DST] joint goal:39.7 slot acc: 94.1 slot f1: 82.1 act f1: 50.6
[DST] [not eval name slots] joint goal:52.7 slot acc: 95.7 slot f1: 85.7
[DST] [not eval book slots] joint goal:42.5 slot acc: 95.2 slot f1: 81.8
[DST] [not eval name & book slots] joint goal:58.2 slot acc: 96.8 slot f1: 86.4
[CTR] match: 77.3 success: 66.4 bleu: 19.0
[CTR] address: 85.4; phone: 87.1; postcode: 88.4; reference: 95.5; id: 95.8
[DOM] accuracy: single 95.1 / multi: 12.1 (66)
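
For reference, joint goal accuracy is the strictest of the DST numbers above: a turn counts as correct only if every slot-value pair in the predicted belief state matches the gold state. A minimal sketch, omitting the value normalization an actual evaluator would do:

```python
def joint_goal_accuracy(pred_states, gold_states):
    """Share of turns whose full predicted belief state matches gold exactly.

    Each state is a dict of slot -> value; one wrong slot fails the turn.
    """
    correct = sum(pred == gold for pred, gold in zip(pred_states, gold_states))
    return correct / len(gold_states)
```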

Hi zlinao,

Thanks for trying our code.

Under your experimental setting, the gold actions are not used, because they are never consulted when aspn_decode_mode=greedy. As for the better results you got: I just checked the experimental records for the last row of Table 2 and found that we mistakenly set multi_acts_training=False in those experiments. This means the model for the last row should be labeled "DAMD" instead of "DAMD + multi-action data augmentation" in our paper. The improvement you see should therefore come from the multi-action data augmentation. Sorry for the inconvenience.
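
Concretely, greedy decoding produces a single action hypothesis, so the gold-overlap selection quoted from the README degenerates to the identity; it only comes into play with beam or sampling decoding. A sketch building on the illustrative select_best_action above:

```python
def pick_response_action(candidates, gold_action):
    """Selection step applied before response generation at eval time."""
    if len(candidates) == 1:    # greedy: one hypothesis, so the gold
        return candidates[0]    # action is never actually consulted
    # beam / sampling: several hypotheses, pick the one closest to gold
    return select_best_action(candidates, gold_action)
```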

Feel free to contact us if you have any other questions.

Yichi

Hi Yichi,
Thanks for your reply.
I would like to ask whether the book pointer is always used at both training and inference time, as in:

if not cfg.use_true_db_pointer and 'bspn' in decoded:

Yes. Because no booking status lookup table is available for MultiWOZ, the ground truth booking pointer is used in both training and testing. The database pointer, in contrast, is queried from the database using the generated belief span whenever the ground truth pointer is not used (use_true_db_pointer=False).
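
Putting the two pointers together: given pointer_dim : 6 in the config above, the pointer plausibly concatenates a 4-dim DB-match part with a 2-dim booking part. The sketch below is an assumption about that layout (the bucket boundaries, function names, and booking encoding are all illustrative, not the repo's code); only the split between the generated half and the ground-truth half is what the answer above describes:

```python
def db_match_pointer(match_count):
    """One-hot over DB match-count buckets (boundaries assumed here)."""
    buckets = [match_count == 0, match_count == 1,
               2 <= match_count <= 3, match_count >= 4]
    return [int(b) for b in buckets]

def make_pointer(query_db, belief_span, booking_succeeded):
    """Assumed 6-dim pointer: 4 DB-match dims + 2 booking dims.

    The DB half comes from querying the database with the (generated or
    gold) belief span; the booking half always uses the ground truth,
    since MultiWOZ provides no booking-status lookup table.
    """
    matches = len(query_db(belief_span))               # generated side
    book = [0, 1] if booking_succeeded else [1, 0]     # ground-truth side
    return db_match_pointer(matches) + book
```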

I see, thanks for your reply.