MLE model training exits without error
ioannist opened this issue
I set up a conda env with python=3.7, pip-installed the requirements, and preprocessed the data.
I tried running MLE training both with and without copy. Training starts, the model is loaded into GPU memory (about 7GB), and after a couple of minutes the process exits without any error.
Ubuntu 18
CUDA 10.2
Here is output.log:
12/03/2020 14:30:53 [INFO] train: Parameters:
12/03/2020 14:30:53 [INFO] train: vocab_size : 50002
12/03/2020 14:30:53 [INFO] train: max_unk_words : 1000
12/03/2020 14:30:53 [INFO] train: words_min_frequency : 0
12/03/2020 14:30:53 [INFO] train: dynamic_dict : True
12/03/2020 14:30:53 [INFO] train: train_discriminator : False
12/03/2020 14:30:53 [INFO] train: word_vec_size : 100
12/03/2020 14:30:53 [INFO] train: share_embeddings : True
12/03/2020 14:30:53 [INFO] train: use_target_encoder : False
12/03/2020 14:30:53 [INFO] train: encoder_type : rnn
12/03/2020 14:30:53 [INFO] train: decoder_type : rnn
12/03/2020 14:30:53 [INFO] train: enc_layers : 1
12/03/2020 14:30:53 [INFO] train: dec_layers : 1
12/03/2020 14:30:53 [INFO] train: encoder_size : 150
12/03/2020 14:30:53 [INFO] train: decoder_size : 300
12/03/2020 14:30:53 [INFO] train: target_encoder_size : 64
12/03/2020 14:30:53 [INFO] train: source_representation_queue_size : 128
12/03/2020 14:30:53 [INFO] train: source_representation_sample_size : 32
12/03/2020 14:30:53 [INFO] train: dropout : 0.1
12/03/2020 14:30:53 [INFO] train: bidirectional : True
12/03/2020 14:30:53 [INFO] train: bridge : copy
12/03/2020 14:30:53 [INFO] train: attn_mode : concat
12/03/2020 14:30:53 [INFO] train: copy_attention : False
12/03/2020 14:30:53 [INFO] train: coverage_attn : False
12/03/2020 14:30:53 [INFO] train: review_attn : False
12/03/2020 14:30:53 [INFO] train: lambda_coverage : 1
12/03/2020 14:30:53 [INFO] train: coverage_loss : False
12/03/2020 14:30:53 [INFO] train: orthogonal_loss : False
12/03/2020 14:30:53 [INFO] train: lambda_orthogonal : 0.03
12/03/2020 14:30:53 [INFO] train: lambda_target_encoder : 0.03
12/03/2020 14:30:53 [INFO] train: separate_present_absent : False
12/03/2020 14:30:53 [INFO] train: manager_mode : 1
12/03/2020 14:30:53 [INFO] train: goal_vector_size : 16
12/03/2020 14:30:53 [INFO] train: goal_vector_mode : 0
12/03/2020 14:30:53 [INFO] train: title_guided : False
12/03/2020 14:30:53 [INFO] train: single_reward : False
12/03/2020 14:30:53 [INFO] train: multiple_rewards : False
12/03/2020 14:30:53 [INFO] train: data : data/kp20k_sorted/
12/03/2020 14:30:53 [INFO] train: vocab : data/kp20k_sorted/
12/03/2020 14:30:53 [INFO] train: custom_data_filename_suffix : False
12/03/2020 14:30:53 [INFO] train: custom_vocab_filename_suffix : False
12/03/2020 14:30:53 [INFO] train: vocab_filename_suffix :
12/03/2020 14:30:53 [INFO] train: data_filename_suffix :
12/03/2020 14:30:53 [INFO] train: save_model : model
12/03/2020 14:30:53 [INFO] train: train_from :
12/03/2020 14:30:53 [INFO] train: gpuid : 0
12/03/2020 14:30:53 [INFO] train: seed : 9527
12/03/2020 14:30:53 [INFO] train: epochs : 25
12/03/2020 14:30:53 [INFO] train: start_epoch : 1
12/03/2020 14:30:53 [INFO] train: param_init : 0.1
12/03/2020 14:30:53 [INFO] train: pre_word_vecs_enc : None
12/03/2020 14:30:53 [INFO] train: pre_word_vecs_dec : None
12/03/2020 14:30:53 [INFO] train: fix_word_vecs_enc : False
12/03/2020 14:30:53 [INFO] train: fix_word_vecs_dec : False
12/03/2020 14:30:53 [INFO] train: batch_size : 32
12/03/2020 14:30:53 [INFO] train: batch_workers : 4
12/03/2020 14:30:53 [INFO] train: optim : adam
12/03/2020 14:30:53 [INFO] train: max_grad_norm : 1
12/03/2020 14:30:53 [INFO] train: truncated_decoder : 0
12/03/2020 14:30:53 [INFO] train: loss_normalization : tokens
12/03/2020 14:30:53 [INFO] train: train_ml : True
12/03/2020 14:30:53 [INFO] train: train_rl : False
12/03/2020 14:30:53 [INFO] train: max_sample_length : 6
12/03/2020 14:30:53 [INFO] train: max_length : 6
12/03/2020 14:30:53 [INFO] train: topk : M
12/03/2020 14:30:53 [INFO] train: reward_type : 0
12/03/2020 14:30:53 [INFO] train: match_type : exact
12/03/2020 14:30:53 [INFO] train: pretrained_model :
12/03/2020 14:30:53 [INFO] train: reward_shaping : False
12/03/2020 14:30:53 [INFO] train: baseline : self
12/03/2020 14:30:53 [INFO] train: mc_rollouts : False
12/03/2020 14:30:53 [INFO] train: num_rollouts : 3
12/03/2020 14:30:53 [INFO] train: delimiter_type : 0
12/03/2020 14:30:53 [INFO] train: one2many : True
12/03/2020 14:30:53 [INFO] train: one2many_mode : 1
12/03/2020 14:30:53 [INFO] train: num_predictions : 1
12/03/2020 14:30:53 [INFO] train: init_perturb_std : 0
12/03/2020 14:30:53 [INFO] train: final_perturb_std : 0
12/03/2020 14:30:53 [INFO] train: perturb_decay_mode : 1
12/03/2020 14:30:53 [INFO] train: perturb_decay_factor : 0.0001
12/03/2020 14:30:53 [INFO] train: perturb_baseline : False
12/03/2020 14:30:53 [INFO] train: regularization_type : 0
12/03/2020 14:30:53 [INFO] train: regularization_factor : 0.0
12/03/2020 14:30:53 [INFO] train: replace_unk : False
12/03/2020 14:30:53 [INFO] train: remove_src_eos : False
12/03/2020 14:30:53 [INFO] train: must_teacher_forcing : False
12/03/2020 14:30:53 [INFO] train: teacher_forcing_ratio : 0
12/03/2020 14:30:53 [INFO] train: scheduled_sampling : False
12/03/2020 14:30:53 [INFO] train: scheduled_sampling_batches : 10000
12/03/2020 14:30:53 [INFO] train: learning_rate : 0.001
12/03/2020 14:30:53 [INFO] train: learning_rate_rl : 5e-05
12/03/2020 14:30:53 [INFO] train: learning_rate_decay_rl : False
12/03/2020 14:30:53 [INFO] train: learning_rate_decay : 0.5
12/03/2020 14:30:53 [INFO] train: start_decay_at : 8
12/03/2020 14:30:53 [INFO] train: start_checkpoint_at : 2
12/03/2020 14:30:53 [INFO] train: decay_method :
12/03/2020 14:30:53 [INFO] train: warmup_steps : 4000
12/03/2020 14:30:53 [INFO] train: checkpoint_interval : 4000
12/03/2020 14:30:53 [INFO] train: disable_early_stop_rl : False
12/03/2020 14:30:53 [INFO] train: early_stop_tolerance : 4
12/03/2020 14:30:53 [INFO] train: timemark : 20201203-143053
12/03/2020 14:30:53 [INFO] train: report_every : 10
12/03/2020 14:30:53 [INFO] train: exp : kp20k.ml.one2many.cat.bi-directional
12/03/2020 14:30:53 [INFO] train: exp_path : exp/kp20k.ml.one2many.cat.bi-directional.20201203-143053
12/03/2020 14:30:53 [INFO] train: model_path : model/kp20k.ml.one2many.cat.bi-directional.20201203-143053
12/03/2020 14:30:53 [INFO] train: delimiter_word : <sep>
12/03/2020 14:30:53 [INFO] train: input_feeding : False
12/03/2020 14:30:53 [INFO] train: copy_input_feeding : False
12/03/2020 14:30:53 [INFO] train: device : cuda:0
12/03/2020 14:30:53 [INFO] data_loader: Loading vocab from disk: data/kp20k_sorted/
12/03/2020 14:30:53 [INFO] data_loader: #(vocab)=344733
12/03/2020 14:30:53 [INFO] data_loader: #(vocab used)=50002
12/03/2020 14:30:53 [INFO] data_loader: Loading train and validate data from 'data/kp20k_sorted/'
12/03/2020 14:30:53 [INFO] data_loader: #(train data size: #(batch)=32
12/03/2020 14:30:53 [INFO] data_loader: #(valid data size: #(batch)=32
12/03/2020 14:30:53 [INFO] train: Time for loading the data: 0.3
12/03/2020 14:30:53 [INFO] train: ====================== Model Parameters =========================
12/03/2020 14:30:53 [INFO] train: Training a seq2seq model
12/03/2020 14:30:56 [INFO] train_ml: ====================== Start Training =========================
12/03/2020 14:33:36 [INFO] train: Time for training: 162.9
I have the same problem. How did you solve it?
I found out that the problem is with the dataset the author provides. Using the dataset from here solved the issue.
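For anyone hitting the same symptom, a quick sanity check on output.log may help confirm that the data is the culprit before swapping datasets. The sketch below is a hypothetical helper, not part of the repository. It assumes the log format shown above, and that the `#(batch)=` value is the number of batches in the loader, as its label suggests. It extracts the batch counts, the batch size, and the reported training time.

```python
import re

# Hypothetical helper (not part of the repository): summarize how much data
# a run actually saw, based on the log format shown in this issue.
BATCH_RE = re.compile(r"#\((train|valid) data size: #\(batch\)=(\d+)")
BATCH_SIZE_RE = re.compile(r"train: batch_size : (\d+)")
TRAIN_TIME_RE = re.compile(r"Time for training: ([\d.]+)")


def summarize_log(text):
    """Return batch counts, batch size and training time found in the log text."""
    summary = {}
    for line in text.splitlines():
        m = BATCH_RE.search(line)
        if m:
            # Key becomes "train_batches" or "valid_batches".
            summary[m.group(1) + "_batches"] = int(m.group(2))
            continue
        m = BATCH_SIZE_RE.search(line)
        if m:
            summary["batch_size"] = int(m.group(1))
            continue
        m = TRAIN_TIME_RE.search(line)
        if m:
            summary["train_seconds"] = float(m.group(1))
    if "train_batches" in summary and "batch_size" in summary:
        # Upper bound only: the last batch may hold fewer examples.
        summary["max_train_examples"] = (
            summary["train_batches"] * summary["batch_size"]
        )
    return summary


# Four lines copied from the log in this issue.
SAMPLE = """\
12/03/2020 14:30:53 [INFO] train: batch_size : 32
12/03/2020 14:30:53 [INFO] data_loader: #(train data size: #(batch)=32
12/03/2020 14:30:53 [INFO] data_loader: #(valid data size: #(batch)=32
12/03/2020 14:33:36 [INFO] train: Time for training: 162.9
"""

print(summarize_log(SAMPLE))
```

Under those assumptions, the log above works out to 32 training batches of size 32, so at most 1024 training examples. That is tiny for KP20k, and the final "Time for training" line suggests the run may simply have finished (or stopped early) on a very small dataset rather than crashed, which would fit the dataset explanation above.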