====================================================================================================
- data : ../data/wikitext-103/
- dataset : wt103
- n_layer : 16
- n_head : 10
- d_head : 41
- d_embed : 410
- d_model : 410
- d_inner : 2100
- dropout : 0.1
- dropatt : 0.0
- init : normal
- emb_init : normal
- init_range : 0.1
- emb_init_range : 0.01
- init_std : 0.02
- proj_init_std : 0.01
- optim : adam
- lr : 0.00025
- mom : 0.0
- scheduler : cosine
- warmup_step : 0
- decay_rate : 0.5
- lr_min : 0.0
- clip : 0.25
- clip_nonemb : False
- max_step : 200000
- batch_size : 16
- batch_chunk : 4
- tgt_len : 150
- eval_tgt_len : 150
- ext_len : 0
- mem_len : 150
- not_tied : False
- seed : 1111
- cuda : True
- adaptive : True
- div_val : 1
- pre_lnorm : False
- varlen : False
- multi_gpu : False
- log_interval : 200
- eval_interval : 4000
- work_dir : LM-TFM-wt103/20210921-182940
- restart : False
- restart_dir :
- debug : False
- same_length : False
- attn_type : 0
- clamp_len : -1
- eta_min : 0.0
- gpu0_bsz : 4
- max_eval_steps : -1
- sample_softmax : -1
- patience : 0
- finetune_v2 : False
- finetune_v3 : False
- fp16 : False
- static_loss_scale : 1
- dynamic_loss_scale : False
- tied : True
- n_token : 267735
- n_all_param : 151107538
- n_nonemb_param : 41066400
====================================================================================================
#params = 151107538
#non emb params = 41066400
| epoch 1 step 200 | 200 batches | lr 0.00025 | ms/batch 493.63 | loss 7.13 | ppl 1254.910
| epoch 1 step 400 | 400 batches | lr 0.00025 | ms/batch 493.89 | loss 6.40 | ppl 603.377
| epoch 1 step 600 | 600 batches | lr 0.00025 | ms/batch 493.86 | loss 6.09 | ppl 443.252
| epoch 1 step 800 | 800 batches | lr 0.00025 | ms/batch 494.39 | loss 5.95 | ppl 383.134
| epoch 1 step 1000 | 1000 batches | lr 0.00025 | ms/batch 495.03 | loss 5.79 | ppl 325.725
| epoch 1 step 1200 | 1200 batches | lr 0.00025 | ms/batch 496.30 | loss 5.67 | ppl 289.864
| epoch 1 step 1400 | 1400 batches | lr 0.00025 | ms/batch 494.21 | loss 5.56 | ppl 261.054
| epoch 1 step 1600 | 1600 batches | lr 0.00025 | ms/batch 494.64 | loss 5.48 | ppl 239.977
| epoch 1 step 1800 | 1800 batches | lr 0.00025 | ms/batch 494.97 | loss 5.38 | ppl 217.782
| epoch 1 step 2000 | 2000 batches | lr 0.00025 | ms/batch 494.88 | loss 5.31 | ppl 201.882
| epoch 1 step 2200 | 2200 batches | lr 0.00025 | ms/batch 495.40 | loss 5.33 | ppl 206.980
| epoch 1 step 2400 | 2400 batches | lr 0.00025 | ms/batch 494.62 | loss 5.21 | ppl 183.674
| epoch 1 step 2600 | 2600 batches | lr 0.00025 | ms/batch 496.11 | loss 5.19 | ppl 179.883
| epoch 1 step 2800 | 2800 batches | lr 0.00025 | ms/batch 496.94 | loss 5.19 | ppl 179.249
| epoch 1 step 3000 | 3000 batches | lr 0.00025 | ms/batch 497.89 | loss 5.15 | ppl 172.448
| epoch 1 step 3200 | 3200 batches | lr 0.00025 | ms/batch 496.77 | loss 5.07 | ppl 159.928
| epoch 1 step 3400 | 3400 batches | lr 0.00025 | ms/batch 497.92 | loss 5.04 | ppl 154.906
| epoch 1 step 3600 | 3600 batches | lr 0.00025 | ms/batch 495.40 | loss 4.96 | ppl 142.222
| epoch 1 step 3800 | 3800 batches | lr 0.00025 | ms/batch 495.15 | loss 5.04 | ppl 155.004
| epoch 1 step 4000 | 4000 batches | lr 0.00025 | ms/batch 493.45 | loss 4.91 | ppl 136.168
----------------------------------------------------------------------------------------------------
| Eval 1 at step 4000 | time: 1991.50s | valid loss 4.92 | valid ppl 137.152
----------------------------------------------------------------------------------------------------
| epoch 1 step 4200 | 4200 batches | lr 0.00025 | ms/batch 553.14 | loss 4.93 | ppl 138.226
| epoch 1 step 4400 | 4400 batches | lr 0.00025 | ms/batch 494.08 | loss 4.86 | ppl 128.969
| epoch 1 step 4600 | 4600 batches | lr 0.00025 | ms/batch 495.18 | loss 4.88 | ppl 131.913
| epoch 1 step 4800 | 4800 batches | lr 0.00025 | ms/batch 495.28 | loss 4.90 | ppl 133.687
| epoch 1 step 5000 | 5000 batches | lr 0.00025 | ms/batch 494.85 | loss 4.85 | ppl 127.346
| epoch 1 step 5200 | 5200 batches | lr 0.00025 | ms/batch 495.66 | loss 4.82 | ppl 123.937
| epoch 1 step 5400 | 5400 batches | lr 0.00025 | ms/batch 495.57 | loss 4.82 | ppl 124.102
| epoch 1 step 5600 | 5600 batches | lr 0.00025 | ms/batch 496.71 | loss 4.82 | ppl 123.789
| epoch 1 step 5800 | 5800 batches | lr 0.000249 | ms/batch 496.55 | loss 4.82 | ppl 123.759
| epoch 1 step 6000 | 6000 batches | lr 0.000249 | ms/batch 494.19 | loss 4.71 | ppl 110.718
| epoch 1 step 6200 | 6200 batches | lr 0.000249 | ms/batch 496.46 | loss 4.78 | ppl 118.613
| epoch 1 step 6400 | 6400 batches | lr 0.000249 | ms/batch 496.31 | loss 4.73 | ppl 113.399
| epoch 1 step 6600 | 6600 batches | lr 0.000249 | ms/batch 495.11 | loss 4.70 | ppl 110.007
| epoch 1 step 6800 | 6800 batches | lr 0.000249 | ms/batch 494.73 | loss 4.70 | ppl 109.674
| epoch 1 step 7000 | 7000 batches | lr 0.000249 | ms/batch 496.12 | loss 4.67 | ppl 106.686
| epoch 1 step 7200 | 7200 batches | lr 0.000249 | ms/batch 495.62 | loss 4.61 | ppl 100.260
| epoch 1 step 7400 | 7400 batches | lr 0.000249 | ms/batch 495.47 | loss 4.70 | ppl 109.966
| epoch 1 step 7600 | 7600 batches | lr 0.000249 | ms/batch 495.16 | loss 4.63 | ppl 102.847
| epoch 1 step 7800 | 7800 batches | lr 0.000249 | ms/batch 496.97 | loss 4.68 | ppl 107.495
| epoch 1 step 8000 | 8000 batches | lr 0.000249 | ms/batch 496.60 | loss 4.60 | ppl 99.703
----------------------------------------------------------------------------------------------------
| Eval 2 at step 8000 | time: 1992.63s | valid loss 4.52 | valid ppl 92.106
----------------------------------------------------------------------------------------------------
| epoch 1 step 8200 | 8200 batches | lr 0.000249 | ms/batch 557.80 | loss 4.58 | ppl 97.425
| epoch 1 step 8400 | 8400 batches | lr 0.000249 | ms/batch 494.54 | loss 4.57 | ppl 96.720
| epoch 1 step 8600 | 8600 batches | lr 0.000249 | ms/batch 495.85 | loss 4.58 | ppl 97.233
| epoch 1 step 8800 | 8800 batches | lr 0.000249 | ms/batch 495.76 | loss 4.57 | ppl 96.731
| epoch 1 step 9000 | 9000 batches | lr 0.000249 | ms/batch 496.09 | loss 4.57 | ppl 96.297
| epoch 1 step 9200 | 9200 batches | lr 0.000249 | ms/batch 495.72 | loss 4.52 | ppl 92.185
| epoch 1 step 9400 | 9400 batches | lr 0.000249 | ms/batch 495.37 | loss 4.53 | ppl 92.500
| epoch 1 step 9600 | 9600 batches | lr 0.000249 | ms/batch 496.28 | loss 4.54 | ppl 93.689
| epoch 1 step 9800 | 9800 batches | lr 0.000249 | ms/batch 497.26 | loss 4.54 | ppl 93.596
| epoch 1 step 10000 | 10000 batches | lr 0.000248 | ms/batch 495.58 | loss 4.51 | ppl 90.923
| epoch 1 step 10200 | 10200 batches | lr 0.000248 | ms/batch 493.49 | loss 4.45 | ppl 86.041
| epoch 1 step 10400 | 10400 batches | lr 0.000248 | ms/batch 495.47 | loss 4.52 | ppl 92.075
| epoch 1 step 10600 | 10600 batches | lr 0.000248 | ms/batch 496.26 | loss 4.56 | ppl 95.282
| epoch 1 step 10800 | 10800 batches | lr 0.000248 | ms/batch 496.28 | loss 4.56 | ppl 95.310
| epoch 1 step 11000 | 11000 batches | lr 0.000248 | ms/batch 495.41 | loss 4.44 | ppl 84.739
| epoch 1 step 11200 | 11200 batches | lr 0.000248 | ms/batch 494.72 | loss 4.45 | ppl 85.724
| epoch 1 step 11400 | 11400 batches | lr 0.000248 | ms/batch 495.96 | loss 4.49 | ppl 88.765
| epoch 1 step 11600 | 11600 batches | lr 0.000248 | ms/batch 495.57 | loss 4.48 | ppl 88.031
| epoch 1 step 11800 | 11800 batches | lr 0.000248 | ms/batch 493.73 | loss 4.40 | ppl 81.858
| epoch 1 step 12000 | 12000 batches | lr 0.000248 | ms/batch 495.55 | loss 4.43 | ppl 83.758
----------------------------------------------------------------------------------------------------
| Eval 3 at step 12000 | time: 1992.34s | valid loss 4.36 | valid ppl 78.589
----------------------------------------------------------------------------------------------------
| epoch 1 step 12200 | 12200 batches | lr 0.000248 | ms/batch 559.15 | loss 4.34 | ppl 76.575
| epoch 1 step 12400 | 12400 batches | lr 0.000248 | ms/batch 495.18 | loss 4.49 | ppl 88.876
| epoch 1 step 12600 | 12600 batches | lr 0.000248 | ms/batch 494.69 | loss 4.40 | ppl 81.145
| epoch 1 step 12800 | 12800 batches | lr 0.000247 | ms/batch 494.14 | loss 4.37 | ppl 79.132
| epoch 1 step 13000 | 13000 batches | lr 0.000247 | ms/batch 495.72 | loss 4.36 | ppl 77.931
| epoch 1 step 13200 | 13200 batches | lr 0.000247 | ms/batch 495.80 | loss 4.39 | ppl 80.577
| epoch 1 step 13400 | 13400 batches | lr 0.000247 | ms/batch 495.00 | loss 4.33 | ppl 75.747
| epoch 1 step 13600 | 13600 batches | lr 0.000247 | ms/batch 496.33 | loss 4.44 | ppl 84.548
| epoch 1 step 13800 | 13800 batches | lr 0.000247 | ms/batch 493.89 | loss 4.34 | ppl 76.400
| epoch 1 step 14000 | 14000 batches | lr 0.000247 | ms/batch 494.09 | loss 4.33 | ppl 75.718
| epoch 1 step 14200 | 14200 batches | lr 0.000247 | ms/batch 494.65 | loss 4.32 | ppl 75.012
| epoch 1 step 14400 | 14400 batches | lr 0.000247 | ms/batch 495.00 | loss 4.39 | ppl 80.497
| epoch 1 step 14600 | 14600 batches | lr 0.000247 | ms/batch 495.23 | loss 4.32 | ppl 75.051
| epoch 1 step 14800 | 14800 batches | lr 0.000247 | ms/batch 496.61 | loss 4.41 | ppl 82.513
| epoch 1 step 15000 | 15000 batches | lr 0.000247 | ms/batch 496.71 | loss 4.43 | ppl 83.742
| epoch 1 step 15200 | 15200 batches | lr 0.000246 | ms/batch 495.23 | loss 4.32 | ppl 74.858
| epoch 1 step 15400 | 15400 batches | lr 0.000246 | ms/batch 494.87 | loss 4.31 | ppl 74.685
| epoch 1 step 15600 | 15600 batches | lr 0.000246 | ms/batch 495.58 | loss 4.31 | ppl 74.456
| epoch 1 step 15800 | 15800 batches | lr 0.000246 | ms/batch 495.30 | loss 4.28 | ppl 72.591
| epoch 1 step 16000 | 16000 batches | lr 0.000246 | ms/batch 495.05 | loss 4.34 | ppl 76.342
----------------------------------------------------------------------------------------------------
| Eval 4 at step 16000 | time: 1991.43s | valid loss 4.22 | valid ppl 67.748
----------------------------------------------------------------------------------------------------
| epoch 1 step 16200 | 16200 batches | lr 0.000246 | ms/batch 557.94 | loss 4.23 | ppl 68.568
| epoch 1 step 16400 | 16400 batches | lr 0.000246 | ms/batch 495.36 | loss 4.30 | ppl 73.653
| epoch 1 step 16600 | 16600 batches | lr 0.000246 | ms/batch 496.68 | loss 4.33 | ppl 76.010
| epoch 1 step 16800 | 16800 batches | lr 0.000246 | ms/batch 494.43 | loss 4.27 | ppl 71.756
| epoch 1 step 17000 | 17000 batches | lr 0.000246 | ms/batch 495.25 | loss 4.31 | ppl 74.142
| epoch 1 step 17200 | 17200 batches | lr 0.000245 | ms/batch 493.69 | loss 4.23 | ppl 68.605
| epoch 1 step 17400 | 17400 batches | lr 0.000245 | ms/batch 496.04 | loss 4.28 | ppl 71.968
| epoch 1 step 17600 | 17600 batches | lr 0.000245 | ms/batch 494.80 | loss 4.28 | ppl 71.954
| epoch 1 step 17800 | 17800 batches | lr 0.000245 | ms/batch 496.22 | loss 4.29 | ppl 73.030
| epoch 1 step 18000 | 18000 batches | lr 0.000245 | ms/batch 494.77 | loss 4.28 | ppl 72.048
| epoch 1 step 18200 | 18200 batches | lr 0.000245 | ms/batch 496.33 | loss 4.29 | ppl 72.879
| epoch 1 step 18400 | 18400 batches | lr 0.000245 | ms/batch 494.48 | loss 4.27 | ppl 71.578
| epoch 1 step 18600 | 18600 batches | lr 0.000245 | ms/batch 495.50 | loss 4.31 | ppl 74.375
| epoch 1 step 18800 | 18800 batches | lr 0.000245 | ms/batch 494.74 | loss 4.23 | ppl 68.390
| epoch 1 step 19000 | 19000 batches | lr 0.000244 | ms/batch 494.98 | loss 4.25 | ppl 70.149
| epoch 1 step 19200 | 19200 batches | lr 0.000244 | ms/batch 494.33 | loss 4.29 | ppl 72.882
| epoch 1 step 19400 | 19400 batches | lr 0.000244 | ms/batch 495.57 | loss 4.19 | ppl 66.004
| epoch 1 step 19600 | 19600 batches | lr 0.000244 | ms/batch 495.22 | loss 4.26 | ppl 70.812
| epoch 1 step 19800 | 19800 batches | lr 0.000244 | ms/batch 493.93 | loss 4.26 | ppl 70.826
| epoch 1 step 20000 | 20000 batches | lr 0.000244 | ms/batch 495.25 | loss 4.23 | ppl 68.646
----------------------------------------------------------------------------------------------------
| Eval 5 at step 20000 | time: 1990.91s | valid loss 4.13 | valid ppl 61.969
----------------------------------------------------------------------------------------------------
| epoch 1 step 20200 | 20200 batches | lr 0.000244 | ms/batch 559.08 | loss 4.23 | ppl 68.973
| epoch 1 step 20400 | 20400 batches | lr 0.000244 | ms/batch 494.76 | loss 4.27 | ppl 71.687
| epoch 1 step 20600 | 20600 batches | lr 0.000244 | ms/batch 495.68 | loss 4.21 | ppl 67.664
| epoch 1 step 20800 | 20800 batches | lr 0.000243 | ms/batch 494.72 | loss 4.26 | ppl 70.515
| epoch 1 step 21000 | 21000 batches | lr 0.000243 | ms/batch 494.79 | loss 4.21 | ppl 67.453
| epoch 1 step 21200 | 21200 batches | lr 0.000243 | ms/batch 497.26 | loss 4.28 | ppl 72.279
| epoch 1 step 21400 | 21400 batches | lr 0.000243 | ms/batch 495.60 | loss 4.23 | ppl 68.897
| epoch 1 step 21600 | 21600 batches | lr 0.000243 | ms/batch 495.42 | loss 4.22 | ppl 68.008
| epoch 1 step 21800 | 21800 batches | lr 0.000243 | ms/batch 494.56 | loss 4.22 | ppl 67.926
| epoch 1 step 22000 | 22000 batches | lr 0.000243 | ms/batch 495.34 | loss 4.22 | ppl 67.941
| epoch 1 step 22200 | 22200 batches | lr 0.000242 | ms/batch 494.26 | loss 4.18 | ppl 65.468
| epoch 1 step 22400 | 22400 batches | lr 0.000242 | ms/batch 497.13 | loss 4.30 | ppl 74.059
| epoch 1 step 22600 | 22600 batches | lr 0.000242 | ms/batch 495.71 | loss 4.25 | ppl 70.022
| epoch 1 step 22800 | 22800 batches | lr 0.000242 | ms/batch 496.27 | loss 4.21 | ppl 67.276
| epoch 1 step 23000 | 23000 batches | lr 0.000242 | ms/batch 495.30 | loss 4.18 | ppl 65.076
| epoch 1 step 23200 | 23200 batches | lr 0.000242 | ms/batch 495.81 | loss 4.22 | ppl 67.702
| epoch 1 step 23400 | 23400 batches | lr 0.000242 | ms/batch 496.40 | loss 4.28 | ppl 72.323
| epoch 1 step 23600 | 23600 batches | lr 0.000242 | ms/batch 496.13 | loss 4.20 | ppl 66.390
| epoch 1 step 23800 | 23800 batches | lr 0.000241 | ms/batch 496.34 | loss 4.18 | ppl 65.382
| epoch 1 step 24000 | 24000 batches | lr 0.000241 | ms/batch 494.92 | loss 4.20 | ppl 66.459
----------------------------------------------------------------------------------------------------
| Eval 6 at step 24000 | time: 1992.92s | valid loss 4.04 | valid ppl 57.046
----------------------------------------------------------------------------------------------------
| epoch 1 step 24200 | 24200 batches | lr 0.000241 | ms/batch 559.25 | loss 4.21 | ppl 67.595
| epoch 1 step 24400 | 24400 batches | lr 0.000241 | ms/batch 496.58 | loss 4.23 | ppl 68.785
| epoch 1 step 24600 | 24600 batches | lr 0.000241 | ms/batch 495.04 | loss 4.18 | ppl 65.233
| epoch 1 step 24800 | 24800 batches | lr 0.000241 | ms/batch 494.46 | loss 4.09 | ppl 59.573
| epoch 1 step 25000 | 25000 batches | lr 0.00024 | ms/batch 495.32 | loss 4.15 | ppl 63.328
| epoch 1 step 25200 | 25200 batches | lr 0.00024 | ms/batch 495.26 | loss 4.20 | ppl 66.452
| epoch 1 step 25400 | 25400 batches | lr 0.00024 | ms/batch 494.97 | loss 4.18 | ppl 65.285
| epoch 1 step 25600 | 25600 batches | lr 0.00024 | ms/batch 495.18 | loss 4.19 | ppl 66.063
| epoch 1 step 25800 | 25800 batches | lr 0.00024 | ms/batch 495.22 | loss 4.12 | ppl 61.621
| epoch 1 step 26000 | 26000 batches | lr 0.00024 | ms/batch 495.01 | loss 4.13 | ppl 62.086
| epoch 1 step 26200 | 26200 batches | lr 0.00024 | ms/batch 494.93 | loss 4.17 | ppl 64.889
| epoch 1 step 26400 | 26400 batches | lr 0.000239 | ms/batch 495.86 | loss 4.15 | ppl 63.361
| epoch 1 step 26600 | 26600 batches | lr 0.000239 | ms/batch 496.34 | loss 4.18 | ppl 65.286
| epoch 1 step 26800 | 26800 batches | lr 0.000239 | ms/batch 494.61 | loss 4.12 | ppl 61.841
| epoch 1 step 27000 | 27000 batches | lr 0.000239 | ms/batch 494.51 | loss 4.14 | ppl 62.551
| epoch 1 step 27200 | 27200 batches | lr 0.000239 | ms/batch 494.92 | loss 4.06 | ppl 58.149
| epoch 1 step 27400 | 27400 batches | lr 0.000239 | ms/batch 495.30 | loss 4.09 | ppl 60.002
| epoch 1 step 27600 | 27600 batches | lr 0.000238 | ms/batch 494.81 | loss 4.14 | ppl 62.604
| epoch 1 step 27800 | 27800 batches | lr 0.000238 | ms/batch 495.34 | loss 4.14 | ppl 62.556
| epoch 1 step 28000 | 28000 batches | lr 0.000238 | ms/batch 494.90 | loss 4.14 | ppl 62.946
----------------------------------------------------------------------------------------------------
| Eval 7 at step 28000 | time: 1991.22s | valid loss 3.97 | valid ppl 53.117
----------------------------------------------------------------------------------------------------
| epoch 1 step 28200 | 28200 batches | lr 0.000238 | ms/batch 558.32 | loss 4.21 | ppl 67.249
| epoch 1 step 28400 | 28400 batches | lr 0.000238 | ms/batch 495.47 | loss 4.16 | ppl 63.903
| epoch 1 step 28600 | 28600 batches | lr 0.000238 | ms/batch 496.60 | loss 4.16 | ppl 64.322
| epoch 1 step 28800 | 28800 batches | lr 0.000237 | ms/batch 496.06 | loss 4.10 | ppl 60.291
| epoch 1 step 29000 | 29000 batches | lr 0.000237 | ms/batch 495.81 | loss 4.09 | ppl 59.745
| epoch 1 step 29200 | 29200 batches | lr 0.000237 | ms/batch 494.19 | loss 4.10 | ppl 60.189
| epoch 1 step 29400 | 29400 batches | lr 0.000237 | ms/batch 495.97 | loss 4.18 | ppl 65.042
| epoch 1 step 29600 | 29600 batches | lr 0.000237 | ms/batch 494.26 | loss 4.09 | ppl 59.554
| epoch 1 step 29800 | 29800 batches | lr 0.000237 | ms/batch 496.95 | loss 4.15 | ppl 63.361
| epoch 1 step 30000 | 30000 batches | lr 0.000236 | ms/batch 495.55 | loss 4.19 | ppl 66.002
| epoch 1 step 30200 | 30200 batches | lr 0.000236 | ms/batch 496.41 | loss 4.16 | ppl 64.291
| epoch 1 step 30400 | 30400 batches | lr 0.000236 | ms/batch 494.15 | loss 4.10 | ppl 60.164
| epoch 1 step 30600 | 30600 batches | lr 0.000236 | ms/batch 496.83 | loss 4.18 | ppl 65.406
| epoch 1 step 30800 | 30800 batches | lr 0.000236 | ms/batch 496.48 | loss 4.12 | ppl 61.410
| epoch 1 step 31000 | 31000 batches | lr 0.000235 | ms/batch 495.90 | loss 4.13 | ppl 62.201
| epoch 1 step 31200 | 31200 batches | lr 0.000235 | ms/batch 496.49 | loss 4.05 | ppl 57.348
| epoch 1 step 31400 | 31400 batches | lr 0.000235 | ms/batch 496.77 | loss 4.17 | ppl 64.479
| epoch 1 step 31600 | 31600 batches | lr 0.000235 | ms/batch 494.30 | loss 4.10 | ppl 60.633
| epoch 1 step 31800 | 31800 batches | lr 0.000235 | ms/batch 496.40 | loss 4.13 | ppl 62.229
| epoch 1 step 32000 | 32000 batches | lr 0.000235 | ms/batch 495.74 | loss 4.10 | ppl 60.596
----------------------------------------------------------------------------------------------------
| Eval 8 at step 32000 | time: 1993.61s | valid loss 3.93 | valid ppl 51.052
----------------------------------------------------------------------------------------------------
| epoch 1 step 32200 | 32200 batches | lr 0.000234 | ms/batch 557.24 | loss 4.10 | ppl 60.198
| epoch 1 step 32400 | 32400 batches | lr 0.000234 | ms/batch 495.39 | loss 4.05 | ppl 57.633
| epoch 1 step 32600 | 32600 batches | lr 0.000234 | ms/batch 493.81 | loss 4.05 | ppl 57.556
| epoch 1 step 32800 | 32800 batches | lr 0.000234 | ms/batch 493.47 | loss 3.98 | ppl 53.459
| epoch 1 step 33000 | 33000 batches | lr 0.000234 | ms/batch 495.80 | loss 4.07 | ppl 58.451
| epoch 1 step 33200 | 33200 batches | lr 0.000233 | ms/batch 494.68 | loss 4.12 | ppl 61.372
| epoch 1 step 33400 | 33400 batches | lr 0.000233 | ms/batch 493.90 | loss 4.02 | ppl 55.470
| epoch 1 step 33600 | 33600 batches | lr 0.000233 | ms/batch 494.79 | loss 4.07 | ppl 58.691
| epoch 1 step 33800 | 33800 batches | lr 0.000233 | ms/batch 494.25 | loss 4.10 | ppl 60.156
| epoch 1 step 34000 | 34000 batches | lr 0.000233 | ms/batch 495.29 | loss 4.09 | ppl 59.498
| epoch 1 step 34200 | 34200 batches | lr 0.000232 | ms/batch 495.41 | loss 4.12 | ppl 61.711
| epoch 1 step 34400 | 34400 batches | lr 0.000232 | ms/batch 495.29 | loss 4.10 | ppl 60.081
| epoch 1 step 34600 | 34600 batches | lr 0.000232 | ms/batch 495.62 | loss 4.07 | ppl 58.368
| epoch 1 step 34800 | 34800 batches | lr 0.000232 | ms/batch 494.25 | loss 3.97 | ppl 53.031
| epoch 1 step 35000 | 35000 batches | lr 0.000232 | ms/batch 494.85 | loss 4.03 | ppl 56.116
| epoch 1 step 35200 | 35200 batches | lr 0.000231 | ms/batch 496.65 | loss 4.05 | ppl 57.620
| epoch 1 step 35400 | 35400 batches | lr 0.000231 | ms/batch 497.25 | loss 4.12 | ppl 61.276
| epoch 1 step 35600 | 35600 batches | lr 0.000231 | ms/batch 495.56 | loss 4.07 | ppl 58.342
| epoch 1 step 35800 | 35800 batches | lr 0.000231 | ms/batch 494.66 | loss 4.06 | ppl 57.992
| epoch 1 step 36000 | 36000 batches | lr 0.000231 | ms/batch 494.61 | loss 4.02 | ppl 55.884
----------------------------------------------------------------------------------------------------
| Eval 9 at step 36000 | time: 1990.56s | valid loss 3.88 | valid ppl 48.337
----------------------------------------------------------------------------------------------------
| epoch 1 step 36200 | 36200 batches | lr 0.00023 | ms/batch 557.43 | loss 4.09 | ppl 59.777
| epoch 1 step 36400 | 36400 batches | lr 0.00023 | ms/batch 494.63 | loss 4.09 | ppl 59.574
| epoch 1 step 36600 | 36600 batches | lr 0.00023 | ms/batch 495.90 | loss 4.07 | ppl 58.375
| epoch 1 step 36800 | 36800 batches | lr 0.00023 | ms/batch 496.20 | loss 4.10 | ppl 60.599
| epoch 1 step 37000 | 37000 batches | lr 0.000229 | ms/batch 494.16 | loss 4.01 | ppl 55.261
| epoch 1 step 37200 | 37200 batches | lr 0.000229 | ms/batch 494.10 | loss 4.01 | ppl 55.018
| epoch 1 step 37400 | 37400 batches | lr 0.000229 | ms/batch 495.78 | loss 4.07 | ppl 58.684
| epoch 1 step 37600 | 37600 batches | lr 0.000229 | ms/batch 496.04 | loss 4.07 | ppl 58.281
| epoch 1 step 37800 | 37800 batches | lr 0.000229 | ms/batch 494.88 | loss 4.06 | ppl 58.206
| epoch 1 step 38000 | 38000 batches | lr 0.000228 | ms/batch 494.31 | loss 4.01 | ppl 55.040
| epoch 1 step 38200 | 38200 batches | lr 0.000228 | ms/batch 495.86 | loss 4.05 | ppl 57.358
| epoch 1 step 38400 | 38400 batches | lr 0.000228 | ms/batch 495.35 | loss 4.06 | ppl 57.934
| epoch 1 step 38600 | 38600 batches | lr 0.000228 | ms/batch 495.88 | loss 4.07 | ppl 58.393
| epoch 1 step 38800 | 38800 batches | lr 0.000227 | ms/batch 496.58 | loss 4.05 | ppl 57.265
| epoch 1 step 39000 | 39000 batches | lr 0.000227 | ms/batch 495.65 | loss 4.04 | ppl 56.597
| epoch 1 step 39200 | 39200 batches | lr 0.000227 | ms/batch 495.46 | loss 4.07 | ppl 58.739
| epoch 1 step 39400 | 39400 batches | lr 0.000227 | ms/batch 495.33 | loss 4.03 | ppl 56.401
| epoch 1 step 39600 | 39600 batches | lr 0.000227 | ms/batch 495.20 | loss 3.94 | ppl 51.659
| epoch 1 step 39800 | 39800 batches | lr 0.000226 | ms/batch 495.55 | loss 4.00 | ppl 54.441
| epoch 1 step 40000 | 40000 batches | lr 0.000226 | ms/batch 495.02 | loss 3.95 | ppl 51.827
----------------------------------------------------------------------------------------------------
| Eval 10 at step 40000 | time: 1991.55s | valid loss 3.84 | valid ppl 46.715
----------------------------------------------------------------------------------------------------
| epoch 1 step 40200 | 40200 batches | lr 0.000226 | ms/batch 558.01 | loss 4.08 | ppl 59.363
| epoch 1 step 40400 | 40400 batches | lr 0.000226 | ms/batch 496.85 | loss 4.04 | ppl 56.787
| epoch 1 step 40600 | 40600 batches | lr 0.000225 | ms/batch 495.07 | loss 4.01 | ppl 55.161
| epoch 1 step 40800 | 40800 batches | lr 0.000225 | ms/batch 494.20 | loss 4.03 | ppl 56.332
| epoch 1 step 41000 | 41000 batches | lr 0.000225 | ms/batch 495.88 | loss 4.06 | ppl 57.847
| epoch 1 step 41200 | 41200 batches | lr 0.000225 | ms/batch 495.97 | loss 4.05 | ppl 57.524
| epoch 1 step 41400 | 41400 batches | lr 0.000224 | ms/batch 496.80 | loss 4.08 | ppl 59.077
| epoch 1 step 41600 | 41600 batches | lr 0.000224 | ms/batch 495.89 | loss 4.06 | ppl 58.196
| epoch 1 step 41800 | 41800 batches | lr 0.000224 | ms/batch 496.28 | loss 4.03 | ppl 56.345
| epoch 1 step 42000 | 42000 batches | lr 0.000224 | ms/batch 495.94 | loss 4.02 | ppl 55.501
| epoch 1 step 42200 | 42200 batches | lr 0.000224 | ms/batch 496.29 | loss 4.06 | ppl 58.231
| epoch 1 step 42400 | 42400 batches | lr 0.000223 | ms/batch 495.29 | loss 3.97 | ppl 52.860
| epoch 1 step 42600 | 42600 batches | lr 0.000223 | ms/batch 496.21 | loss 4.06 | ppl 57.864
| epoch 1 step 42800 | 42800 batches | lr 0.000223 | ms/batch 496.03 | loss 4.04 | ppl 56.727
| epoch 1 step 43000 | 43000 batches | lr 0.000223 | ms/batch 495.90 | loss 3.98 | ppl 53.271
| epoch 2 step 43200 | 188 batches | lr 0.000222 | ms/batch 493.50 | loss 3.92 | ppl 50.609
| epoch 2 step 43400 | 388 batches | lr 0.000222 | ms/batch 495.47 | loss 4.01 | ppl 55.227
| epoch 2 step 43600 | 588 batches | lr 0.000222 | ms/batch 495.32 | loss 3.94 | ppl 51.605
| epoch 2 step 43800 | 788 batches | lr 0.000222 | ms/batch 495.97 | loss 3.97 | ppl 53.222
| epoch 2 step 44000 | 988 batches | lr 0.000221 | ms/batch 495.38 | loss 3.97 | ppl 52.989
----------------------------------------------------------------------------------------------------
| Eval 11 at step 44000 | time: 1993.15s | valid loss 3.83 | valid ppl 46.109
----------------------------------------------------------------------------------------------------
| epoch 2 step 44200 | 1188 batches | lr 0.000221 | ms/batch 559.07 | loss 3.97 | ppl 53.248
| epoch 2 step 44400 | 1388 batches | lr 0.000221 | ms/batch 494.68 | loss 3.96 | ppl 52.667
| epoch 2 step 44600 | 1588 batches | lr 0.000221 | ms/batch 494.35 | loss 3.96 | ppl 52.444
| epoch 2 step 44800 | 1788 batches | lr 0.00022 | ms/batch 494.91 | loss 3.93 | ppl 51.038
| epoch 2 step 45000 | 1988 batches | lr 0.00022 | ms/batch 494.94 | loss 3.87 | ppl 47.784
| epoch 2 step 45200 | 2188 batches | lr 0.00022 | ms/batch 495.42 | loss 4.00 | ppl 54.509
| epoch 2 step 45400 | 2388 batches | lr 0.00022 | ms/batch 494.41 | loss 3.93 | ppl 51.073
| epoch 2 step 45600 | 2588 batches | lr 0.000219 | ms/batch 495.58 | loss 3.96 | ppl 52.590
| epoch 2 step 45800 | 2788 batches | lr 0.000219 | ms/batch 496.16 | loss 4.02 | ppl 55.726
| epoch 2 step 46000 | 2988 batches | lr 0.000219 | ms/batch 496.33 | loss 4.00 | ppl 54.416
| epoch 2 step 46200 | 3188 batches | lr 0.000219 | ms/batch 495.82 | loss 3.98 | ppl 53.610
| epoch 2 step 46400 | 3388 batches | lr 0.000218 | ms/batch 496.96 | loss 3.93 | ppl 51.086
| epoch 2 step 46600 | 3588 batches | lr 0.000218 | ms/batch 495.24 | loss 3.90 | ppl 49.277
| epoch 2 step 46800 | 3788 batches | lr 0.000218 | ms/batch 495.73 | loss 3.98 | ppl 53.441
| epoch 2 step 47000 | 3988 batches | lr 0.000217 | ms/batch 493.89 | loss 3.91 | ppl 50.059
| epoch 2 step 47200 | 4188 batches | lr 0.000217 | ms/batch 494.70 | loss 3.96 | ppl 52.327
| epoch 2 step 47400 | 4388 batches | lr 0.000217 | ms/batch 494.54 | loss 3.90 | ppl 49.602
| epoch 2 step 47600 | 4588 batches | lr 0.000217 | ms/batch 495.68 | loss 3.96 | ppl 52.449
| epoch 2 step 47800 | 4788 batches | lr 0.000216 | ms/batch 496.07 | loss 3.97 | ppl 53.221
| epoch 2 step 48000 | 4988 batches | lr 0.000216 | ms/batch 495.15 | loss 3.95 | ppl 52.156
----------------------------------------------------------------------------------------------------
| Eval 12 at step 48000 | time: 1991.81s | valid loss 3.78 | valid ppl 43.968
----------------------------------------------------------------------------------------------------
| epoch 2 step 48200 | 5188 batches | lr 0.000216 | ms/batch 558.46 | loss 3.92 | ppl 50.446
| epoch 2 step 48400 | 5388 batches | lr 0.000216 | ms/batch 494.79 | loss 3.99 | ppl 53.853
| epoch 2 step 48600 | 5588 batches | lr 0.000215 | ms/batch 496.17 | loss 3.96 | ppl 52.632
| epoch 2 step 48800 | 5788 batches | lr 0.000215 | ms/batch 495.78 | loss 3.99 | ppl 54.164
| epoch 2 step 49000 | 5988 batches | lr 0.000215 | ms/batch 493.52 | loss 3.91 | ppl 49.760
| epoch 2 step 49200 | 6188 batches | lr 0.000214 | ms/batch 495.37 | loss 4.00 | ppl 54.738
| epoch 2 step 49400 | 6388 batches | lr 0.000214 | ms/batch 495.41 | loss 3.97 | ppl 52.745
| epoch 2 step 49600 | 6588 batches | lr 0.000214 | ms/batch 495.17 | loss 3.93 | ppl 50.900
| epoch 2 step 49800 | 6788 batches | lr 0.000214 | ms/batch 494.18 | loss 3.95 | ppl 52.117
| epoch 2 step 50000 | 6988 batches | lr 0.000213 | ms/batch 495.74 | loss 3.94 | ppl 51.518
| epoch 2 step 50200 | 7188 batches | lr 0.000213 | ms/batch 495.40 | loss 3.88 | ppl 48.458
| epoch 2 step 50400 | 7388 batches | lr 0.000213 | ms/batch 495.16 | loss 3.99 | ppl 53.891
| epoch 2 step 50600 | 7588 batches | lr 0.000213 | ms/batch 494.60 | loss 3.92 | ppl 50.577
| epoch 2 step 50800 | 7788 batches | lr 0.000212 | ms/batch 496.40 | loss 4.00 | ppl 54.519
| epoch 2 step 51000 | 7988 batches | lr 0.000212 | ms/batch 496.47 | loss 3.93 | ppl 51.114
| epoch 2 step 51200 | 8188 batches | lr 0.000212 | ms/batch 494.86 | loss 3.89 | ppl 49.058
| epoch 2 step 51400 | 8388 batches | lr 0.000211 | ms/batch 494.24 | loss 3.91 | ppl 50.115
| epoch 2 step 51600 | 8588 batches | lr 0.000211 | ms/batch 495.53 | loss 3.92 | ppl 50.483
| epoch 2 step 51800 | 8788 batches | lr 0.000211 | ms/batch 495.57 | loss 3.93 | ppl 50.802
| epoch 2 step 52000 | 8988 batches | lr 0.000211 | ms/batch 496.46 | loss 3.94 | ppl 51.316
----------------------------------------------------------------------------------------------------
| Eval 13 at step 52000 | time: 1991.74s | valid loss 3.76 | valid ppl 42.893
----------------------------------------------------------------------------------------------------
| epoch 2 step 52200 | 9188 batches | lr 0.00021 | ms/batch 559.02 | loss 3.91 | ppl 49.795
| epoch 2 step 52400 | 9388 batches | lr 0.00021 | ms/batch 495.80 | loss 3.90 | ppl 49.541
| epoch 2 step 52600 | 9588 batches | lr 0.00021 | ms/batch 496.53 | loss 3.94 | ppl 51.324
| epoch 2 step 52800 | 9788 batches | lr 0.000209 | ms/batch 497.30 | loss 3.94 | ppl 51.604
| epoch 2 step 53000 | 9988 batches | lr 0.000209 | ms/batch 495.88 | loss 3.91 | ppl 49.936
| epoch 2 step 53200 | 10188 batches | lr 0.000209 | ms/batch 493.70 | loss 3.90 | ppl 49.410
| epoch 2 step 53400 | 10388 batches | lr 0.000209 | ms/batch 495.61 | loss 3.94 | ppl 51.630
| epoch 2 step 53600 | 10588 batches | lr 0.000208 | ms/batch 496.78 | loss 3.97 | ppl 53.035
| epoch 2 step 53800 | 10788 batches | lr 0.000208 | ms/batch 496.68 | loss 4.02 | ppl 55.639
| epoch 2 step 54000 | 10988 batches | lr 0.000208 | ms/batch 495.55 | loss 3.88 | ppl 48.318
| epoch 2 step 54200 | 11188 batches | lr 0.000207 | ms/batch 495.30 | loss 3.93 | ppl 51.040
| epoch 2 step 54400 | 11388 batches | lr 0.000207 | ms/batch 496.20 | loss 3.95 | ppl 52.192
| epoch 2 step 54600 | 11588 batches | lr 0.000207 | ms/batch 495.87 | loss 3.91 | ppl 50.140
| epoch 2 step 54800 | 11788 batches | lr 0.000206 | ms/batch 494.11 | loss 3.88 | ppl 48.454
| epoch 2 step 55000 | 11988 batches | lr 0.000206 | ms/batch 495.70 | loss 3.91 | ppl 49.908
| epoch 2 step 55200 | 12188 batches | lr 0.000206 | ms/batch 496.02 | loss 3.82 | ppl 45.716
| epoch 2 step 55400 | 12388 batches | lr 0.000206 | ms/batch 495.38 | loss 3.97 | ppl 53.052
| epoch 2 step 55600 | 12588 batches | lr 0.000205 | ms/batch 495.23 | loss 3.89 | ppl 48.827
| epoch 2 step 55800 | 12788 batches | lr 0.000205 | ms/batch 494.48 | loss 3.87 | ppl 47.978
| epoch 2 step 56000 | 12988 batches | lr 0.000205 | ms/batch 496.10 | loss 3.86 | ppl 47.343
----------------------------------------------------------------------------------------------------
| Eval 14 at step 56000 | time: 1993.32s | valid loss 3.74 | valid ppl 42.019
----------------------------------------------------------------------------------------------------
| epoch 2 step 56200 | 13188 batches | lr 0.000204 | ms/batch 558.95 | loss 3.88 | ppl 48.464
| epoch 2 step 56400 | 13388 batches | lr 0.000204 | ms/batch 495.17 | loss 3.84 | ppl 46.385
| epoch 2 step 56600 | 13588 batches | lr 0.000204 | ms/batch 496.78 | loss 3.96 | ppl 52.693
| epoch 2 step 56800 | 13788 batches | lr 0.000203 | ms/batch 494.41 | loss 3.86 | ppl 47.471
| epoch 2 step 57000 | 13988 batches | lr 0.000203 | ms/batch 494.72 | loss 3.84 | ppl 46.647
| epoch 2 step 57200 | 14188 batches | lr 0.000203 | ms/batch 495.21 | loss 3.86 | ppl 47.382
| epoch 2 step 57400 | 14388 batches | lr 0.000203 | ms/batch 495.05 | loss 3.92 | ppl 50.358
| epoch 2 step 57600 | 14588 batches | lr 0.000202 | ms/batch 495.93 | loss 3.86 | ppl 47.396
| epoch 2 step 57800 | 14788 batches | lr 0.000202 | ms/batch 497.39 | loss 3.96 | ppl 52.268
| epoch 2 step 58000 | 14988 batches | lr 0.000202 | ms/batch 497.55 | loss 3.99 | ppl 54.254
| epoch 2 step 58200 | 15188 batches | lr 0.000201 | ms/batch 495.94 | loss 3.86 | ppl 47.381
| epoch 2 step 58400 | 15388 batches | lr 0.000201 | ms/batch 495.54 | loss 3.89 | ppl 48.709
| epoch 2 step 58600 | 15588 batches | lr 0.000201 | ms/batch 495.59 | loss 3.86 | ppl 47.269
| epoch 2 step 58800 | 15788 batches | lr 0.0002 | ms/batch 495.87 | loss 3.86 | ppl 47.621
| epoch 2 step 59000 | 15988 batches | lr 0.0002 | ms/batch 495.48 | loss 3.89 | ppl 49.104
| epoch 2 step 59200 | 16188 batches | lr 0.0002 | ms/batch 495.65 | loss 3.82 | ppl 45.749
| epoch 2 step 59400 | 16388 batches | lr 0.000199 | ms/batch 495.87 | loss 3.86 | ppl 47.636
| epoch 2 step 59600 | 16588 batches | lr 0.000199 | ms/batch 497.22 | loss 3.90 | ppl 49.258
| epoch 2 step 59800 | 16788 batches | lr 0.000199 | ms/batch 495.60 | loss 3.85 | ppl 46.765
| epoch 2 step 60000 | 16988 batches | lr 0.000198 | ms/batch 496.19 | loss 3.89 | ppl 48.787
----------------------------------------------------------------------------------------------------
| Eval 15 at step 60000 | time: 1993.97s | valid loss 3.71 | valid ppl 41.041
----------------------------------------------------------------------------------------------------
| epoch 2 step 60200 | 17188 batches | lr 0.000198 | ms/batch 557.08 | loss 3.83 | ppl 46.022
| epoch 2 step 60400 | 17388 batches | lr 0.000198 | ms/batch 496.48 | loss 3.88 | ppl 48.226
| epoch 2 step 60600 | 17588 batches | lr 0.000198 | ms/batch 495.33 | loss 3.88 | ppl 48.220
| epoch 2 step 60800 | 17788 batches | lr 0.000197 | ms/batch 496.65 | loss 3.88 | ppl 48.574
| epoch 2 step 61000 | 17988 batches | lr 0.000197 | ms/batch 495.47 | loss 3.88 | ppl 48.547
| epoch 2 step 61200 | 18188 batches | lr 0.000197 | ms/batch 496.48 | loss 3.90 | ppl 49.269
| epoch 2 step 61400 | 18388 batches | lr 0.000196 | ms/batch 494.84 | loss 3.88 | ppl 48.486
| epoch 2 step 61600 | 18588 batches | lr 0.000196 | ms/batch 496.53 | loss 3.93 | ppl 50.705
| epoch 2 step 61800 | 18788 batches | lr 0.000196 | ms/batch 495.09 | loss 3.83 | ppl 46.292
| epoch 2 step 62000 | 18988 batches | lr 0.000195 | ms/batch 495.88 | loss 3.87 | ppl 47.904
| epoch 2 step 62200 | 19188 batches | lr 0.000195 | ms/batch 494.84 | loss 3.93 | ppl 51.085
| epoch 2 step 62400 | 19388 batches | lr 0.000195 | ms/batch 496.22 | loss 3.81 | ppl 45.232
| epoch 2 step 62600 | 19588 batches | lr 0.000194 | ms/batch 496.07 | loss 3.89 | ppl 48.883
| epoch 2 step 62800 | 19788 batches | lr 0.000194 | ms/batch 494.66 | loss 3.90 | ppl 49.350
| epoch 2 step 63000 | 19988 batches | lr 0.000194 | ms/batch 495.95 | loss 3.87 | ppl 47.748
| epoch 2 step 63200 | 20188 batches | lr 0.000193 | ms/batch 496.49 | loss 3.85 | ppl 47.091
| epoch 2 step 63400 | 20388 batches | lr 0.000193 | ms/batch 495.47 | loss 3.91 | ppl 49.997
| epoch 2 step 63600 | 20588 batches | lr 0.000193 | ms/batch 496.90 | loss 3.86 | ppl 47.537
| epoch 2 step 63800 | 20788 batches | lr 0.000192 | ms/batch 495.20 | loss 3.90 | ppl 49.301
| epoch 2 step 64000 | 20988 batches | lr 0.000192 | ms/batch 496.10 | loss 3.86 | ppl 47.668
----------------------------------------------------------------------------------------------------
| Eval 16 at step 64000 | time: 1993.39s | valid loss 3.71 | valid ppl 40.796
----------------------------------------------------------------------------------------------------
| epoch 2 step 64200 | 21188 batches | lr 0.000192 | ms/batch 560.14 | loss 3.92 | ppl 50.365
| epoch 2 step 64400 | 21388 batches | lr 0.000191 | ms/batch 496.35 | loss 3.89 | ppl 49.015
| epoch 2 step 64600 | 21588 batches | lr 0.000191 | ms/batch 496.19 | loss 3.87 | ppl 48.032
| epoch 2 step 64800 | 21788 batches | lr 0.000191 | ms/batch 495.09 | loss 3.86 | ppl 47.602
| epoch 2 step 65000 | 21988 batches | lr 0.00019 | ms/batch 496.32 | loss 3.88 | ppl 48.326
| epoch 2 step 65200 | 22188 batches | lr 0.00019 | ms/batch 495.11 | loss 3.83 | ppl 46.153
| epoch 2 step 65400 | 22388 batches | lr 0.00019 | ms/batch 497.54 | loss 3.98 | ppl 53.359
| epoch 2 step 65600 | 22588 batches | lr 0.000189 | ms/batch 496.12 | loss 3.90 | ppl 49.293
| epoch 2 step 65800 | 22788 batches | lr 0.000189 | ms/batch 496.86 | loss 3.88 | ppl 48.452
| epoch 2 step 66000 | 22988 batches | lr 0.000189 | ms/batch 495.92 | loss 3.85 | ppl 47.180
| epoch 2 step 66200 | 23188 batches | lr 0.000188 | ms/batch 495.95 | loss 3.86 | ppl 47.666
| epoch 2 step 66400 | 23388 batches | lr 0.000188 | ms/batch 496.74 | loss 3.94 | ppl 51.185
| epoch 2 step 66600 | 23588 batches | lr 0.000188 | ms/batch 496.43 | loss 3.89 | ppl 48.950
| epoch 2 step 66800 | 23788 batches | lr 0.000187 | ms/batch 496.86 | loss 3.85 | ppl 46.882
| epoch 2 step 67000 | 23988 batches | lr 0.000187 | ms/batch 495.38 | loss 3.87 | ppl 48.173
| epoch 2 step 67200 | 24188 batches | lr 0.000187 | ms/batch 495.30 | loss 3.89 | ppl 48.739
| epoch 2 step 67400 | 24388 batches | lr 0.000186 | ms/batch 497.35 | loss 3.91 | ppl 50.042
| epoch 2 step 67600 | 24588 batches | lr 0.000186 | ms/batch 495.12 | loss 3.86 | ppl 47.671
| epoch 2 step 67800 | 24788 batches | lr 0.000186 | ms/batch 495.23 | loss 3.79 | ppl 44.108
| epoch 2 step 68000 | 24988 batches | lr 0.000185 | ms/batch 495.67 | loss 3.83 | ppl 46.003
----------------------------------------------------------------------------------------------------
| Eval 17 at step 68000 | time: 1994.89s | valid loss 3.67 | valid ppl 39.391
----------------------------------------------------------------------------------------------------
| epoch 2 step 68200 | 25188 batches | lr 0.000185 | ms/batch 558.31 | loss 3.90 | ppl 49.376
| epoch 2 step 68400 | 25388 batches | lr 0.000185 | ms/batch 494.18 | loss 3.86 | ppl 47.690
| epoch 2 step 68600 | 25588 batches | lr 0.000184 | ms/batch 495.95 | loss 3.90 | ppl 49.344
| epoch 2 step 68800 | 25788 batches | lr 0.000184 | ms/batch 495.95 | loss 3.81 | ppl 45.296
| epoch 2 step 69000 | 25988 batches | lr 0.000183 | ms/batch 495.79 | loss 3.83 | ppl 45.854
| epoch 2 step 69200 | 26188 batches | lr 0.000183 | ms/batch 495.66 | loss 3.87 | ppl 47.957
| epoch 2 step 69400 | 26388 batches | lr 0.000183 | ms/batch 496.65 | loss 3.84 | ppl 46.513
| epoch 2 step 69600 | 26588 batches | lr 0.000182 | ms/batch 497.01 | loss 3.87 | ppl 48.017
| epoch 2 step 69800 | 26788 batches | lr 0.000182 | ms/batch 495.70 | loss 3.83 | ppl 46.210
| epoch 2 step 70000 | 26988 batches | lr 0.000182 | ms/batch 495.46 | loss 3.84 | ppl 46.512
| epoch 2 step 70200 | 27188 batches | lr 0.000181 | ms/batch 495.48 | loss 3.75 | ppl 42.498
| epoch 2 step 70400 | 27388 batches | lr 0.000181 | ms/batch 496.41 | loss 3.83 | ppl 45.953
| epoch 2 step 70600 | 27588 batches | lr 0.000181 | ms/batch 495.71 | loss 3.83 | ppl 45.834
| epoch 2 step 70800 | 27788 batches | lr 0.00018 | ms/batch 496.09 | loss 3.85 | ppl 46.892
| epoch 2 step 71000 | 27988 batches | lr 0.00018 | ms/batch 495.28 | loss 3.85 | ppl 46.962
| epoch 2 step 71200 | 28188 batches | lr 0.00018 | ms/batch 496.89 | loss 3.93 | ppl 51.107
| epoch 2 step 71400 | 28388 batches | lr 0.000179 | ms/batch 496.23 | loss 3.88 | ppl 48.250
| epoch 2 step 71600 | 28588 batches | lr 0.000179 | ms/batch 497.05 | loss 3.87 | ppl 47.972
| epoch 2 step 71800 | 28788 batches | lr 0.000179 | ms/batch 496.56 | loss 3.83 | ppl 46.151
| epoch 2 step 72000 | 28988 batches | lr 0.000178 | ms/batch 496.29 | loss 3.81 | ppl 45.268
----------------------------------------------------------------------------------------------------
| Eval 18 at step 72000 | time: 1994.51s | valid loss 3.65 | valid ppl 38.435
----------------------------------------------------------------------------------------------------
| epoch 2 step 72200 | 29188 batches | lr 0.000178 | ms/batch 558.09 | loss 3.81 | ppl 44.979
| epoch 2 step 72400 | 29388 batches | lr 0.000178 | ms/batch 496.79 | loss 3.88 | ppl 48.647
| epoch 2 step 72600 | 29588 batches | lr 0.000177 | ms/batch 495.04 | loss 3.82 | ppl 45.625
| epoch 2 step 72800 | 29788 batches | lr 0.000177 | ms/batch 497.27 | loss 3.86 | ppl 47.306
| epoch 2 step 73000 | 29988 batches | lr 0.000176 | ms/batch 496.16 | loss 3.93 | ppl 50.787
| epoch 2 step 73200 | 30188 batches | lr 0.000176 | ms/batch 496.80 | loss 3.90 | ppl 49.601
| epoch 2 step 73400 | 30388 batches | lr 0.000176 | ms/batch 494.70 | loss 3.80 | ppl 44.753
| epoch 2 step 73600 | 30588 batches | lr 0.000175 | ms/batch 497.47 | loss 3.90 | ppl 49.536
| epoch 2 step 73800 | 30788 batches | lr 0.000175 | ms/batch 497.58 | loss 3.86 | ppl 47.410
| epoch 2 step 74000 | 30988 batches | lr 0.000175 | ms/batch 496.34 | loss 3.85 | ppl 47.135
| epoch 2 step 74200 | 31188 batches | lr 0.000174 | ms/batch 497.49 | loss 3.78 | ppl 43.905
| epoch 2 step 74400 | 31388 batches | lr 0.000174 | ms/batch 497.79 | loss 3.91 | ppl 49.900
| epoch 2 step 74600 | 31588 batches | lr 0.000174 | ms/batch 495.27 | loss 3.84 | ppl 46.505
| epoch 2 step 74800 | 31788 batches | lr 0.000173 | ms/batch 497.41 | loss 3.86 | ppl 47.500
| epoch 2 step 75000 | 31988 batches | lr 0.000173 | ms/batch 496.86 | loss 3.83 | ppl 46.289
| epoch 2 step 75200 | 32188 batches | lr 0.000172 | ms/batch 496.06 | loss 3.83 | ppl 46.287
| epoch 2 step 75400 | 32388 batches | lr 0.000172 | ms/batch 496.56 | loss 3.82 | ppl 45.578
| epoch 2 step 75600 | 32588 batches | lr 0.000172 | ms/batch 495.09 | loss 3.78 | ppl 43.878
| epoch 2 step 75800 | 32788 batches | lr 0.000171 | ms/batch 493.97 | loss 3.72 | ppl 41.282
| epoch 2 step 76000 | 32988 batches | lr 0.000171 | ms/batch 497.01 | loss 3.81 | ppl 45.076
----------------------------------------------------------------------------------------------------
| Eval 19 at step 76000 | time: 1995.86s | valid loss 3.64 | valid ppl 37.967
----------------------------------------------------------------------------------------------------
| epoch 2 step 76200 | 33188 batches | lr 0.000171 | ms/batch 559.47 | loss 3.87 | ppl 47.730
| epoch 2 step 76400 | 33388 batches | lr 0.00017 | ms/batch 494.65 | loss 3.76 | ppl 43.043
| epoch 2 step 76600 | 33588 batches | lr 0.00017 | ms/batch 495.93 | loss 3.82 | ppl 45.561
| epoch 2 step 76800 | 33788 batches | lr 0.00017 | ms/batch 495.31 | loss 3.83 | ppl 46.026
| epoch 2 step 77000 | 33988 batches | lr 0.000169 | ms/batch 496.22 | loss 3.83 | ppl 46.269
| epoch 2 step 77200 | 34188 batches | lr 0.000169 | ms/batch 496.25 | loss 3.88 | ppl 48.310
| epoch 2 step 77400 | 34388 batches | lr 0.000168 | ms/batch 496.43 | loss 3.85 | ppl 46.997
| epoch 2 step 77600 | 34588 batches | lr 0.000168 | ms/batch 496.44 | loss 3.82 | ppl 45.519
| epoch 2 step 77800 | 34788 batches | lr 0.000168 | ms/batch 494.99 | loss 3.71 | ppl 40.734
| epoch 2 step 78000 | 34988 batches | lr 0.000167 | ms/batch 495.23 | loss 3.77 | ppl 43.317
| epoch 2 step 78200 | 35188 batches | lr 0.000167 | ms/batch 497.83 | loss 3.82 | ppl 45.826
| epoch 2 step 78400 | 35388 batches | lr 0.000167 | ms/batch 498.12 | loss 3.86 | ppl 47.646
| epoch 2 step 78600 | 35588 batches | lr 0.000166 | ms/batch 496.14 | loss 3.81 | ppl 45.088
| epoch 2 step 78800 | 35788 batches | lr 0.000166 | ms/batch 495.79 | loss 3.84 | ppl 46.738
| epoch 2 step 79000 | 35988 batches | lr 0.000165 | ms/batch 495.83 | loss 3.77 | ppl 43.227
| epoch 2 step 79200 | 36188 batches | lr 0.000165 | ms/batch 495.80 | loss 3.85 | ppl 47.068
| epoch 2 step 79400 | 36388 batches | lr 0.000165 | ms/batch 495.60 | loss 3.84 | ppl 46.464
| epoch 2 step 79600 | 36588 batches | lr 0.000164 | ms/batch 497.03 | loss 3.82 | ppl 45.647
| epoch 2 step 79800 | 36788 batches | lr 0.000164 | ms/batch 496.96 | loss 3.85 | ppl 47.228
| epoch 2 step 80000 | 36988 batches | lr 0.000164 | ms/batch 495.58 | loss 3.78 | ppl 43.972
----------------------------------------------------------------------------------------------------
| Eval 20 at step 80000 | time: 1994.91s | valid loss 3.62 | valid ppl 37.500
----------------------------------------------------------------------------------------------------
| epoch 2 step 80200 | 37188 batches | lr 0.000163 | ms/batch 557.73 | loss 3.76 | ppl 42.794
| epoch 2 step 80400 | 37388 batches | lr 0.000163 | ms/batch 496.61 | loss 3.83 | ppl 45.896
| epoch 2 step 80600 | 37588 batches | lr 0.000163 | ms/batch 497.21 | loss 3.84 | ppl 46.540
| epoch 2 step 80800 | 37788 batches | lr 0.000162 | ms/batch 495.64 | loss 3.84 | ppl 46.398
| epoch 2 step 81000 | 37988 batches | lr 0.000162 | ms/batch 495.25 | loss 3.76 | ppl 42.921
| epoch 2 step 81200 | 38188 batches | lr 0.000161 | ms/batch 495.68 | loss 3.82 | ppl 45.510
| epoch 2 step 81400 | 38388 batches | lr 0.000161 | ms/batch 495.53 | loss 3.83 | ppl 46.083
| epoch 2 step 81600 | 38588 batches | lr 0.000161 | ms/batch 496.25 | loss 3.83 | ppl 46.291
| epoch 2 step 81800 | 38788 batches | lr 0.00016 | ms/batch 496.61 | loss 3.82 | ppl 45.570
| epoch 2 step 82000 | 38988 batches | lr 0.00016 | ms/batch 496.02 | loss 3.80 | ppl 44.920
| epoch 2 step 82200 | 39188 batches | lr 0.000159 | ms/batch 495.76 | loss 3.84 | ppl 46.452
| epoch 2 step 82400 | 39388 batches | lr 0.000159 | ms/batch 496.01 | loss 3.80 | ppl 44.545
| epoch 2 step 82600 | 39588 batches | lr 0.000159 | ms/batch 496.40 | loss 3.72 | ppl 41.278
| epoch 2 step 82800 | 39788 batches | lr 0.000158 | ms/batch 495.66 | loss 3.76 | ppl 43.083
| epoch 2 step 83000 | 39988 batches | lr 0.000158 | ms/batch 496.01 | loss 3.72 | ppl 41.378
| epoch 2 step 83200 | 40188 batches | lr 0.000158 | ms/batch 496.24 | loss 3.84 | ppl 46.708
| epoch 2 step 83400 | 40388 batches | lr 0.000157 | ms/batch 497.08 | loss 3.82 | ppl 45.394
| epoch 2 step 83600 | 40588 batches | lr 0.000157 | ms/batch 496.06 | loss 3.78 | ppl 44.016
| epoch 2 step 83800 | 40788 batches | lr 0.000156 | ms/batch 494.97 | loss 3.80 | ppl 44.576
| epoch 2 step 84000 | 40988 batches | lr 0.000156 | ms/batch 496.15 | loss 3.83 | ppl 46.051
----------------------------------------------------------------------------------------------------
| Eval 21 at step 84000 | time: 1994.58s | valid loss 3.60 | valid ppl 36.595
----------------------------------------------------------------------------------------------------
| epoch 2 step 84200 | 41188 batches | lr 0.000156 | ms/batch 559.50 | loss 3.82 | ppl 45.687
| epoch 2 step 84400 | 41388 batches | lr 0.000155 | ms/batch 497.53 | loss 3.85 | ppl 46.897
| epoch 2 step 84600 | 41588 batches | lr 0.000155 | ms/batch 496.57 | loss 3.84 | ppl 46.431
| epoch 2 step 84800 | 41788 batches | lr 0.000155 | ms/batch 497.04 | loss 3.81 | ppl 45.316
| epoch 2 step 85000 | 41988 batches | lr 0.000154 | ms/batch 496.63 | loss 3.78 | ppl 43.897
| epoch 2 step 85200 | 42188 batches | lr 0.000154 | ms/batch 496.84 | loss 3.83 | ppl 46.270
| epoch 2 step 85400 | 42388 batches | lr 0.000153 | ms/batch 496.37 | loss 3.76 | ppl 42.737
| epoch 2 step 85600 | 42588 batches | lr 0.000153 | ms/batch 496.77 | loss 3.83 | ppl 46.015
| epoch 2 step 85800 | 42788 batches | lr 0.000153 | ms/batch 496.07 | loss 3.82 | ppl 45.791
| epoch 2 step 86000 | 42988 batches | lr 0.000152 | ms/batch 496.34 | loss 3.75 | ppl 42.588
| epoch 3 step 86200 | 176 batches | lr 0.000152 | ms/batch 493.99 | loss 3.73 | ppl 41.515
| epoch 3 step 86400 | 376 batches | lr 0.000152 | ms/batch 495.98 | loss 3.81 | ppl 45.185
| epoch 3 step 86600 | 576 batches | lr 0.000151 | ms/batch 495.77 | loss 3.74 | ppl 42.030
| epoch 3 step 86800 | 776 batches | lr 0.000151 | ms/batch 496.16 | loss 3.77 | ppl 43.477
| epoch 3 step 87000 | 976 batches | lr 0.00015 | ms/batch 496.08 | loss 3.77 | ppl 43.371
| epoch 3 step 87200 | 1176 batches | lr 0.00015 | ms/batch 496.94 | loss 3.77 | ppl 43.194
| epoch 3 step 87400 | 1376 batches | lr 0.00015 | ms/batch 494.88 | loss 3.75 | ppl 42.443
| epoch 3 step 87600 | 1576 batches | lr 0.000149 | ms/batch 494.32 | loss 3.75 | ppl 42.433
| epoch 3 step 87800 | 1776 batches | lr 0.000149 | ms/batch 495.20 | loss 3.74 | ppl 42.272
| epoch 3 step 88000 | 1976 batches | lr 0.000148 | ms/batch 495.05 | loss 3.63 | ppl 37.799
----------------------------------------------------------------------------------------------------
| Eval 22 at step 88000 | time: 1994.60s | valid loss 3.58 | valid ppl 35.821
----------------------------------------------------------------------------------------------------
| epoch 3 step 88200 | 2176 batches | lr 0.000148 | ms/batch 559.31 | loss 3.80 | ppl 44.774
| epoch 3 step 88400 | 2376 batches | lr 0.000148 | ms/batch 495.26 | loss 3.73 | ppl 41.521
| epoch 3 step 88600 | 2576 batches | lr 0.000147 | ms/batch 496.26 | loss 3.75 | ppl 42.511
| epoch 3 step 88800 | 2776 batches | lr 0.000147 | ms/batch 496.64 | loss 3.81 | ppl 45.369
| epoch 3 step 89000 | 2976 batches | lr 0.000146 | ms/batch 496.30 | loss 3.79 | ppl 44.406
| epoch 3 step 89200 | 3176 batches | lr 0.000146 | ms/batch 496.92 | loss 3.80 | ppl 44.569
| epoch 3 step 89400 | 3376 batches | lr 0.000146 | ms/batch 497.01 | loss 3.72 | ppl 41.090
| epoch 3 step 89600 | 3576 batches | lr 0.000145 | ms/batch 495.41 | loss 3.69 | ppl 39.899
| epoch 3 step 89800 | 3776 batches | lr 0.000145 | ms/batch 496.24 | loss 3.77 | ppl 43.355
| epoch 3 step 90000 | 3976 batches | lr 0.000145 | ms/batch 494.38 | loss 3.72 | ppl 41.354
| epoch 3 step 90200 | 4176 batches | lr 0.000144 | ms/batch 494.91 | loss 3.75 | ppl 42.478
| epoch 3 step 90400 | 4376 batches | lr 0.000144 | ms/batch 495.14 | loss 3.69 | ppl 40.209
| epoch 3 step 90600 | 4576 batches | lr 0.000143 | ms/batch 495.99 | loss 3.77 | ppl 43.557
| epoch 3 step 90800 | 4776 batches | lr 0.000143 | ms/batch 496.55 | loss 3.77 | ppl 43.368
| epoch 3 step 91000 | 4976 batches | lr 0.000143 | ms/batch 495.50 | loss 3.77 | ppl 43.277
| epoch 3 step 91200 | 5176 batches | lr 0.000142 | ms/batch 496.48 | loss 3.70 | ppl 40.385
| epoch 3 step 91400 | 5376 batches | lr 0.000142 | ms/batch 496.20 | loss 3.79 | ppl 44.400
| epoch 3 step 91600 | 5576 batches | lr 0.000141 | ms/batch 497.60 | loss 3.78 | ppl 43.714
| epoch 3 step 91800 | 5776 batches | lr 0.000141 | ms/batch 497.35 | loss 3.78 | ppl 43.996
| epoch 3 step 92000 | 5976 batches | lr 0.000141 | ms/batch 495.02 | loss 3.71 | ppl 40.849
----------------------------------------------------------------------------------------------------
| Eval 23 at step 92000 | time: 1994.78s | valid loss 3.56 | valid ppl 35.327
----------------------------------------------------------------------------------------------------
| epoch 3 step 92200 | 6176 batches | lr 0.00014 | ms/batch 559.91 | loss 3.81 | ppl 45.102
| epoch 3 step 92400 | 6376 batches | lr 0.00014 | ms/batch 496.58 | loss 3.77 | ppl 43.505
| epoch 3 step 92600 | 6576 batches | lr 0.000139 | ms/batch 496.09 | loss 3.73 | ppl 41.623
| epoch 3 step 92800 | 6776 batches | lr 0.000139 | ms/batch 495.74 | loss 3.75 | ppl 42.649
| epoch 3 step 93000 | 6976 batches | lr 0.000139 | ms/batch 496.82 | loss 3.76 | ppl 42.786
| epoch 3 step 93200 | 7176 batches | lr 0.000138 | ms/batch 496.71 | loss 3.69 | ppl 40.153
| epoch 3 step 93400 | 7376 batches | lr 0.000138 | ms/batch 496.65 | loss 3.78 | ppl 43.863
| epoch 3 step 93600 | 7576 batches | lr 0.000138 | ms/batch 496.07 | loss 3.73 | ppl 41.884
| epoch 3 step 93800 | 7776 batches | lr 0.000137 | ms/batch 497.40 | loss 3.81 | ppl 44.955
| epoch 3 step 94000 | 7976 batches | lr 0.000137 | ms/batch 497.29 | loss 3.74 | ppl 42.176
----------------------------------------------------------------------------------------------------
Exiting from training early
====================================================================================================
| End of training | test loss 3.62 | test ppl 37.262
====================================================================================================