carpedm20 / lstm-char-cnn-tensorflow



Validation perplexity is 146.71 at the end of training (24 epochs)

ygoncharov opened this issue

(it should get ~82 on valid and ~79 on test)

$ python main.py --dataset ptb

.....

epoch: [24] [ 250/ 265] loss: 3.466149
Valid: loss: 5.225354, perplexity: 185.927017
{'perplexity': 83.749542031012467, 'epoch': 24, 'valid_perplexity': 146.71359295576036, 'learning_rate': 0.5}
[] Saving checkpoints...
Test: loss: 4.836956, perplexity: 126.084908
[] Test loss: 4.954320, perplexity: 141.786226
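
(For reference, the perplexities in this log are just the exponentials of the corresponding losses:)

    import math

    # perplexity = exp(average cross-entropy loss) for each line above
    print(math.exp(5.225354))  # ~185.93
    print(math.exp(4.836956))  # ~126.08
    print(math.exp(4.954320))  # ~141.79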

I'm working on this issue, and I don't think the current implementation differs from the original model. I checked the model's validity by comparing the losses of a single batch during the early epochs, and there were no differences. I also checked that the perplexity on the training set goes down to 90.

(figure: loss)

One thing I'm working on is changing the evaluation procedure, which differs from the original. The original code computes perplexity over all the test data in a single forward pass, whereas this repo evaluates the test data the same way as the training data, i.e., as a batch-averaged perplexity. Fixing this should reduce the reported perplexity somewhat, but I'm not sure it will make the results comparable.
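
(To illustrate the distinction, a minimal sketch with made-up numbers, not this repo's actual evaluation code: one common form of batch-averaged perplexity is the mean of per-batch exp(loss), while corpus-level perplexity exponentiates the token-weighted mean loss over the whole test set.)

    import numpy as np

    # Hypothetical per-batch mean cross-entropy losses and token counts.
    batch_losses = np.array([4.9, 5.1, 5.4, 5.0])
    batch_tokens = np.array([700, 700, 700, 350])

    # Batch-averaged perplexity: average the per-batch perplexities.
    ppl_batch_avg = np.mean(np.exp(batch_losses))

    # Corpus-level perplexity: exponentiate the token-weighted mean loss.
    ppl_corpus = np.exp(np.sum(batch_losses * batch_tokens) / np.sum(batch_tokens))

    print(ppl_batch_avg, ppl_corpus)  # the two generally differ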

If you find any other differences, feel free to share them with me 😄

Cool stuff!
I noticed on the README that you are using 100/150 hidden units for the small/large models respectively. I actually use 300/650 hidden units, so this might explain the difference in performance. Also, it seems like you are using RMSProp? I've found vanilla SGD with a starting learning rate of 1.0 (halved every time the perplexity does not improve on the dev set) to work much better than other optimization methods, including RMSProp.

Hope this helps.
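
(For reference, a minimal sketch of that decay heuristic, with hypothetical numbers and variable names rather than this repo's training loop: start at a learning rate of 1.0 and halve it whenever validation perplexity fails to improve.)

    # Hypothetical validation perplexities per epoch, just to drive the schedule.
    valid_ppl_per_epoch = [190.0, 160.0, 150.0, 151.0, 148.0, 149.0, 147.5]

    learning_rate = 1.0
    best_valid_ppl = float("inf")
    for epoch, valid_ppl in enumerate(valid_ppl_per_epoch, start=1):
        # ... run one epoch of plain SGD with `learning_rate` here ...
        if valid_ppl >= best_valid_ppl:
            learning_rate *= 0.5  # halve when dev perplexity does not improve
        else:
            best_valid_ppl = valid_ppl
        print(epoch, valid_ppl, learning_rate)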

@yoonkim Hi! Thanks for sharing your great work; I enjoyed the paper a lot! Actually, the README was an old version that I forgot to update (now fixed), and the code already uses the same hidden units, optimizer, and decay that you mentioned.

Ah, ok! A few other things it may be:

  • batch size
  • parameter initialization (see the sketch below)
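
(On the second point, one concrete thing to compare is the weight initialization range. If I recall correctly, the original Torch code draws all parameters uniformly from a small interval, around ±0.05; please verify against the original source. A hedged TF1-style sketch of that check, with hypothetical variable names and shapes:)

    import tensorflow as tf

    # Assumed small uniform init, mirroring the original Torch code (verify there).
    param_init = 0.05
    initializer = tf.random_uniform_initializer(-param_init, param_init)

    # Hypothetical variable; the point is only the init range, not the shape.
    char_embedding = tf.get_variable("char_embedding", [50, 15],
                                     initializer=initializer)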

Thanks! I'll dig into those things. By the way, how was the perplexity on the training set after training finished?

I think it should be a lot lower. I don't recall the numbers exactly, but since the dataset is small and the model has a lot of capacity (even with dropout), training PPL should be well below 50.

@carpedm20 Hi,
Did you find any possible pointers on this issue of high test perplexity? I was trying to debug it, and any help would be appreciated.

@carpedm20 Hello, thanks for sharing your code on GitHub. I also noticed that the problem of high perplexity on the PTB test set is still unresolved. Have you had a chance to look into this issue, or do you have any pointers for fixing it? Thanks in advance.

@nileshkulkarni @yss4 No, I haven't found the cause of the problem yet, and I'm not working on this project now. But if you find any code that differs from the original paper, please share it and I'll take a look.

@carpedm20 This implementation is NOT identical to the original.

Interested readers can have a look at my code here:
https://github.com/mkroutikov/tf-lstm-char-cnn
which does reproduce Yoon Kim's result in TF.

I ran the code yesterday and got an averaged validation PPL of 156.097 and an averaged test PPL of 149.565. So I am reading your code and the original. The first difference I found is the criterion: yours is CE while the original uses NLL. Does it matter?
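
(On the CE vs. NLL point: softmax cross-entropy and the negative log-likelihood of the log-softmax output are the same quantity, so that naming difference by itself should not change the numbers. A quick sanity check with hypothetical logits:)

    import numpy as np

    # Hypothetical logits for one target token (index 2 of a 5-word vocab).
    logits = np.array([1.2, -0.3, 2.5, 0.1, -1.0])
    target = 2

    # Softmax cross-entropy for the target class...
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    cross_entropy = -np.log(probs[target])

    # ...equals the negative log-likelihood under log-softmax.
    log_probs = logits - logits.max() - np.log(np.sum(np.exp(logits - logits.max())))
    nll = -log_probs[target]

    print(cross_entropy, nll)  # identical up to floating-point error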

Thanks for sharing your code. I want to know how I can train a model at the word level. I found your code has settings like use_char = True and use_word = False. Is it enough to set use_word = True? Looking forward to your answer, thank you.