TalwalkarLab / leaf

Leaf: A Benchmark for Federated Settings


Testing and training data overlap in Shakespeare

zliangak opened this issue

Using the splitting method provided in paper_experiment, I found that the testing data appears exactly in the training data, resulting in 100% testing accuracy if you use SGD and a 2-layer LSTM to train on it.

For example, in the training set, the words of 'THE_FIRST_PART_OF_HENRY_THE_SIXTH_MORTIMER' are:
[..., g age, Let dying Mortimer here rest himself. Even like a man new haled from the ',
' age, Let dying Mortimer here rest himself. Even like a man new haled from the r',
'age, Let dying Mortimer here rest himself. Even like a man new haled from the ra',
'ge, Let dying Mortimer here rest himself. Even like a man new haled from the rac',
'e, Let dying Mortimer here rest himself. Even like a man new haled from the rack',
', Let dying Mortimer here rest himself. Even like a man new haled from the rack,',
'Let dying Mortimer here rest himself. Even like a man new haled from the rack, S',
'et dying Mortimer here rest himself. Even like a man new haled from the rack, So',
't dying Mortimer here rest himself. Even like a man new haled from the rack, So ',
' dying Mortimer here rest himself. Even like a man new haled from the rack, So f',
'dying Mortimer here rest himself. Even like a man new haled from the rack, So fa',
'ying Mortimer here rest himself. Even like a man new haled from the rack, So far',
'ing Mortimer here rest himself. Even like a man new haled from the rack, So fare',
'ng Mortimer here rest himself. Even like a man new haled from the rack, So fare ',
'g Mortimer here rest himself. Even like a man new haled from the rack, So fare m',...]

and in the testing set, you can find the exact sentence:

[' Let dying Mortimer here rest himself. Even like a man new haled from the rack, ']

My model reaches a testing accuracy of about 1.0 in roughly 35 epochs.
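To make the leakage mechanism concrete, here is a minimal, self-contained sketch of it (illustrative only, not LEAF's actual preprocessing code; seq_len is shortened from 80 to 20 so the toy text yields enough windows):

```python
import random

# Stride-1 sliding window over one user's text: consecutive samples
# share seq_len - 1 characters.
text = ("Let dying Mortimer here rest himself. "
        "Even like a man new haled from the rack,")
seq_len = 20  # LEAF uses 80; shortened so this toy text yields many windows
samples = [text[i:i + seq_len] for i in range(len(text) - seq_len + 1)]

# A random split then scatters near-duplicate windows across both sets.
random.seed(0)
random.shuffle(samples)
cut = int(0.9 * len(samples))
train, test = samples[:cut], samples[cut:]

# Almost every test window overlaps some training window in seq_len - 1
# characters, so memorizing the training set gives near-perfect test accuracy.
leaked = sum(any(t[1:] in tr or t[:-1] in tr for tr in train) for t in test)
print(f"{leaked}/{len(test)} test windows overlap a training window")
```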

You are right. The split should be temporal rather than random, and we should be careful to avoid this kind of leakage. We will work on fixing this.

Can you provide us with the parameters that you used to obtain the testing accuracy (learning rate, size of the layers, etc.)?

Thank you.

I am using PyTorch. All the info is shown in the file below. One difference in my model is that I feed all the hidden units (instead of only the last one) into the linear layer. When I use your implementation of the LSTM, i.e. feeding only the last hidden unit, I can still get a pretty high accuracy given enough training epochs.

Hope this can be fixed soon. It is a very helpful dataset. Thanks.

test.pdf
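For readers who skip the attachment, here is a hedged PyTorch sketch of the two readout variants being compared; the hyperparameters follow LEAF's reference stacked LSTM (embedding size 8, two 256-unit layers, seq_len 80) and are assumptions with respect to what is actually in test.pdf:

```python
import torch
import torch.nn as nn

class CharLSTM(nn.Module):
    """2-layer character LSTM with a switchable readout (illustrative sizes)."""

    def __init__(self, vocab_size=80, embed_dim=8, hidden_dim=256,
                 seq_len=80, all_steps=False):
        super().__init__()
        self.all_steps = all_steps
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                            batch_first=True)
        in_dim = hidden_dim * seq_len if all_steps else hidden_dim
        self.fc = nn.Linear(in_dim, vocab_size)

    def forward(self, x):                      # x: (batch, seq_len) of char ids
        out, _ = self.lstm(self.embed(x))      # out: (batch, seq_len, hidden)
        if self.all_steps:
            # variant from this thread: feed every timestep's hidden state
            return self.fc(out.flatten(start_dim=1))
        # LEAF-style readout: only the last timestep's hidden state
        return self.fc(out[:, -1, :])

logits = CharLSTM(all_steps=True)(torch.randint(0, 80, (4, 80)))
print(logits.shape)  # torch.Size([4, 80])
```

On the overlapping split, either readout ends up effectively memorizing the test windows, which matches the near-perfect accuracies reported above.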

Sorry, there is a mistake in my implementation in cell 6 of "test.pdf": when I define test_loader, I should use dataset=test_set instead of dataset=train_set.
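For clarity, the corrected loader would look like this (the dataset definitions are stand-ins for the ones in the notebook):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins for the notebook's Shakespeare train/test splits.
train_set = TensorDataset(torch.randint(0, 80, (8, 80)))
test_set = TensorDataset(torch.randint(0, 80, (4, 80)))

# Corrected cell-6 definition: evaluate on test_set, not train_set.
test_loader = DataLoader(dataset=test_set, batch_size=4, shuffle=False)
```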

This does not affect the existence of the problem reported in this issue.

Regards,

Sorry, I'm confused by your update. Does this mean the issue remains?

Yes, the issue remains.

I have modified the train/test splits for Shakespeare. They are now split temporally, and samples that would leak any test information into the training set are discarded. This means that, if the last training sample starts at index i, the first test sample starts at index i + seq_len. We use a seq_len of 80.
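A minimal sketch of the described split, assuming stride-1 windows per user (the helper name and the 90/10 fraction are illustrative, not LEAF's actual code):

```python
def temporal_split(windows, seq_len=80, train_frac=0.9):
    """Temporally split one user's stride-1 sliding windows.

    If the last training window starts at index i, the first test window
    starts at index i + seq_len, so no character is shared across splits.
    """
    last_train = int(train_frac * len(windows)) - 1
    train = windows[:last_train + 1]
    # skip the seq_len - 1 windows that still overlap the training text
    test = windows[last_train + seq_len:]
    return train, test

train, test = temporal_split(list(range(1000)))
print(len(train), len(test))  # 900 21
```

For short texts the test slice can come out empty, which leads to the side effect below.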

A side effect of this change is that some users no longer have any test samples and have to be dropped from training.

Cool! Thanks a lot!