Generating vocabulary only from the training set

Question

sxs4337 opened this issue 8 years ago

The vocabulary should be generated using only the training data.
Currently, the function at
https://github.com/tsenghungchen/SA-tensorflow/blob/master/Att.py#L370 takes "captions" as input, and "captions" is built from all of the data (train + val + test).
Ideally, the network should not be fed any words from the test set: any word that first appears at test time should simply be mapped to <unknown_word> for evaluation.
Thanks.
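
To illustrate what I mean, here is a minimal sketch. It is not the code from Att.py: `build_vocab`, `encode`, `min_count`, and the toy captions are hypothetical names and data I made up for this example, and whitespace tokenization is an assumption. The idea is only that the vocabulary is counted over the training captions, and anything outside it falls back to <unknown_word>.

```python
from collections import Counter

# Token used for out-of-vocabulary words (name taken from the issue text).
UNK = "<unknown_word>"


def build_vocab(train_captions, min_count=1):
    """Build a word -> index map from TRAINING captions only (hypothetical helper)."""
    counts = Counter(w for c in train_captions for w in c.lower().split())
    words = sorted(w for w, n in counts.items() if n >= min_count)
    word_to_idx = {UNK: 0}  # index 0 is reserved for unknown words
    for w in words:
        word_to_idx[w] = len(word_to_idx)
    return word_to_idx


def encode(caption, word_to_idx):
    """Map a caption to indices; words not seen in training map to UNK."""
    unk = word_to_idx[UNK]
    return [word_to_idx.get(w, unk) for w in caption.lower().split()]


# Toy data, assumed for illustration only.
train_captions = ["a man is cooking", "a dog is running"]
test_captions = ["a cat is running"]

vocab = build_vocab(train_captions)  # val/test captions are never passed in
encoded_test = [encode(c, vocab) for c in test_captions]
```

With this split, "cat" never enters the vocabulary, so it is encoded with the index reserved for <unknown_word> at evaluation time.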

Paul Chen · Answer 1 · Wed Aug 17 2016 21:40:51 GMT+0800 (China Standard Time)

Yeah, you're right. Thank you for pointing it out. I'll update the code.