Using test data for training

Keramatfar opened this issue 7 years ago

Thanks to the author.
In the section "Working with bag of words", the algorithm uses all the data to build the vocabulary:
vocab_processor = learn.preprocessing.VocabularyProcessor(sentence_size, min_frequency=min_word_freq)
vocab_processor.fit_transform(texts)
But it may not be correct to use the test data when training the model.

Nick · Answer 1 · Thu Mar 22 2018 04:08:06 GMT+0800 (China Standard Time)

Hi @Keramatfar, thanks for the question. I should add a section explaining why we can use the whole dataset here. I'll add a formal explanation to the notebook during my code rewrite over the next few months (so I'll keep this issue open for now).

But a short explanation is that the word vector methods do not use the target information to train the embeddings, so you can think of them as a kind of "unsupervised" method. Technically, the word vector methods are supervised, but they generate their own labeled targets from the text itself, as a sequence of tokens within a token window. They do not use the problem-specific y-targets (the document categories) to train the embeddings, which is why they can use the whole text. Using the whole text also lets us observe the full vocabulary in the data, which increases the observed word counts.
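
As a rough illustration of the point about labels and word counts, here is a minimal sketch in plain Python. It does not use `VocabularyProcessor`; `build_vocab` and the toy texts are hypothetical stand-ins, and the only assumption carried over from the notebook is a minimum-frequency cutoff like `min_frequency`.

```python
from collections import Counter

def build_vocab(texts, min_frequency=1):
    """Hypothetical helper: give an id to every word seen at least min_frequency times.

    Only the raw texts are read; no labels are passed in. Id 0 is left free
    for unknown words.
    """
    counts = Counter(word for text in texts for word in text.lower().split())
    kept = sorted(word for word, n in counts.items() if n >= min_frequency)
    return {word: i for i, word in enumerate(kept, start=1)}

# Toy data (made up for this sketch).
train_texts = ["free prize now", "see you at lunch"]
test_texts = ["claim your free prize", "lunch at noon"]

# Vocabulary from the training texts only, then from all texts.
vocab_train = build_vocab(train_texts, min_frequency=2)
vocab_all = build_vocab(train_texts + test_texts, min_frequency=2)
```

With these toy texts, no word reaches the cutoff in the training texts alone, while four words do once the test texts are counted as well. The labels play no part in either call.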

I hope that helps.

Keramatfar · Answer 2 · Thu Mar 22 2018 18:05:32 GMT+0800 (China Standard Time)

@nfmcclure, in the real world, when training a model we have access to neither the test texts nor the test labels.
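
The setup described here, where the test texts are unavailable while the model is built, could be sketched by fitting the vocabulary on the training texts only and mapping any unseen test word to a reserved id. This is a plain-Python sketch, not the notebook's code; `fit_vocab`, `transform`, and the toy texts are hypothetical, and using id 0 for both padding and unknown words is a simplification.

```python
def fit_vocab(texts):
    """Hypothetical helper: assign ids from 1 upward in order of first appearance."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

def transform(texts, vocab, sentence_size):
    """Map words to ids (0 if unseen), then truncate or zero-pad to sentence_size."""
    rows = []
    for text in texts:
        ids = [vocab.get(word, 0) for word in text.lower().split()][:sentence_size]
        rows.append(ids + [0] * (sentence_size - len(ids)))
    return rows

# Toy data (made up for this sketch).
train_texts = ["free prize now", "see you at lunch"]
test_texts = ["claim your free prize"]

vocab = fit_vocab(train_texts)                 # the test texts are never seen here
train_ids = transform(train_texts, vocab, sentence_size=4)
test_ids = transform(test_texts, vocab, sentence_size=4)
```

In this sketch the test words "claim" and "your" fall back to id 0 because they never occurred in the training texts, while "free" and "prize" keep the ids they were given during fitting.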

Nick · Answer 3 · Sat Mar 24 2018 00:54:33 GMT+0800 (China Standard Time)

Hi @Keramatfar,

I'm still not sure what the problem is. In any problem, real or not, you have a set of data (just one set).

Then you split the dataset yourself into training and test sets. You create these two sets manually from the single dataset that you have.

In most ML problems, you train the algorithm on the 'training set' and test it on the 'test set'. Again, both of these sets come from the original set that you decide how to split up.
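
A minimal sketch of such a split, assuming a hypothetical `train_test_split` helper written in plain Python (the name, the 20% test fraction, and the toy data are illustrative choices, not taken from the notebook):

```python
import random

def train_test_split(texts, labels, test_fraction=0.2, seed=0):
    """Hypothetical helper: shuffle the indices once, then cut them into two disjoint sets."""
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [(texts[i], labels[i]) for i in train_idx]
    test = [(texts[i], labels[i]) for i in test_idx]
    return train, test

# Toy data (made up for this sketch): one dataset, split by us.
texts = ["free prize now", "see you at lunch", "win cash today",
         "call me later", "urgent offer inside"]
labels = ["spam", "ham", "spam", "ham", "spam"]

train, test = train_test_split(texts, labels, test_fraction=0.2)
```

Every example lands in exactly one of the two sets, so the test set stays untouched while the model is fit on the training set.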

Here is a similar question with some good responses that I recommend reading through:
https://stats.stackexchange.com/questions/19048/what-is-the-difference-between-test-set-and-validation-set

If you have further questions about the code itself, bugs you have observed, or any features you want to see (with specifics), feel free to bring those up in a separate issue.

GitHub issues are not meant to address general math or high-level machine learning concepts. I'm going to close this issue for now.