tensorflow / skflow

Simplified interface for TensorFlow (mimicking Scikit Learn) for Deep Learning

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

zero indicates end-of-sequence and unknown symbol in VocabularyProcessor

mheilman opened this issue · comments

Currently, in the VocabularyProcessor for text input, zero is used both as a padding symbol and also as the unknown word symbol.

This seems problematic since the padding symbol would be useful for inferring the sequence length (cf. #141).

It seems like it would be better to have the unknown token map to 1 (or maybe vocab_size) instead.

You are right. I'll update to 1 (I would prefer to have unknown token stable between different datasets etc - in case of special handling).

Note: code right now in the move to tensorflow.contrib.skflow - so I'll update already there.

Hi, it looks like this is still an issue in tensorflow.contrib.learn. Shall I file an issue (or PR) there?

I think this could be addressed by changing this line to self._mapping = {None: 0, unknown_token: 1} (note: just having {unknown_token: 1} wouldn't work because the indices for new terms are chosen during fitting by len(self._mapping)).

Example using tensorflow 0.8.0:

In [115]: vp = VocabularyProcessor(10)

In [116]: vp.fit(["a dog ran in the park"])
Out[116]: <tensorflow.contrib.learn.python.learn.preprocessing.text.VocabularyProcessor at 0x118a3ac18>

In [117]: list(vp.transform(["a dog ran in the park"]))
Out[117]: [array([1, 2, 3, 4, 5, 6, 0, 0, 0, 0])]

In [118]: list(vp.transform(["a cat ran in the park"]))
Out[118]: [array([1, 0, 3, 4, 5, 6, 0, 0, 0, 0])]

Yes, This is still an issue. I want to know, Will someone change this code in future? Or They are happy with the current implementation of treating unknown tokens and padding as same?

Sorry, this code was removed from Tensorflow in favor of processing with either tools in scikit-learn or future tools in Tensorflow.