zero indicates end-of-sequence and unknown symbol in VocabularyProcessor

Question

mheilman opened this issue 8 years ago

Currently, in the VocabularyProcessor for text input, zero is used both as the padding symbol and as the unknown-word symbol.

This seems problematic since the padding symbol would be useful for inferring the sequence length (cf. #141).

It seems like it would be better to have the unknown token map to 1 (or maybe vocab_size) instead.
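To illustrate the concern, here is a minimal sketch in plain Python (not TensorFlow code). The helper `infer_length` and the example id lists are hypothetical, made up for illustration. It assumes the length is inferred by stripping trailing padding ids, and shows that an unknown word at the end of a sentence is silently dropped when it shares id 0 with padding.

```python
def infer_length(ids, pad_id=0):
    """Return the number of ids that precede the trailing run of pad_id."""
    length = len(ids)
    while length > 0 and ids[length - 1] == pad_id:
        length -= 1
    return length


# Hypothetical encodings of a six-word sentence whose last word is unknown,
# padded to length 10.
# Shared scheme: unknown and padding are both 0.
shared = [1, 2, 3, 4, 5, 0, 0, 0, 0, 0]
# Separate scheme (assumed): padding is 0, unknown is 1, known words start at 2.
separate = [2, 3, 4, 5, 6, 1, 0, 0, 0, 0]

print(infer_length(shared))    # 5, one short of the true length
print(infer_length(separate))  # 6
```

With a separate unknown id, the trailing unknown word still counts toward the sequence length.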

Illia Polosukhin · Answer 1 · Fri Mar 11 2016 08:53:07 GMT+0800 (China Standard Time)

You are right. I'll update it to 1 (I would prefer the unknown token to be stable across different datasets, in case it needs special handling).

Note: the code is currently being moved to tensorflow.contrib.skflow, so I'll make the update there.

Michael Heilman · Answer 2 · Fri Apr 29 2016 02:25:31 GMT+0800 (China Standard Time)

Hi, it looks like this is still an issue in tensorflow.contrib.learn. Shall I file an issue (or PR) there?

I think this could be addressed by changing this line to self._mapping = {None: 0, unknown_token: 1} (note: just having {unknown_token: 1} wouldn't work because the indices for new terms are chosen during fitting by len(self._mapping)).

Example using tensorflow 0.8.0:

In [115]: vp = VocabularyProcessor(10)

In [116]: vp.fit(["a dog ran in the park"])
Out[116]: <tensorflow.contrib.learn.python.learn.preprocessing.text.VocabularyProcessor at 0x118a3ac18>

In [117]: list(vp.transform(["a dog ran in the park"]))
Out[117]: [array([1, 2, 3, 4, 5, 6, 0, 0, 0, 0])]

In [118]: list(vp.transform(["a cat ran in the park"]))
Out[118]: [array([1, 0, 3, 4, 5, 6, 0, 0, 0, 0])]
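For comparison, here is a rough sketch of how the proposed mapping might behave. This is not the tensorflow.contrib.learn implementation: the class `TinyVocab`, its methods, and the `"<UNK>"` token are hypothetical stand-ins. The only parts taken from the suggestion above are the initial mapping `{None: 0, unknown_token: 1}` and the choice of new indices via `len(self._mapping)`.

```python
class TinyVocab:
    """Toy vocabulary: index 0 is reserved for padding, index 1 for unknown words."""

    def __init__(self, unknown_token="<UNK>"):
        self._unknown_token = unknown_token
        # The None key only reserves index 0, so that len(self._mapping)
        # starts assigning new words at index 2.
        self._mapping = {None: 0, unknown_token: 1}

    def add(self, token):
        """Assign the next free index to a token seen during fitting."""
        if token not in self._mapping:
            self._mapping[token] = len(self._mapping)
        return self._mapping[token]

    def get(self, token):
        """Look up a token, falling back to the unknown-token index."""
        return self._mapping.get(token, self._mapping[self._unknown_token])

    def transform(self, text, max_len):
        """Map whitespace-separated words to ids, truncated or zero-padded to max_len."""
        ids = [self.get(tok) for tok in text.split()][:max_len]
        return ids + [0] * (max_len - len(ids))


vocab = TinyVocab()
for tok in "a dog ran in the park".split():
    vocab.add(tok)

known = vocab.transform("a dog ran in the park", 10)
unknown = vocab.transform("a cat ran in the park", 10)
print(known)    # [2, 3, 4, 5, 6, 7, 0, 0, 0, 0]
print(unknown)  # [2, 1, 4, 5, 6, 7, 0, 0, 0, 0]
```

In this sketch the unknown word "cat" maps to 1, so it can no longer be confused with the padding zeros at the end.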

Alok Nayak · Answer 3 · Tue Oct 25 2016 21:02:46 GMT+0800 (China Standard Time)

Yes, this is still an issue. Will someone change this code in the future, or are the maintainers happy with the current implementation, which treats unknown tokens and padding as the same?

Illia Polosukhin · Answer 4 · Wed Oct 26 2016 12:28:54 GMT+0800 (China Standard Time)

Sorry, this code was removed from TensorFlow in favor of preprocessing with either the tools in scikit-learn or future tools in TensorFlow.