seq2seq guide: tokenizer crashes
genekogan opened this issue
Getting this error on `source_tokenizer.fit_on_texts(en_texts)`:
```
TypeError                                 Traceback (most recent call last)
in ()
      4
      5 source_tokenizer = Tokenizer(max_vocab_size, filters=filter_chars)
----> 6 source_tokenizer.fit_on_texts(en_texts)
      7 target_tokenizer = Tokenizer(max_vocab_size, filters=filter_chars)
      8 target_tokenizer.fit_on_texts(de_texts)

/usr/local/lib/python2.7/site-packages/Keras-1.0.6-py2.7.egg/keras/preprocessing/text.pyc in fit_on_texts(self, texts)
     85         for text in texts:
     86             self.document_count += 1
---> 87             seq = text if self.char_level else text_to_word_sequence(text, self.filters, self.lower, self.split)
     88             for w in seq:
     89                 if w in self.word_counts:

/usr/local/lib/python2.7/site-packages/Keras-1.0.6-py2.7.egg/keras/preprocessing/text.pyc in text_to_word_sequence(text, filters, lower, split)
     30     if lower:
     31         text = text.lower()
---> 32     text = text.translate(maketrans(filters, split*len(filters)))
     33     seq = text.split(split)
     34     return [_f for _f in seq if _f]

TypeError: character mapping must return integer, None or unicode
```
Looks like a string-encoding issue specific to Python 2.7; the same code works fine on 3.5. It appears to be a known open issue with Keras:
keras-team/keras#1072
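For anyone hitting this in the meantime, a likely cause (based on the traceback) is that Python 2's `string.maketrans` builds a 256-byte translation table for byte strings, while `unicode.translate` expects a dict mapping ordinals to replacements, so passing unicode texts into the Tokenizer blows up on line 32 above. A minimal sketch of a workaround, assuming you monkey-patch or replace `text_to_word_sequence` yourself (this is not the official Keras fix), is to build the dict-based mapping, which works for unicode on both Python 2 and 3:

```python
def text_to_word_sequence(text, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
                          lower=True, split=' '):
    """Split `text` into words, replacing filter characters with `split`.

    Uses a dict-based translation map ({ordinal: replacement}), which
    `unicode.translate` (Py2) and `str.translate` (Py3) both accept,
    avoiding the byte-string table from `string.maketrans`.
    """
    if lower:
        text = text.lower()
    # Map each filter character's ordinal to the split character.
    translate_map = {ord(c): split for c in filters}
    text = text.translate(translate_map)
    # Drop empty strings produced by consecutive separators.
    return [w for w in text.split(split) if w]
```

Alternatively, encoding the input texts to plain `str` before calling `fit_on_texts` may sidestep the problem, at the cost of mangling non-ASCII characters.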