ml4a / ml4a

A python library and collection of notebooks for making art with machine learning.

Home Page: https://ml4a.net

seq2seq guide: tokenizer crashes

genekogan opened this issue

getting this on `source_tokenizer.fit_on_texts(en_texts)`:

```
TypeError                                 Traceback (most recent call last)
in <module>()
      4
      5 source_tokenizer = Tokenizer(max_vocab_size, filters=filter_chars)
----> 6 source_tokenizer.fit_on_texts(en_texts)
      7 target_tokenizer = Tokenizer(max_vocab_size, filters=filter_chars)
      8 target_tokenizer.fit_on_texts(de_texts)

/usr/local/lib/python2.7/site-packages/Keras-1.0.6-py2.7.egg/keras/preprocessing/text.pyc in fit_on_texts(self, texts)
     85         for text in texts:
     86             self.document_count += 1
---> 87             seq = text if self.char_level else text_to_word_sequence(text, self.filters, self.lower, self.split)
     88             for w in seq:
     89                 if w in self.word_counts:

/usr/local/lib/python2.7/site-packages/Keras-1.0.6-py2.7.egg/keras/preprocessing/text.pyc in text_to_word_sequence(text, filters, lower, split)
     30     if lower:
     31         text = text.lower()
---> 32     text = text.translate(maketrans(filters, split*len(filters)))
     33     seq = text.split(split)
     34     return [_f for _f in seq if _f]

TypeError: character mapping must return integer, None or unicode
```

looks like a string-encoding issue with Python 2.7; it works fine on 3.5.
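
for context, here's a minimal Python 2.7 sketch of why it blows up (my own repro, not from the notebook; the `filters` string is just an example filter set): `string.maketrans` returns a 256-byte translation table, which `str.translate` accepts but `unicode.translate` does not, since the unicode version wants a dict keyed by character ordinals:

```python
# Python 2.7 only: reproduces the TypeError outside of Keras.
from string import maketrans

filters = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'  # example filter set
table = maketrans(filters, ' ' * len(filters))    # 256-byte translation table

print 'hello, world!'.translate(table)   # str.translate accepts the table

try:
    u'hello, world!'.translate(table)    # unicode.translate does not
except TypeError as e:
    print e  # character mapping must return integer, None or unicode

# unicode.translate instead expects a dict keyed by character ordinals:
unicode_table = {ord(c): u' ' for c in filters}
print u'hello, world!'.translate(unicode_table)   # works
```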
seems to be a known open issue with Keras: keras-team/keras#1072
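
until that's fixed upstream, one possible workaround (an untested sketch, reusing the notebook's `en_texts` / `de_texts` and tokenizer variables) is to encode the unicode texts to UTF-8 byte strings before fitting, so Keras takes the `str.translate` path under Python 2; the ASCII filter characters never occur inside multi-byte UTF-8 sequences, so the filtering should stay safe:

```python
# Workaround sketch (assumes en_texts / de_texts are lists of unicode
# strings, as in the notebook): feed byte strings to the tokenizer so
# Keras hits the str.translate code path under Python 2.
en_texts_utf8 = [t.encode('utf-8') for t in en_texts]
de_texts_utf8 = [t.encode('utf-8') for t in de_texts]

source_tokenizer = Tokenizer(max_vocab_size, filters=filter_chars)
source_tokenizer.fit_on_texts(en_texts_utf8)
target_tokenizer = Tokenizer(max_vocab_size, filters=filter_chars)
target_tokenizer.fit_on_texts(de_texts_utf8)
```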