akb89 / word2vec

Re-implementation of Word2Vec using Tensorflow v2 Estimators and Datasets

Unicode error

giosal opened this issue

Hello,

While training word2vec on sample English wiki data, I'm getting the following error:
w2v train --data enwiki.20190120.sample10.0.balanced.txt.7z --outputdir output
2020-03-23 15:06:25,712 - word2vec.main - INFO - Training Tensorflow implementation of Word2Vec
2020-03-23 15:06:25,714 - word2vec.estimators.word2vec - INFO - Building vocabulary from file enwiki.20190120.sample10.0.balanced.txt.7z
2020-03-23 15:06:25,714 - word2vec.estimators.word2vec - INFO - Loading word counts...
Traceback (most recent call last):
  File "/usr/local/bin/w2v", line 11, in <module>
    load_entry_point('tf-word2vec', 'console_scripts', 'w2v')()
  File "/home/giosal/word2vec/word2vec/main.py", line 126, in main
    args.func(args)
  File "/home/giosal/word2vec/word2vec/main.py", line 47, in _train
    w2v.build_vocab(args.datafile, vocab_filepath, args.min_count)
  File "/home/giosal/word2vec/word2vec/estimators/word2vec.py", line 49, in build_vocab
    for line in data_stream:
  File "/usr/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbc in position 2: invalid start byte
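
A note on the error itself: the 7z file signature is the ASCII characters "7z" followed by the bytes 0xBC 0xAF 0x27 0x1C, so byte 0xbc at position 2 is what one would expect to see if the compressed .7z archive were being decoded as UTF-8 text. The sketch below is only a diagnostic under that assumption; the helper names `looks_like_7z` and `first_utf8_error` are made up for illustration and are not part of the w2v package.

```python
# 7z file signature: ASCII "7z" followed by 0xBC 0xAF 0x27 0x1C.
SEVEN_ZIP_MAGIC = b"7z\xbc\xaf\x27\x1c"


def looks_like_7z(header: bytes) -> bool:
    """Return True if the leading bytes start with the 7z signature."""
    return header.startswith(SEVEN_ZIP_MAGIC)


def first_utf8_error(data: bytes):
    """Return (position, byte) of the first UTF-8 decode failure, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, data[err.start]
    return None


def read_header(path: str, size: int = 6) -> bytes:
    """Read the first few raw bytes of a file (hypothetical helper)."""
    with open(path, "rb") as handle:
        return handle.read(size)


if __name__ == "__main__":
    # Decoding the signature alone fails at position 2 on byte 0xbc,
    # the same position and byte reported in the traceback.
    print(first_utf8_error(SEVEN_ZIP_MAGIC))
```

If `looks_like_7z(read_header("enwiki.20190120.sample10.0.balanced.txt.7z"))` returns True, extracting the archive first (for example with `7z x`) and passing the extracted text file to `--data` may avoid the error. Whether `w2v train` is meant to accept compressed input directly is not clear from the log.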