np.loadtxt issue in Python3
Tanglinus opened this issue · comments
Hi! Thank you for your code!
I would like to leave a note here for whoever wants to practice this framework in Python3.
numpy.loadtxt (and also numpy.genfromtxt) doesn't work in Python3: it reads the *vectors.txt files as if they were empty, which then causes an index error. This may be due to a unicode problem.
The simplest way to solve this issue is to replace this IO part with the pandas library.
Besides, all the string objects need to be handled either by replacing them with str.encode('utf-8') or by just dropping the str.decode('utf-8') calls.
Everything else works well! Thank you for your great work again!
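For anyone following along, here is a minimal sketch of what the pandas replacement could look like. It assumes each row of a *vectors.txt file is a word followed by space-separated vector components with no header row. The in-memory sample, the variable names, and the chosen read_csv options are illustrative assumptions, not the repository's actual code.

```python
import csv
import io

import pandas as pd

# Hypothetical stand-in for a *vectors.txt file (assumed layout: word, then
# space-separated components). One token deliberately starts with '#'.
sample = io.StringIO("cat 0.1 0.2 0.3\n#dog 0.4 0.5 0.6\n")

df = pd.read_csv(
    sample,
    sep=" ",
    header=None,              # assumed: no header row in the file
    quoting=csv.QUOTE_NONE,   # treat quote characters inside tokens as plain text
    keep_default_na=False,    # keep tokens such as "nan" or "null" as strings
)

# First column holds the words, the remaining columns hold the vectors.
words = df.iloc[:, 0].astype(str).tolist()
vectors = df.iloc[:, 1:].to_numpy(dtype=float)
```

For a real file, the path would be passed instead of the StringIO object, probably together with encoding='utf-8'. pandas only treats '#' as a comment marker if the comment option is set explicitly, so tokens containing '#' are kept.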
Hi Tanglinus, thanks for the note! The problem may be related to the default value of the 'comments' option in numpy.loadtxt. In Python 3, '#' is interpreted as the start of a comment by default, which may cause problems if you have words in your corpus containing this character. I have now changed it everywhere to 'comments=None'. I hope this solves the problem.
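A small sketch of the effect described above, using made-up two-line input rather than a real corpus file: with the default comments='#', a row whose word starts with '#' is silently dropped, while comments=None keeps it. ndmin=2 is only used here so that both results stay two-dimensional.

```python
import io

import numpy as np

# Made-up sample data; the second word starts with '#'.
text = "cat 0.1 0.2\n#dog 0.3 0.4\n"

# Default behaviour: everything from '#' onwards is treated as a comment,
# so the second line becomes empty and is skipped.
default_rows = np.loadtxt(io.StringIO(text), dtype=object, delimiter=" ", ndmin=2)

# With comments=None no character starts a comment, so both rows are read.
all_rows = np.loadtxt(
    io.StringIO(text), dtype=object, delimiter=" ", comments=None, ndmin=2
)

print(default_rows.shape, all_rows.shape)
```

If the '#' appears in the middle of a line instead, the default setting truncates that row, which typically shows up as an error about an inconsistent number of columns.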
Hello @Tanglinus @Garrafao
I encountered a similar error here:
Traceback (most recent call last):
  File "space_creation/txt2w2v.py", line 54, in <module>
    main()
  File "space_creation/txt2w2v.py", line 41, in main
    space_array = np.loadtxt(spacePrefix + '.txt', dtype=object, delimiter=' ', skiprows=0, comments=None, encoding='utf-8')
  File "/home/jinan/.conda/envs/TR/lib/python2.7/site-packages/numpy/lib/npyio.py", line 1141, in loadtxt
    for x in read_data(_loadtxt_chunksize):
  File "/home/jinan/.conda/envs/TR/lib/python2.7/site-packages/numpy/lib/npyio.py", line 1056, in read_data
    for i, line in enumerate(line_iter):
  File "/home/jinan/.conda/envs/TR/lib/python2.7/codecs.py", line 314, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 2073-2074: invalid continuation byte
Is it the same error you had? Do you mean replacing np.loadtxt with pandas in txt2w2v.py?
Hi Jinan, could you please post lines 2073-2074 from your input file here?
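One possible way to find the offending input is to scan the file in binary mode and report every line that fails to decode as UTF-8. This is only a sketch: the helper name find_invalid_utf8_lines and the throwaway demo file are made up for illustration. Note that the positions in a UnicodeDecodeError raised while streaming may be relative to the chunk being decoded rather than to the start of the file, which is why this sketch checks line by line.

```python
import os
import tempfile


def find_invalid_utf8_lines(path):
    """Return (line_number, raw_bytes) for each line that is not valid UTF-8."""
    bad = []
    with open(path, "rb") as handle:
        for number, raw in enumerate(handle, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError:
                bad.append((number, raw))
    return bad


# Demo on a throwaway file: line 1 is valid UTF-8, line 2 is Latin-1 encoded.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("\u00fcber 0.1 0.2\n".encode("utf-8"))
    tmp.write("caf\u00e9 0.3 0.4\n".encode("latin-1"))
    demo_path = tmp.name

bad_lines = find_invalid_utf8_lines(demo_path)
os.remove(demo_path)

for number, raw in bad_lines:
    print(number, repr(raw))
```

Running the helper on the real input file (for example spacePrefix + '.txt') would list the lines whose raw bytes could then be posted here.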