np.loadtxt issue in Python3

Question

Tanglinus opened this issue 5 years ago

Hi! Thank you for your code!
I would like to leave a note here for whoever wants to practice this framework in Python3.

In Python 3, numpy.loadtxt (and also numpy.genfromtxt) does not work: it reads the *vectors.txt files as empty, which causes an index error. This may be due to a Unicode problem.
The simplest way to solve this issue is to replace this I/O part with the pandas library.
Besides, all string objects need either to be replaced with str.encode('utf-8') or to have their .decode('utf-8') calls dropped.
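
For anyone who wants to try the pandas route, here is a minimal sketch of what such a replacement could look like. It assumes each line of the vectors file has the form "word v1 v2 ..." separated by single spaces; the function name load_vectors and the sample data are made up for illustration and are not part of the repository.

```python
import csv
import io

import pandas as pd


def load_vectors(source, skiprows=0):
    """Read 'word v1 v2 ...' lines into a list of words and a float matrix.

    Hypothetical helper: `source` is a file path or a file-like object.
    """
    frame = pd.read_csv(
        source,
        sep=" ",
        header=None,             # the file has no column names
        skiprows=skiprows,
        dtype={0: str},          # keep the word column as plain strings
        quoting=csv.QUOTE_NONE,  # treat quote characters inside words literally
        keep_default_na=False,   # do not turn words such as "null" into NaN
        na_filter=False,
        encoding="utf-8",
    )
    words = frame.iloc[:, 0].tolist()
    matrix = frame.iloc[:, 1:].to_numpy(dtype=float)
    return words, matrix


# Small in-memory sample standing in for a *vectors.txt file.
sample = "#tag 0.1 0.2\nnull 0.3 0.4\n"
words, matrix = load_vectors(io.StringIO(sample))
```

pandas does not treat '#' as a comment character unless asked to, so a word such as "#tag" is read like any other row.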

Everything else works well! Thank you for your great work again!

Garrafao · Answer 1 · Mon Jan 13 2020 09:05:42 GMT+0800 (China Standard Time)

Hi Tanglinus, thanks for the note! The problem may be related to the default value of the 'comments' option in numpy.loadtxt. In Python 3, '#' is interpreted by default as the start of a comment, which may cause problems if words in your corpus contain this character. I have now changed it everywhere to 'comments=None'. I hope this solves the problem.
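
A small sketch of the effect described above, using a made-up two-line sample rather than a real vectors file: with the default comments setting the row whose word starts with '#' is dropped, while comments=None keeps it.

```python
import io

import numpy as np

# Made-up sample: the second word starts with '#'.
sample = "cat 0.1 0.2\n#tag 0.3 0.4\n"

# Default behaviour: everything from '#' to the end of the line is ignored,
# so the second line becomes empty and is skipped.
default_rows = np.loadtxt(io.StringIO(sample), dtype=str, delimiter=" ", ndmin=2)

# comments=None disables comment handling, so both rows are kept.
all_rows = np.loadtxt(
    io.StringIO(sample), dtype=str, delimiter=" ", comments=None, ndmin=2
)

print(default_rows.shape, all_rows.shape)
```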

Neo Chow · Answer 2 · Sun Mar 08 2020 02:59:58 GMT+0800 (China Standard Time)

Hello @Tanglinus @Garrafao

I encountered a similar error here:

Traceback (most recent call last):
  File "space_creation/txt2w2v.py", line 54, in <module>
    main()
  File "space_creation/txt2w2v.py", line 41, in main
    space_array = np.loadtxt(spacePrefix + '.txt', dtype=object, delimiter=' ', skiprows=0, comments=None, encoding='utf-8')
  File "/home/jinan/.conda/envs/TR/lib/python2.7/site-packages/numpy/lib/npyio.py", line 1141, in loadtxt
    for x in read_data(_loadtxt_chunksize):
  File "/home/jinan/.conda/envs/TR/lib/python2.7/site-packages/numpy/lib/npyio.py", line 1056, in read_data
    for i, line in enumerate(line_iter):
  File "/home/jinan/.conda/envs/TR/lib/python2.7/codecs.py", line 314, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 2073-2074: invalid continuation byte

Is it the same error you had? Do you mean replacing np.loadtxt with pandas in txt2w2v.py?

Garrafao · Answer 3 · Sun Mar 08 2020 20:59:12 GMT+0800 (China Standard Time)

Hi Jinan, could you please post lines 2073-2074 from your input file here?

Neo Chow · Answer 4 · Mon Mar 09 2020 22:34:58 GMT+0800 (China Standard Time)

Hello @Garrafao

After running some deeper tests, I found that the error is caused by non-UTF-8 characters in my corpus. The problem was solved after I converted the corpus to UTF-8 encoding. So there is nothing wrong with the code. Thank you!
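
For others who hit the same UnicodeDecodeError, a rough sketch of how the offending bytes could be located and the data re-encoded. The helper names and the sample are hypothetical, and the source encoding (latin-1 here) is only an assumption; the real encoding of a corpus has to be determined separately. Note that the positions in a UnicodeDecodeError are byte offsets within the data being decoded, not line numbers.

```python
def find_invalid_utf8(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from source_encoding (supplied by the caller) to UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


# Made-up sample line encoded as latin-1, which is not valid UTF-8.
raw = "caf\u00e9 0.1 0.2\n".encode("latin-1")

bad_offset = find_invalid_utf8(raw)   # offset of the problematic byte
fixed = to_utf8(raw, "latin-1")       # assumes the input really is latin-1
still_bad = find_invalid_utf8(fixed)  # None once the data is valid UTF-8
```

For a real corpus the bytes would come from opening the file in binary mode ('rb'), and the converted bytes would be written back in binary mode ('wb').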