Garrafao / TemporalReferencing

An easy and robust model for Lexical Semantic Change Detection

np.loadtxt issue in Python3

Tanglinus opened this issue · comments

Hi! Thank you for your code!
I would like to leave a note here for anyone who wants to use this framework in Python 3.

numpy.loadtxt (and also numpy.genfromtxt) does not work in Python 3: it reads the *vectors.txt files as if they were empty, which causes an index error. This may be due to a Unicode problem.
The simplest way to solve this issue is to replace this I/O part with the pandas library.
Besides, the string handling needs adjusting: string objects either have to be encoded with str.encode('utf-8'), or the existing .decode('utf-8') calls can simply be dropped.
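A pandas-based reader along the lines described above might look like the following. This is only a minimal sketch, not the code used in the repository: the function name `load_vectors`, the `skiprows` parameter, and the assumed file layout (one word followed by space-separated floats per line) are illustrative assumptions.

```python
import csv

import pandas as pd


def load_vectors(path, skiprows=0):
    """Read a space-separated vectors file into (words, matrix).

    Hypothetical helper: assumes each line is a word followed by floats.
    """
    table = pd.read_csv(
        path,
        sep=" ",
        header=None,             # the file has no column names
        skiprows=skiprows,       # set to 1 if the file starts with a header line
        quoting=csv.QUOTE_NONE,  # treat quote characters in words literally
        comment=None,            # do not treat '#' (or anything else) as a comment
        na_filter=False,         # keep words such as "nan" or "null" as text
        dtype={0: str},          # the first column always holds the word
        encoding="utf-8",
    )
    words = table[0].tolist()
    matrix = table.iloc[:, 1:].to_numpy(dtype=float)
    return words, matrix
```

Calling `load_vectors("my_vectors.txt")` (a made-up file name) returns the list of words and a float matrix with one row per word.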

Everything else works well! Thank you for your great work again!

Hi Tanglinus, thanks for the note! The problem may be related to the default value of the 'comments' option in numpy.loadtxt. In Python 3, '#' is interpreted by default as the start of a comment, which may cause problems if words in your corpus contain this character. I have now changed it everywhere to 'comments=None'. I hope this solves the problem.

Hello @Tanglinus @Garrafao

I encountered a similar error here:

Traceback (most recent call last):
  File "space_creation/txt2w2v.py", line 54, in <module>
    main()
  File "space_creation/txt2w2v.py", line 41, in main
    space_array = np.loadtxt(spacePrefix + '.txt', dtype=object, delimiter=' ', skiprows=0, comments=None, encoding='utf-8')
  File "/home/jinan/.conda/envs/TR/lib/python2.7/site-packages/numpy/lib/npyio.py", line 1141, in loadtxt
    for x in read_data(_loadtxt_chunksize):
  File "/home/jinan/.conda/envs/TR/lib/python2.7/site-packages/numpy/lib/npyio.py", line 1056, in read_data
    for i, line in enumerate(line_iter):
  File "/home/jinan/.conda/envs/TR/lib/python2.7/codecs.py", line 314, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 2073-2074: invalid continuation byte

Is it the same error you had? Do you mean replacing np.loadtxt with pandas in txt2w2v.py?

Hi Jinan, could you please post lines 2073-2074 from your input file here?

Hello @Garrafao

After some further testing, I found that the error is caused by non-UTF-8 characters in my corpus. The problem was solved after I converted the corpus to UTF-8 encoding. So there is nothing wrong with the code. Thank you!
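For anyone hitting the same UnicodeDecodeError, a check like the one below can help locate the offending lines before running the pipeline. It is a rough sketch using only the standard library; the function names are made up for illustration, and `source_encoding="latin-1"` is purely an assumption, since the thread does not say which encoding the corpus originally used. Note that the positions reported in a UnicodeDecodeError are byte offsets within the data being decoded, not line numbers.

```python
def find_invalid_utf8(path):
    """Return (line_number, byte_offset_in_line) for lines that are not valid UTF-8."""
    bad_lines = []
    with open(path, "rb") as handle:
        for number, raw in enumerate(handle, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as error:
                # error.start is the offset of the first undecodable byte in this line
                bad_lines.append((number, error.start))
    return bad_lines


def convert_to_utf8(source, target, source_encoding="latin-1"):
    """Rewrite a text file as UTF-8; source_encoding is an assumption to adjust."""
    with open(source, "r", encoding=source_encoding, newline="") as reader, \
            open(target, "w", encoding="utf-8", newline="") as writer:
        for line in reader:
            writer.write(line)
```

Running `find_invalid_utf8` on the converted file should return an empty list if the chosen source encoding was the right one.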