a pickle file problem

Question

opened this issue 5 years ago · comments

Hi, @yoonkim
I am a beginner in natural language processing and machine learning. Since the 'GoogleNews-vectors-negative300.bin' file is quite large, all of my attempts to create the pickle file ('mr.p') have failed. Could you give me some advice on creating 'mr.p' with 16GB~32GB of RAM, if you don't mind?

I also wonder whether 'mr.p' needs to be processed in chunks to solve the memory problem. (I know little about pickle files.)

Thank you

GaoZhongqin · Answer 1 · Mon Sep 30 2019 14:36:25 GMT+0800 (China Standard Time)

Hi, @soohyunee, did you solve this? I am confused by the same problem right now. Do you have any ideas?

Deleted user · Answer 2 · Mon Sep 30 2019 22:08:34 GMT+0800 (China Standard Time)

Hi, @GaoZhongqin
I didn't solve the 'mr.p' related problems, but this Kaggle kernel helped me build an embedding layer without trouble.

https://www.kaggle.com/ia1na09/cnn-keras-pretrained-word2vec-yoon-kim-model

If you don't need the 'mr.p' file, I suggest following the approach in this kernel. I hope it helps you as well :)
Thank you

Justin Weigle · Answer 3 · Wed Oct 30 2019 04:37:16 GMT+0800 (China Standard Time)

Hello all,

If you are attempting this under Python 3 and running into memory limits, the issue likely lies in the string processing. Python 2 and Python 3 handle binary files differently: in Python 3, a file opened in "rb" mode returns bytes objects, so any literal you compare the data against must be a bytes literal, written with a lowercase b prefix.
Here is an example:

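with open(fname, "rb") as f:  # fname: path to the binary file
    for line in range(foo):  # foo: placeholder for the number of entries to read
        ch = f.read(1)  # in Python 3 this returns a bytes object
        if ch == b' ':
            pass  # do something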

Notice the space ' ' has a b before it: b' '
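Without this b, the comparison is always False in Python 3, even when the byte read is a space, because a bytes object never compares equal to a str. A loop that waits for that comparison before breaking will never exit, and if it keeps appending to a buffer in the meantime, memory use can grow until the machine runs out.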
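
To make this concrete, here is a minimal sketch of a Python 3 reader for a word2vec-style binary file. It is an illustration under stated assumptions, not the repository's actual code. It assumes the usual layout (a text header with the vocabulary size and dimension, then each word followed by a space and its float32 values) and a little-endian machine. The names load_bin_vec, write_demo_file, and the tiny demo file are all hypothetical. It keeps only the vectors for words in your own vocabulary and skips the rest, which is one way to stay within limited RAM.

```python
import os
import struct
import tempfile
from array import array


def load_bin_vec(fname, vocab):
    """Read only the vectors for words in `vocab` from a word2vec-style binary file."""
    word_vecs = {}
    with open(fname, "rb") as f:
        # Header line: "<vocab_size> <dimension>\n"
        vocab_size, dim = map(int, f.readline().split())
        binary_len = 4 * dim  # assumes float32, 4 bytes per component
        for _ in range(vocab_size):
            chars = []
            while True:
                ch = f.read(1)
                if ch == b"":  # end of file: stop instead of looping forever
                    raise EOFError("unexpected end of file")
                if ch == b" ":  # bytes literal, needed in Python 3
                    break
                if ch != b"\n":
                    chars.append(ch)
            word = b"".join(chars).decode("utf-8", errors="ignore")
            if word in vocab:
                vec = array("f")  # native float32 array (assumes little-endian host)
                vec.frombytes(f.read(binary_len))
                word_vecs[word] = vec
            else:
                f.seek(binary_len, os.SEEK_CUR)  # skip vectors we do not need
    return word_vecs


def write_demo_file(path):
    """Write a tiny file in the assumed layout, for demonstration only."""
    rows = {"good": [1.0, 2.0], "bad": [3.0, 4.0], "film": [5.0, 6.0]}
    with open(path, "wb") as f:
        f.write(b"3 2\n")
        for word, values in rows.items():
            f.write(word.encode("utf-8") + b" ")
            f.write(struct.pack("2f", *values))  # native byte order
            f.write(b"\n")


demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
write_demo_file(demo_path)
vectors = load_bin_vec(demo_path, vocab={"good", "film"})
print(sorted(vectors))
```

Because the vectors for out-of-vocabulary words are skipped with a seek rather than read, memory use depends on the size of your vocabulary, not on the size of the full binary file. The explicit end-of-file check also turns a would-be endless loop into an error you can see.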

Hope this helps.