yoonkim / CNN_sentence

CNNs for sentence classification

a pickle file problem

Hi, @yoonkim
I am a beginner in natural language processing and machine learning. Because the 'GoogleNews-vectors-negative300.bin' file is quite large, all of my attempts to create the pickle file ('mr.p') have failed. Could you give me some advice on creating 'mr.p' with 16GB~32GB of RAM, if you don't mind?

I also wonder whether 'mr.p' needs to be processed in chunks to solve the memory problem. (I know very little about pickle files.)

Thank you

Hi, @soohyunee, did you solve this? I am stuck on the same problem right now. Do you have any ideas?

Hi, @GaoZhongqin
I didn't solve the 'mr.p' problems, but this Kaggle kernel helped me build an embedding layer without trouble.

https://www.kaggle.com/ia1na09/cnn-keras-pretrained-word2vec-yoon-kim-model

If you don't need the 'mr.p' file, I suggest following the approach in this kernel. I hope it helps you as well :)
Thank you

Hello all,

If you are attempting to do this under Python 3 and are running into memory limits, your issue likely lies in the string processing. Python 2 and Python 3 handle binary files differently: in Python 3, reading from a file opened in binary mode returns bytes objects, so any string literal compared against that data must be a bytes literal, written with a lowercase b prefix, for the comparison to succeed.
Here is an example:

with open(fname, "rb") as f:
    for line in range(foo):
        ch = f.read(1)      # returns a bytes object of length 1 in Python 3
        if ch == b' ':
            pass            # do something

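Notice that the space ' ' has a b before it: b' '
Without this b, the comparison is always False in Python 3, even when the byte read from the file is a space, because a bytes object never compares equal to a str. If that comparison is what ends a read loop, the loop never stops and keeps accumulating data, so memory use grows without bound.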
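
Here is a minimal sketch of a fuller read loop under Python 3. It assumes the usual word2vec binary layout (a header line with the vocabulary size and vector dimension, then each word terminated by a space and followed by its float32 values). The function names, the vocab argument, and the tiny in-memory sample are made up for illustration and are not taken from the repository's code. Keeping only the words that appear in your own vocabulary is also what keeps memory use small.

```python
import io
import struct


def read_word(f):
    """Read one space-terminated word from a stream opened in binary mode."""
    chars = []
    while True:
        ch = f.read(1)
        if ch == b"":
            # End of file: stop here instead of appending empty reads forever.
            raise EOFError("unexpected end of file while reading a word")
        if ch == b" ":
            # bytes literal; ch == " " would always be False in Python 3
            return b"".join(chars).decode("utf-8", errors="replace")
        if ch != b"\n":
            chars.append(ch)


def load_bin_vec_py3(f, vocab):
    """Return {word: vector} for the words in vocab only (hypothetical helper)."""
    header = f.readline()
    vocab_size, dim = map(int, header.split())
    binary_len = 4 * dim  # float32 is 4 bytes per value
    vecs = {}
    for _ in range(vocab_size):
        word = read_word(f)
        raw = f.read(binary_len)  # always consume the vector bytes
        if word in vocab:
            vecs[word] = struct.unpack("<%df" % dim, raw)
    return vecs


# Tiny made-up sample in the assumed layout: 2 words, 3 dimensions each.
sample = (
    b"2 3\n"
    + b"good " + struct.pack("<3f", 1.0, 2.0, 3.0)
    + b"\nbad " + struct.pack("<3f", 4.0, 5.0, 6.0)
)
vecs = load_bin_vec_py3(io.BytesIO(sample), {"good"})
```

With a real file you would pass open(fname, "rb") instead of the BytesIO object. The explicit end-of-file check is an extra safeguard so that a mistake in the comparison raises an error instead of silently eating memory.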

Hope this helps.