not enough memory: you tried to allocate 0GB.
michael20at opened this issue · comments
I ran python main.py --mode prepro --data_file hotpot_train_v1.json --para_limit 2250 --data_split train, which worked fine until I got:
RuntimeError: $ Torch: not enough memory: you tried to allocate 0GB. Buy new RAM! at ..\aten\src\TH\THGeneral.cpp:204.
I have 32GB of RAM and a 1070 GPU (8GB). Is that not enough? And why does it say 0GB?
PS: I'm using PyTorch 0.4; might that be an issue? I don't see how it connects to RAM.
In my experiments, the current implementation uses 60+ GB of CPU RAM for preprocessing, which exceeds your hardware's capacity. You might need to find another machine or modify the preprocessing code (e.g., slicing the input into small chunks).
I managed to preprocess and train on a 32GB machine, but the preprocessing is slower than the original code. Give it a try if you don't have access to larger machines.
Using HDF5 for the data loader may also save memory.
import json
import os
import random
from collections import Counter

import torch
from tqdm import tqdm

# _process_article, get_embedding, save and prepro_dev come from the
# original prepro.py in this repo.


def prepro_train(config):
    random.seed(13)
    record_file = config.train_record_file
    eval_file = config.train_eval_file
    example_jsonl = 'examples.jsonl'
    eval_example_jsonl = 'eval_examples.jsonl'
    word_counter, char_counter = Counter(), Counter()
    data = json.load(open(config.data_file, 'r'))
    # Stream processed examples to disk instead of holding them all in RAM.
    with open(example_jsonl, "a") as fr:
        with open(eval_example_jsonl, "a") as fe:
            for article in tqdm(data, total=len(data)):
                example, eval_example = _process_article(article, config)
                for token in example['ques_tokens'] + example['context_tokens']:
                    word_counter[token] += 1
                    for char in token:
                        char_counter[char] += 1
                json.dump(example, fr)
                fr.write('\n')
                json.dump(eval_example, fe)
                fe.write('\n')
    word_emb_mat, word2idx_dict, idx2word_dict = get_embedding(
        word_counter, "word", emb_file=config.glove_word_file,
        size=config.glove_word_size, vec_size=config.glove_dim,
        token2idx_dict=None)
    char_emb_mat, char2idx_dict, idx2char_dict = get_embedding(
        char_counter, "char", emb_file=None, size=None,
        vec_size=config.char_dim, token2idx_dict=None)
    if not os.path.isfile(config.word2idx_file):
        save(config.word_emb_file, word_emb_mat, message="word embedding")
        save(config.char_emb_file, char_emb_mat, message="char embedding")
        save(config.word2idx_file, word2idx_dict, message="word2idx")
        save(config.char2idx_file, char2idx_dict, message="char2idx")
        save(config.idx2word_file, idx2word_dict, message='idx2word')
        save(config.idx2char_file, idx2char_dict, message='idx2char')
    # with open(config.word2idx_file, "r") as fh:
    #     word2idx_dict = json.load(fh)
    #
    # with open(config.char2idx_file, "r") as fh:
    #     char2idx_dict = json.load(fh)
    para_limit = config.para_limit
    ques_limit = config.ques_limit
    char_limit = config.char_limit

    def filter_func(exm):
        return len(exm["context_tokens"]) > para_limit or len(exm["ques_tokens"]) > ques_limit

    def _get_word(word):
        for each in (word, word.lower(), word.capitalize(), word.upper()):
            if each in word2idx_dict:
                return word2idx_dict[each]
        return 1  # OOV index

    def _get_char(char):
        return char2idx_dict.get(char, 1)  # 1 = OOV index

    # build_features(config, examples, config.data_split, record_file, word2idx_dict, char2idx_dict)
    # save(eval_file, eval_examples, message='{} eval'.format(config.data_split))
    data_points = []
    with tqdm(total=os.path.getsize(example_jsonl)) as pbar:
        with open(example_jsonl, "r") as fr:
            for line in fr:
                pbar.update(len(line))
                example = json.loads(line)
                if filter_func(example):
                    continue
                context_idxs = torch.LongTensor(para_limit).zero_()
                context_char_idxs = torch.LongTensor(para_limit, char_limit).zero_()
                ques_idxs = torch.LongTensor(ques_limit).zero_()
                ques_char_idxs = torch.LongTensor(ques_limit, char_limit).zero_()
                for i, token in enumerate(example["context_tokens"]):
                    context_idxs[i] = _get_word(token)
                for i, token in enumerate(example["ques_tokens"]):
                    ques_idxs[i] = _get_word(token)
                for i, token in enumerate(example["context_chars"]):
                    for j, char in enumerate(token):
                        if j == char_limit:
                            break
                        context_char_idxs[i, j] = _get_char(char)
                for i, token in enumerate(example["ques_chars"]):
                    for j, char in enumerate(token):
                        if j == char_limit:
                            break
                        ques_char_idxs[i, j] = _get_char(char)
                y1, y2 = example["y1s"][-1], example["y2s"][-1]
                data_points.append({'context_idxs': context_idxs,
                                    'context_char_idxs': context_char_idxs,
                                    'ques_idxs': ques_idxs,
                                    'ques_char_idxs': ques_char_idxs,
                                    'y1': y1,
                                    'y2': y2,
                                    'id': example['id'],
                                    'start_end_facts': example['start_end_facts']})
    torch.save(data_points, record_file)
    del data_points

    eval_examples = {}
    with tqdm(total=os.path.getsize(eval_example_jsonl)) as pbar:
        with open(eval_example_jsonl, "r") as fe:
            for line in fe:
                pbar.update(len(line))
                e = json.loads(line)
                eval_examples[e['id']] = e
    save(eval_file, eval_examples, message='{} eval'.format(config.data_split))


def prepro(config):
    if config.data_split == 'train':
        prepro_train(config)
    else:
        prepro_dev(config)
Thank you, but there is no prepro_dev (last line) in your script, or am I missing something?
Renaming the original prepro to prepro_dev does the trick.
That worked, thank you for your help, but sadly I still get a memory error on the first part:
6962 tokens have corresponding embedding vector
66%|█████████████████████████████████████▋ | 7523156815/11395583004 [2:50:05<1:25:39, 753511.45it/s]
Traceback (most recent call last):
File "main.py", line 86, in <module>
prepro(config)
File "F:\Python\WPy-3670\examples\LSTM\hotpot\prepro.py", line 491, in prepro
prepro_train(config)
File "F:\Python\WPy-3670\examples\LSTM\hotpot\prepro.py", line 384, in prepro_train
context_char_idxs = torch.LongTensor(para_limit, char_limit).zero_()
RuntimeError: $ Torch: not enough memory: you tried to allocate 0GB. Buy new RAM! at ..\aten\src\TH\THGeneral.cpp:201
Then I suggest using multiple HDF5 files to save all the data_points. When doing training, create an HDF5 dataloader. A good tutorial is here: https://github.com/fab-jul/hdf5_dataloader
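To make the suggestion concrete, here is a minimal sketch assuming h5py is installed; the file name, dataset name, and sizes are made up for illustration, not taken from the repo:

```python
import h5py
import numpy as np

# Hypothetical sketch: write preprocessed examples to an HDF5 file one row at
# a time, so the full data_points list never has to live in RAM at once.
n_examples, para_limit = 3, 5
with h5py.File('train_records.h5', 'w') as f:
    dset = f.create_dataset('context_idxs', shape=(n_examples, para_limit),
                            dtype='int64')
    for i in range(n_examples):
        dset[i] = np.arange(para_limit)  # one example's indices at a time

# A Dataset-style loader then reads single rows on demand instead of the
# whole tensor, as in the hdf5_dataloader tutorial.
with h5py.File('train_records.h5', 'r') as f:
    row = f['context_idxs'][1]
```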
Hm, thank you for your help. I could preprocess (after shutting down anything taking RAM and looking through everything in prepro.py) and save all the necessary files. The problem now appears when I try to train:
After python main.py --mode train --para_limit 2250 --batch_size 24 --init_lr 0.1 --keep_prob 1.0 --sp_lambda 1.0
I get:
invalid index of a 0-dim tensor. Use tensor.item() to convert a 0-dim tensor to a Python number
Any help would be appreciated, thanks! 👍
This is due to the PyTorch version.
- total_loss += loss.data[0]
+ total_loss += loss.item()
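For reference, a quick sketch of why the old line fails: in PyTorch 0.4 and later a scalar loss is a 0-dimensional tensor, so indexing it with [0] triggers exactly the error above, and .item() is the way to extract the Python number:

```python
import torch

loss = torch.tensor(2.5)   # a scalar loss is 0-dim in PyTorch >= 0.4
print(loss.dim())          # 0
# loss.data[0] would trigger the "invalid index of a 0-dim tensor" error
total_loss = 0.0
total_loss += loss.item()  # .item() extracts the Python number
```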
Thank you again, it's training now, will take a few hours / days, can't wait to see if it works! Hope there are no more errors, you've been most helpful! 😊 👍
Oh no! It trained fine, but stopped after episode 0 with an F1 score of 46. When I try to predict it gives a shape error!
With python main.py --mode test --data_split dev --para_limit 2250 --batch_size 24 --init_lr 0.1 --keep_prob 1.0 --sp_lambda 1.0 --save HOTPOT-20190113-103231 --prediction_file dev_distractor_pred.json:
RuntimeError: Error(s) in loading state_dict for SPModel:
size mismatch for rnn_start.rnns.0.weight_ih_l0: copying a param with shape torch.Size([240, 81]) from checkpoint, the shape in current model is torch.Size([240, 240]).
size mismatch for rnn_start.rnns.0.weight_ih_l0_reverse: copying a param with shape torch.Size([240, 81]) from checkpoint, the shape in current model is torch.Size([240, 240]).
size mismatch for rnn_end.rnns.0.weight_ih_l0: copying a param with shape torch.Size([240, 241]) from checkpoint, the shape in current model is torch.Size([240, 240]).
size mismatch for rnn_end.rnns.0.weight_ih_l0_reverse: copying a param with shape torch.Size([240, 241]) from checkpoint, the shape in current model is torch.Size([240, 240]).
size mismatch for rnn_type.rnns.0.weight_ih_l0: copying a param with shape torch.Size([240, 241]) from checkpoint, the shape in current model is torch.Size([240, 240]).
size mismatch for rnn_type.rnns.0.weight_ih_l0_reverse: copying a param with shape torch.Size
So close and yet so far! 🤔 Could this be because of the changed preprocessing? Any help appreciated, thank you!
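I'm not sure of the root cause inside SPModel, but this class of error is easy to reproduce: load_state_dict fails whenever a saved layer's input size differs from the current model's. A minimal sketch, with layer sizes chosen only to mirror the rnn_start shapes in the traceback (they are not the actual SPModel dimensions):

```python
import torch.nn as nn

# Hypothetical reproduction: the checkpoint came from a model whose first
# LSTM took 81-dim inputs, while the model built at test time expects
# 240-dim inputs (4 * hidden_size = 240 rows in weight_ih_l0).
saved = nn.LSTM(input_size=81, hidden_size=60, bidirectional=True)
current = nn.LSTM(input_size=240, hidden_size=60, bidirectional=True)

print(saved.weight_ih_l0.shape)       # torch.Size([240, 81])
try:
    current.load_state_dict(saved.state_dict())
except RuntimeError as e:
    print('size mismatch' in str(e))  # True
```

This would mean the test-time model was constructed with different config values than the training run, so the same flags (and any code edits) need to match in both.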