hotpotqa / hotpot

not enough memory: you tried to allocate 0GB.

michael20at opened this issue

I tried running python main.py --mode prepro --data_file hotpot_train_v1.json --para_limit 2250 --data_split train, which worked fine until I got:

RuntimeError: $ Torch: not enough memory: you tried to allocate 0GB. Buy new RAM! at ..\aten\src\TH\THGeneral.cpp:204.

I have 32GB of RAM and a 1070 GPU (8GB). Is that not enough? And why is it saying 0GB?

PS: I'm using PyTorch 0.4, might that be an issue? I don't see how it connects to RAM, though.

In my experiments, the current implementation uses 60+ GB CPU RAM for preprocessing, which exceeds the limit of your hardware. You might need to find another machine or modify the preprocessing code (e.g., slicing the input into small chunks).
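For reference, a minimal sketch of the "slicing the input into small chunks" idea (the function name and chunk size below are illustrative, not part of the repo); each chunk file could then be fed through preprocessing separately as --data_file:

import json

def split_train_file(data_file, chunk_size=10000):
    # Load the full HotpotQA train JSON once and write it back out as smaller chunk files.
    with open(data_file, 'r') as f:
        data = json.load(f)
    for i in range(0, len(data), chunk_size):
        chunk_path = '{}.chunk{:03d}.json'.format(data_file, i // chunk_size)
        with open(chunk_path, 'w') as f:
            json.dump(data[i:i + chunk_size], f)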

I managed to preprocess and train on a 32GB machine, but the preprocessing is slower than the original code. Give it a try if you don't have access to larger machines.

Using HDF5 for the data loader may also save memory.

# Note: _process_article, get_embedding, and save are helpers from the repo's prepro.py.
import json
import os
import random
from collections import Counter

import torch
from tqdm import tqdm


def prepro_train(config):
    random.seed(13)

    record_file = config.train_record_file
    eval_file = config.train_eval_file

    example_jsonl = 'examples.jsonl'
    eval_example_jsonl = 'eval_examples.jsonl'

    word_counter, char_counter = Counter(), Counter()
    data = json.load(open(config.data_file, 'r'))
    # Both files are opened in append mode, so delete them before re-running.
    with open(example_jsonl, "a") as fr:
        with open(eval_example_jsonl, "a") as fe:
            for article in tqdm(data, total=len(data)):
                example, eval_example = _process_article(article, config)
                for token in example['ques_tokens'] + example['context_tokens']:
                    word_counter[token] += 1
                    for char in token:
                        char_counter[char] += 1
                json.dump(example, fr)
                fr.write('\n')

                json.dump(eval_example, fe)
                fe.write('\n')

    word_emb_mat, word2idx_dict, idx2word_dict = get_embedding(word_counter, "word",
                                                               emb_file=config.glove_word_file,
                                                               size=config.glove_word_size,
                                                               vec_size=config.glove_dim,
                                                               token2idx_dict=None)

    char_emb_mat, char2idx_dict, idx2char_dict = get_embedding(
        char_counter, "char", emb_file=None, size=None, vec_size=config.char_dim, token2idx_dict=None)

    if not os.path.isfile(config.word2idx_file):
        save(config.word_emb_file, word_emb_mat, message="word embedding")
        save(config.char_emb_file, char_emb_mat, message="char embedding")
        save(config.word2idx_file, word2idx_dict, message="word2idx")
        save(config.char2idx_file, char2idx_dict, message="char2idx")
        save(config.idx2word_file, idx2word_dict, message='idx2word')
        save(config.idx2char_file, idx2char_dict, message='idx2char')

    # with open(config.word2idx_file, "r") as fh:
    #     word2idx_dict = json.load(fh)
    #
    # with open(config.char2idx_file, "r") as fh:
    #     char2idx_dict = json.load(fh)

    para_limit = config.para_limit
    ques_limit = config.ques_limit
    char_limit = config.char_limit

    def filter_func(exm):
        return len(exm["context_tokens"]) > para_limit or len(exm["ques_tokens"]) > ques_limit

    # build_features(config, examples, config.data_split, record_file, word2idx_dict, char2idx_dict)
    # save(eval_file, eval_examples, message='{} eval'.format(config.data_split))
    def _get_word(word):
        # Fall back through case variants; index 1 is the out-of-vocabulary id.
        for each in (word, word.lower(), word.capitalize(), word.upper()):
            if each in word2idx_dict:
                return word2idx_dict[each]
        return 1

    def _get_char(char):
        return char2idx_dict.get(char, 1)

    data_points = []
    with tqdm(total=os.path.getsize(example_jsonl)) as pbar:
        with open(example_jsonl, "r") as fr:
            for l in fr:
                pbar.update(len(l))
                example = json.loads(l)
                if filter_func(example):
                    continue

                context_idxs = torch.LongTensor(para_limit).zero_()
                context_char_idxs = torch.LongTensor(para_limit, char_limit).zero_()
                ques_idxs = torch.LongTensor(ques_limit).zero_()
                ques_char_idxs = torch.LongTensor(ques_limit, char_limit).zero_()

                for i, token in enumerate(example["context_tokens"]):
                    context_idxs[i] = _get_word(token)

                for i, token in enumerate(example["ques_tokens"]):
                    ques_idxs[i] = _get_word(token)

                for i, token in enumerate(example["context_chars"]):
                    for j, char in enumerate(token):
                        if j == char_limit:
                            break
                        context_char_idxs[i, j] = _get_char(char)

                for i, token in enumerate(example["ques_chars"]):
                    for j, char in enumerate(token):
                        if j == char_limit:
                            break
                        ques_char_idxs[i, j] = _get_char(char)

                y1, y2 = example["y1s"][-1], example["y2s"][-1]

                data_points.append({'context_idxs': context_idxs,
                                    'context_char_idxs': context_char_idxs,
                                    'ques_idxs': ques_idxs,
                                    'ques_char_idxs': ques_char_idxs,
                                    'y1': y1,
                                    'y2': y2,
                                    'id': example['id'],
                                    'start_end_facts': example['start_end_facts']})
    torch.save(data_points, record_file)

    del data_points

    eval_examples = {}
    with tqdm(total=os.path.getsize(eval_example_jsonl)) as pbar:
        with open(eval_example_jsonl, "r") as fe:
            for l in fe:
                pbar.update(len(l))
                e = json.loads(l)
                eval_examples[e['id']] = e
    save(eval_file, eval_examples, message='{} eval'.format(config.data_split))


def prepro(config):
    if config.data_split == 'train':
        prepro_train(config)
    else:
        prepro_dev(config)

Thank you, but there is no prepro_dev (last line) in your script, or am I missing something?

Renaming the original prepro to prepro_dev does the trick.

That worked, thank you for your help, but sadly I still get a memory error on the first part:

6962 tokens have corresponding embedding vector
 66%|█████████████████████████████████████▋                   | 7523156815/11395583004 [2:50:05<1:25:39, 753511.45it/s]
Traceback (most recent call last):
  File "main.py", line 86, in <module>
    prepro(config)
  File "F:\Python\WPy-3670\examples\LSTM\hotpot\prepro.py", line 491, in prepro
    prepro_train(config)
  File "F:\Python\WPy-3670\examples\LSTM\hotpot\prepro.py", line 384, in prepro_train
    context_char_idxs = torch.LongTensor(para_limit, char_limit).zero_()
RuntimeError: $ Torch: not enough memory: you tried to allocate 0GB. Buy new RAM! at ..\aten\src\TH\THGeneral.cpp:201

Then I suggest using multiple HDF5 files to save all the data_points.
For training, create an HDF5 data loader.
A good tutorial is here: https://github.com/fab-jul/hdf5_dataloader
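To make that concrete, here is a minimal sketch, assuming h5py is installed; it only covers the fixed-size tensor fields of data_points (the variable-length 'id' and 'start_end_facts' fields would still need to be stored separately), and the function and class names are illustrative rather than part of the repo:

import h5py
import numpy as np
import torch
from torch.utils.data import Dataset

TENSOR_KEYS = ('context_idxs', 'context_char_idxs', 'ques_idxs', 'ques_char_idxs')

def save_hdf5(data_points, path):
    # Stack each fixed-size field across examples and write one dataset per field.
    with h5py.File(path, 'w') as f:
        for key in TENSOR_KEYS:
            f.create_dataset(key, data=np.stack([dp[key].numpy() for dp in data_points]))
        f.create_dataset('y1', data=np.array([dp['y1'] for dp in data_points]))
        f.create_dataset('y2', data=np.array([dp['y2'] for dp in data_points]))

class HDF5Dataset(Dataset):
    # Reads one example at a time from disk instead of keeping all data_points in RAM.
    def __init__(self, path):
        self.path = path
        self.h5 = None
        with h5py.File(path, 'r') as f:
            self.length = f['y1'].shape[0]

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        if self.h5 is None:  # open lazily so each DataLoader worker gets its own handle
            self.h5 = h5py.File(self.path, 'r')
        return {key: torch.as_tensor(self.h5[key][idx]) for key in self.h5}

A torch.utils.data.DataLoader over HDF5Dataset can then batch the returned dicts with its default collate function.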

Hm, thank you for your help, but I could preprocess (after shutting down everything else that was taking RAM and looking through everything in prepro.py) and save all the necessary files. The problem now appears when I try to train:

After python main.py --mode train --para_limit 2250 --batch_size 24 --init_lr 0.1 --keep_prob 1.0 --sp_lambda 1.0 I get:

invalid index of a 0-dim tensor. Use tensor.item() to convert a 0-dim tensor to a Python number

Any help would be appreciated, thanks! 👍

This is due to the PyTorch version.

-           total_loss += loss.data[0]
+           total_loss += loss.item()
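If you want the same training code to run on both PyTorch 0.3.x and 0.4+, one hedged option (not from the repo) is to branch on whether .item() is available:

# loss.item() exists from PyTorch 0.4 onward; older versions index into loss.data.
total_loss += loss.item() if hasattr(loss, 'item') else loss.data[0]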

Thank you again, it's training now; it will take a few hours or days, and I can't wait to see if it works! Hope there are no more errors. You've been most helpful! 😊 👍

Oh no! It trained fine, but stopped after episode 0 with an F1 score of 46. When I try to predict it gives a shape error!

With python main.py --mode test --data_split dev --para_limit 2250 --batch_size 24 --init_lr 0.1 --keep_prob 1.0 --sp_lambda 1.0 --save HOTPOT-20190113-103231 --prediction_file dev_distractor_pred.json:

RuntimeError: Error(s) in loading state_dict for SPModel:
        size mismatch for rnn_start.rnns.0.weight_ih_l0: copying a param with shape torch.Size([240, 81]) from checkpoint, the shape in current model is torch.Size([240, 240]).
        size mismatch for rnn_start.rnns.0.weight_ih_l0_reverse: copying a param with shape torch.Size([240, 81]) from checkpoint, the shape in current model is torch.Size([240, 240]).
        size mismatch for rnn_end.rnns.0.weight_ih_l0: copying a param with shape torch.Size([240, 241]) from checkpoint, the shape in current model is torch.Size([240, 240]).
        size mismatch for rnn_end.rnns.0.weight_ih_l0_reverse: copying a param with shape torch.Size([240, 241]) from checkpoint, the shape in current model is torch.Size([240, 240]).
        size mismatch for rnn_type.rnns.0.weight_ih_l0: copying a param with shape torch.Size([240, 241]) from checkpoint, the shape in current model is torch.Size([240, 240]).
        size mismatch for rnn_type.rnns.0.weight_ih_l0_reverse: copying a param with shape torch.Size

So close and yet so far! 🤔 Could this be because of the changed preprocessing? What else could cause this? Any help would be appreciated, thank you!

Closing, as the last bit of the discussion has been opened as a new issue: #11