hotpotqa / hotpot

not enough memory: you tried to allocate 0GB.

michael20at opened this issue

I tried running python main.py --mode prepro --data_file hotpot_train_v1.json --para_limit 2250 --data_split train, which worked fine until I got:

RuntimeError: $ Torch: not enough memory: you tried to allocate 0GB. Buy new RAM! at ..\aten\src\TH\THGeneral.cpp:204.

I have 32GB of RAM and a 1070 GPU (8GB). Is that not enough? And why is it saying 0GB?

PS: I'm using PyTorch 0.4, might that be an issue? I don't see how it connects to RAM, though.

In my experiments, the current implementation uses 60+ GB CPU RAM for preprocessing, which exceeds the limit of your hardware. You might need to find another machine or modify the preprocessing code (e.g., slicing the input into small chunks).
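For reference, a minimal sketch of the "slicing the input into small chunks" idea (the function name and chunk size below are illustrative, not part of the repo); each chunk file could then be fed through preprocessing separately as --data_file:

import json

def split_train_file(data_file, chunk_size=10000):
    # Load the full HotpotQA train JSON once and write it back out as smaller chunk files.
    with open(data_file, 'r') as f:
        data = json.load(f)
    for i in range(0, len(data), chunk_size):
        chunk_path = '{}.chunk{:03d}.json'.format(data_file, i // chunk_size)
        with open(chunk_path, 'w') as f:
            json.dump(data[i:i + chunk_size], f)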

I managed to preprocess and train on a 32GB machine, but the preprocessing is slower than the original code. Give it a try if you don't have access to larger machines.

Using HDF5 for the data loader may also save memory.

# Note: _process_article, get_embedding, and save are helpers from the repo's prepro.py.
import json
import os
import random
from collections import Counter

import torch
from tqdm import tqdm


def prepro_train(config):
    random.seed(13)

    record_file = config.train_record_file
    eval_file = config.train_eval_file

    example_jsonl = 'examples.jsonl'
    eval_example_jsonl = 'eval_examples.jsonl'

    word_counter, char_counter = Counter(), Counter()
    data = json.load(open(config.data_file, 'r'))
    # Both files are opened in append mode, so delete them before re-running.
    with open(example_jsonl, "a") as fr:
        with open(eval_example_jsonl, "a") as fe:
            for article in tqdm(data, total=len(data)):
                example, eval_example = _process_article(article, config)
                for token in example['ques_tokens'] + example['context_tokens']:
                    word_counter[token] += 1
                    for char in token:
                        char_counter[char] += 1
                json.dump(example, fr)
                fr.write('\n')

                json.dump(eval_example, fe)
                fe.write('\n')

    word_emb_mat, word2idx_dict, idx2word_dict = get_embedding(word_counter, "word",
                                                               emb_file=config.glove_word_file,
                                                               size=config.glove_word_size,
                                                               vec_size=config.glove_dim,
                                                               token2idx_dict=None)

    char_emb_mat, char2idx_dict, idx2char_dict = get_embedding(
        char_counter, "char", emb_file=None, size=None, vec_size=config.char_dim, token2idx_dict=None)

    if not os.path.isfile(config.word2idx_file):
        save(config.word_emb_file, word_emb_mat, message="word embedding")
        save(config.char_emb_file, char_emb_mat, message="char embedding")
        save(config.word2idx_file, word2idx_dict, message="word2idx")
        save(config.char2idx_file, char2idx_dict, message="char2idx")
        save(config.idx2word_file, idx2word_dict, message='idx2word')
        save(config.idx2char_file, idx2char_dict, message='idx2char')

    # with open(config.word2idx_file, "r") as fh:
    #     word2idx_dict = json.load(fh)
    #
    # with open(config.char2idx_file, "r") as fh:
    #     char2idx_dict = json.load(fh)

    para_limit = config.para_limit
    ques_limit = config.ques_limit
    char_limit = config.char_limit

    def filter_func(exm):
        return len(exm["context_tokens"]) > para_limit or len(exm["ques_tokens"]) > ques_limit

    # build_features(config, examples, config.data_split, record_file, word2idx_dict, char2idx_dict)
    # save(eval_file, eval_examples, message='{} eval'.format(config.data_split))
    def _get_word(word):
        # Fall back through case variants; index 1 is the out-of-vocabulary id.
        for each in (word, word.lower(), word.capitalize(), word.upper()):
            if each in word2idx_dict:
                return word2idx_dict[each]
        return 1

    def _get_char(char):
        return char2idx_dict.get(char, 1)

    data_points = []
    with tqdm(total=os.path.getsize(example_jsonl)) as pbar:
        with open(example_jsonl, "r") as fr:
            for l in fr:
                pbar.update(len(l))
                example = json.loads(l)
                if filter_func(example):
                    continue

                context_idxs = torch.LongTensor(para_limit).zero_()
                context_char_idxs = torch.LongTensor(para_limit, char_limit).zero_()
                ques_idxs = torch.LongTensor(ques_limit).zero_()
                ques_char_idxs = torch.LongTensor(ques_limit, char_limit).zero_()

                for i, token in enumerate(example["context_tokens"]):
                    context_idxs[i] = _get_word(token)

                for i, token in enumerate(example["ques_tokens"]):
                    ques_idxs[i] = _get_word(token)

                for i, token in enumerate(example["context_chars"]):
                    for j, char in enumerate(token):
                        if j == char_limit:
                            break
                        context_char_idxs[i, j] = _get_char(char)

                for i, token in enumerate(example["ques_chars"]):
                    for j, char in enumerate(token):
                        if j == char_limit:
                            break
                        ques_char_idxs[i, j] = _get_char(char)

                y1, y2 = example["y1s"][-1], example["y2s"][-1]

                data_points.append({'context_idxs': context_idxs,
                                    'context_char_idxs': context_char_idxs,
                                    'ques_idxs': ques_idxs,
                                    'ques_char_idxs': ques_char_idxs,
                                    'y1': y1,
                                    'y2': y2,
                                    'id': example['id'],
                                    'start_end_facts': example['start_end_facts']})
    torch.save(data_points, record_file)

    del data_points

    eval_examples = {}
    with tqdm(total=os.path.getsize(eval_example_jsonl)) as pbar:
        with open(eval_example_jsonl, "r") as fe:
            for l in fe:
                pbar.update(len(l))
                e = json.loads(l)
                eval_examples[e['id']] = e
    save(eval_file, eval_examples, message='{} eval'.format(config.data_split))


def prepro(config):
    if config.data_split == 'train':
        prepro_train(config)
    else:
        prepro_dev(config)

Thank you, but there is no prepro_dev (last line) in your script, or am I missing something?

Renaming the original prepro to prepro_dev does the trick.

That worked, thank you for your help, but sadly I still get a memory error on the first part:

6962 tokens have corresponding embedding vector
 66%|█████████████████████████████████████▋                   | 7523156815/11395583004 [2:50:05<1:25:39, 753511.45it/s]
Traceback (most recent call last):
  File "main.py", line 86, in <module>
    prepro(config)
  File "F:\Python\WPy-3670\examples\LSTM\hotpot\prepro.py", line 491, in prepro
    prepro_train(config)
  File "F:\Python\WPy-3670\examples\LSTM\hotpot\prepro.py", line 384, in prepro_train
    context_char_idxs = torch.LongTensor(para_limit, char_limit).zero_()
RuntimeError: $ Torch: not enough memory: you tried to allocate 0GB. Buy new RAM! at ..\aten\src\TH\THGeneral.cpp:201

Then I suggest using multiple HDF5 files to save all the data_points.
For training, create an HDF5 data loader.
A good tutorial is here: https://github.com/fab-jul/hdf5_dataloader
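To make that concrete, here is a minimal sketch, assuming h5py is installed; it only covers the fixed-size tensor fields of data_points (the variable-length 'id' and 'start_end_facts' fields would still need to be stored separately), and the function and class names are illustrative rather than part of the repo:

import h5py
import numpy as np
import torch
from torch.utils.data import Dataset

TENSOR_KEYS = ('context_idxs', 'context_char_idxs', 'ques_idxs', 'ques_char_idxs')

def save_hdf5(data_points, path):
    # Stack each fixed-size field across examples and write one dataset per field.
    with h5py.File(path, 'w') as f:
        for key in TENSOR_KEYS:
            f.create_dataset(key, data=np.stack([dp[key].numpy() for dp in data_points]))
        f.create_dataset('y1', data=np.array([dp['y1'] for dp in data_points]))
        f.create_dataset('y2', data=np.array([dp['y2'] for dp in data_points]))

class HDF5Dataset(Dataset):
    # Reads one example at a time from disk instead of keeping all data_points in RAM.
    def __init__(self, path):
        self.path = path
        self.h5 = None
        with h5py.File(path, 'r') as f:
            self.length = f['y1'].shape[0]

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        if self.h5 is None:  # open lazily so each DataLoader worker gets its own handle
            self.h5 = h5py.File(self.path, 'r')
        return {key: torch.as_tensor(self.h5[key][idx]) for key in self.h5}

A torch.utils.data.DataLoader over HDF5Dataset can then batch the returned dicts with its default collate function.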

Hm, thank you for your help, but I could preprocess (after shutting down everything else that was taking RAM and looking through everything in prepro.py) and save all the necessary files. The problem now appears when I try to train:

After python main.py --mode train --para_limit 2250 --batch_size 24 --init_lr 0.1 --keep_prob 1.0 --sp_lambda 1.0 I get:

invalid index of a 0-dim tensor. Use tensor.item() to convert a 0-dim tensor to a Python number

Any help would be appreciated, thanks! 👍

This is due to the PyTorch version.

-           total_loss += loss.data[0]
+           total_loss += loss.item()
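If you want the same training code to run on both PyTorch 0.3.x and 0.4+, one hedged option (not from the repo) is to branch on whether .item() is available:

# loss.item() exists from PyTorch 0.4 onward; older versions index into loss.data.
total_loss += loss.item() if hasattr(loss, 'item') else loss.data[0]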

Thank you again, it's training now; it will take a few hours or days, and I can't wait to see if it works! Hope there are no more errors. You've been most helpful! 😊 👍

Oh no! It trained fine, but stopped after episode 0 with an F1 score of 46. When I try to predict it gives a shape error!

With python main.py --mode test --data_split dev --para_limit 2250 --batch_size 24 --init_lr 0.1 --keep_prob 1.0 --sp_lambda 1.0 --save HOTPOT-20190113-103231 --prediction_file dev_distractor_pred.json:

RuntimeError: Error(s) in loading state_dict for SPModel:
        size mismatch for rnn_start.rnns.0.weight_ih_l0: copying a param with shape torch.Size([240, 81]) from checkpoint, the shape in current model is torch.Size([240, 240]).
        size mismatch for rnn_start.rnns.0.weight_ih_l0_reverse: copying a param with shape torch.Size([240, 81]) from checkpoint, the shape in current model is torch.Size([240, 240]).
        size mismatch for rnn_end.rnns.0.weight_ih_l0: copying a param with shape torch.Size([240, 241]) from checkpoint, the shape in current model is torch.Size([240, 240]).
        size mismatch for rnn_end.rnns.0.weight_ih_l0_reverse: copying a param with shape torch.Size([240, 241]) from checkpoint, the shape in current model is torch.Size([240, 240]).
        size mismatch for rnn_type.rnns.0.weight_ih_l0: copying a param with shape torch.Size([240, 241]) from checkpoint, the shape in current model is torch.Size([240, 240]).
        size mismatch for rnn_type.rnns.0.weight_ih_l0_reverse: copying a param with shape torch.Size

So close and yet so far! 🤔 Could this be because of the changed preprocessing? What else could cause this? Any help would be appreciated, thank you!

Closing, as the last bit of the discussion has been opened as a new issue: #11