galsang / BiDAF-pytorch

Re-implementation of BiDAF (Bidirectional Attention Flow for Machine Comprehension, Minjoon Seo et al., ICLR 2017) in PyTorch.


GPU memory issues

FelixAbrahamsson opened this issue

First of all, fantastic work on this implementation of BiDAF, very compact and readable!

However, I am having strange trouble with GPU memory consumption. With a train batch size of 10, a dev batch size of 50, and a context threshold of 400, it uses up to 10 GB of memory during training. Full disclosure: I'm using a Google-translated version of SQuAD in a different language, but with the context threshold set to 400 I don't expect this to make a significant difference. There is also no linear relationship at all between batch size and memory consumption; for example, I can just barely keep the train batch size at 20 without running out of memory (my card has 12 GB). Any idea what might cause this behaviour? Were you able to train the network at batch sizes 60/100 with 12 GB of GPU RAM?
EDIT: For reference, I have been able to train BiDAF on the same dataset and the same hardware with the authors' TensorFlow implementation at batch size 50.

I also noticed a couple of minor issues:

  1. Not specifying a dimension for torch.squeeze() in the model's forward function will cause it to crash with batch size 1. I am far from a PyTorch expert, so I can't say what best practice is, but to me it seems safer to always specify the dimension argument to avoid these kinds of issues (see the small example after the code below) :)
  2. If the maximum word length across all questions in a batch is less than the char channel width, the forward function crashes. I solved this by attaching the following function as a postprocessing function to the CHAR_NESTING field in the SQuAD data module, which simply pads the character sequences:
def char_postprocessing(batch, vocab):
    """Pad each word's character sequence on both sides so that every word
    is at least as long as the char-CNN kernel width."""
    pad_length = 2
    pad_idx = vocab.stoi['<pad>']
    padding = [pad_idx] * pad_length

    padded_batch = []
    for chars in batch:
        # prepend and append <pad> indices around the original character indices
        padded_batch.append(padding + chars + padding)

    return padded_batch
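
For point 1, here is a tiny standalone illustration (not code from the repo) of how an un-targeted squeeze() also drops a batch dimension of size 1:

import torch

x = torch.zeros(1, 1, 100)    # (batch=1, a size-1 dim, hidden)
print(x.squeeze().shape)      # torch.Size([100])   -- the batch dim is gone too
print(x.squeeze(1).shape)     # torch.Size([1, 100]) -- only dim 1 is removed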

Thanks for your helpful feedback.
The minor issues you mentioned will be fixed in a few days (I'm not sure exactly how many).
The memory issue is actually somewhat more complex; it may be due to both

  1. different behavior between PyTorch and TensorFlow, and
  2. an incomplete and inefficient implementation.

One problem I've already noticed is that we need to wrap evaluation in with torch.set_grad_enabled(False): at test time so that no gradient history is saved.
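
Something along these lines (just a sketch, not the exact test loop; it assumes the model returns start/end logits p1 and p2 for a batch):

import torch

def evaluate(model, dev_iter):
    # Sketch only: assumes model(batch) returns start/end logits (p1, p2).
    model.eval()
    with torch.set_grad_enabled(False):   # equivalent to torch.no_grad()
        for batch in dev_iter:
            p1, p2 = model(batch)
            # compute EM / F1 from p1 and p2 here
    model.train()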

Ah yes, that's a good point about torch.set_grad_enabled(False); that should reduce memory consumption at test time by a large amount.

Actually, I wasn't aware of how memory allocation works in PyTorch; it apparently uses a caching memory allocator, which is why nvidia-smi sometimes shows the GPU using all of its memory even at batch size 10.
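
A quick way to see the difference (assuming a CUDA build; memory_reserved() is called memory_cached() in older PyTorch releases):

import torch

# Memory actually occupied by live tensors:
print(torch.cuda.memory_allocated() / 1024 ** 2, "MiB allocated")
# Memory held by the caching allocator (roughly what nvidia-smi reports):
print(torch.cuda.memory_reserved() / 1024 ** 2, "MiB reserved")

# Release unused cached blocks back to the driver (live tensors are unaffected):
torch.cuda.empty_cache()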

So it seems that the hard limit for me is now around batch size 20, with test operations wrapped in torch.no_grad(). This is still a fair bit below what I managed with TF, though I don't remember what context threshold was used there to limit overly long contexts.

Can confirm that updating to torch 1.0 solved the memory issue! I can now train the network with batch size 60.