SHI-Labs / Neighborhood-Attention-Transformer

Neighborhood Attention Transformer, arXiv 2022 / CVPR 2023. Dilated Neighborhood Attention Transformer, arXiv 2022

CUDA out of memory

zhangzheng2242 opened this issue · comments

commented

Your work is very good, and we have improved our transformer model based on your ideas, but why do we run out of CUDA memory at the same batch_size? In theory, the computation should be reduced and we should be able to set a larger batch_size.

Hello and thank you for your interest.
Could you provide more details, such as what the task is, what the batch size was, and what you changed?
Technically, NA's memory usage (with the latest CUDA extension) should be even slightly less than a Window Self-Attention block's. But if you replaced some other form of attention with NA, it may behave differently depending on the implementation.

As far as memory goes, NA itself is quite memory-efficient: it doesn't compute any intermediary tensors other than the attention weights.

Also, please keep in mind that theoretical computation won't always align with memory usage; there's actually no guarantee there. Keep in mind as well that, for instance, deeper models tend to use up a lot more memory when training (for a number of reasons, such as the increased context and gradient accumulation), but they end up using less memory at inference.
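For reference, a minimal sketch (not from this thread) of how to compare the peak training memory of two blocks at the same batch size; old_block, na_block, and the input shape are hypothetical placeholders:

import torch

def peak_memory_mb(block, x):
    torch.cuda.reset_peak_memory_stats()
    block(x).sum().backward()  # include the backward pass, as in training
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**20

# x = torch.randn(10, 15, 128, device="cuda", requires_grad=True)
# print(peak_memory_mb(old_block, x), peak_memory_mb(na_block, x))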

I hope this clarifies things a bit, but if it doesn't, feel free to continue the discussion.

commented

Hello, thank you for your reply.
I want to use your theory to solve a neighborhood attention problem. For example, suppose the original Q, K, V have shape (B=10, head_num=1, token_num=15, token_dim=128). The 15 tokens form a one-dimensional sequence, and we want to compute attention with window_size=5 (q[0] attends to tokens [0, 1, 2, 3, 4]; q[1] and q[2] also to [0, 1, 2, 3, 4]; q[3] to [1, 2, 3, 4, 5]; and so on). This gives a corresponding attn_index tensor of shape (15, 5):
tensor([[ 0,  1,  2,  3,  4],
        [ 0,  1,  2,  3,  4],
        [ 0,  1,  2,  3,  4],
        [ 1,  2,  3,  4,  5],
        [ 2,  3,  4,  5,  6],
        [ 3,  4,  5,  6,  7],
        [ 4,  5,  6,  7,  8],
        [ 5,  6,  7,  8,  9],
        [ 6,  7,  8,  9, 10],
        [ 7,  8,  9, 10, 11],
        [ 8,  9, 10, 11, 12],
        [ 9, 10, 11, 12, 13],
        [10, 11, 12, 13, 14],
        [10, 11, 12, 13, 14],
        [10, 11, 12, 13, 14]])
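This index can be reproduced with a short sketch (assuming, as in the listing, a window of 5 centered on each token and clamped at the sequence borders):

import torch

token_num, window_size = 15, 5
centers = torch.arange(token_num)
starts = (centers - window_size // 2).clamp(0, token_num - window_size)  # clamp windows at the edges
attn_index = starts.unsqueeze(1) + torch.arange(window_size)             # shape (15, 5), matches the listing above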
Modified original Transformer:
attn = q.unsqueeze(3) @ k[:, :, attn_index].transpose(-2, -1)  # A = Q K^T over each size-5 window
attn = attn * self.rescale
attn = attn.softmax(dim=-1)
x = (attn @ v[:, :, attn_index]).squeeze(3)  # weighted sum of the gathered values
I don't know if there is a problem with the code and theory I changed, especially with respect to the indices.

Hi,
I'm not sure I understand, but the indices you expect appear to be with respect to a 1D neighborhood, not a 2D one.
For that, you can use NeighborhoodAttention1d from natten, which we just added.
This class works on 1D data (Batch, Heads, Length, Dim) and would probably be more suitable for your case, if I understand it correctly.
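A hedged usage sketch for the shapes in this thread (B=10, 1 head, 15 tokens, dim 128, window 5); the constructor arguments (dim, num_heads, kernel_size) and whether the module expects (batch, length, dim) input or the (Batch, Heads, Length, Dim) layout mentioned above are assumptions, so check the class definition in your natten version:

import torch
from natten import NeighborhoodAttention1d

layer = NeighborhoodAttention1d(dim=128, num_heads=1, kernel_size=5)  # assumed signature
x = torch.randn(10, 15, 128)  # assumed (batch, length, dim) input
out = layer(x)                # neighborhood attention over a size-5 window per token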

commented

Thank you very much for your help!

commented

gradcheck.py runs successfully:
(screenshot attached)
But python3 natten/gradcheck1d.py (the 1D NA check) gives no error and no response:
(screenshot attached)

Hello, I have a problem while running 'python3 natten/gradcheck1d.py # 1D NA': it gives no error and no response. But it worked when I ran 'python3 natten/gradcheck.py'. I'm not sure if there's something wrong with the 1D CUDA extension.

Well, you would have to wait for the 1D extension to compile as well; if you're using ninja, it will start compiling when gradcheck1d.py is called.
How long have you let gradcheck1d.py run? Have you noticed CUDA processes starting?
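For reference, torch.autograd.gradcheck itself works like the generic illustration below (this is not the contents of gradcheck1d.py, which checks the 1D NA CUDA kernels instead); it compares analytical gradients against finite differences in double precision, and with ninja the first use of the extension triggers the JIT build, which can take a while:

import torch
from torch.autograd import gradcheck

x = torch.randn(2, 3, dtype=torch.float64, requires_grad=True)
print(gradcheck(torch.tanh, (x,), eps=1e-6, atol=1e-4))  # prints True if the gradients match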

commented

Thank you, I have found the solution. There is no problem with your code; very good work!
