SHI-Labs / Neighborhood-Attention-Transformer

Neighborhood Attention Transformer, arXiv 2022 / CVPR 2023. Dilated Neighborhood Attention Transformer, arXiv 2022

CUDA out of memory

zhangzheng2242 opened this issue · comments

commented

Your work is very good, and we have improved our transformer model based on your ideas, but why do we run out of CUDA memory at the same batch_size? In theory, the computation should be reduced and we should be able to set a larger batch_size.

Hello and thank you for your interest.
Could you provide more details, such as what the task is, what the batch size was, and what you changed?
Technically, NA's memory usage (with the latest CUDA extension) should be even slightly less than a Window Self-Attention block's. But if you replaced some other form of attention with NA, it may behave differently depending on the implementation.

As far as memory goes, NA itself is quite memory-efficient: it doesn't compute any intermediary tensors other than the attention weights.

Also, please keep in mind that theoretical computation won't always align with memory usage; there's actually no guarantee there. Keep in mind as well that, for instance, deeper models tend to use up a lot more memory when training (for a number of reasons, such as the increased context and gradient accumulation), but they end up using less memory at inference.
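For reference, a minimal sketch (not from this thread) of how to compare the peak training memory of two blocks at the same batch size; old_block, na_block, and the input shape are hypothetical placeholders:

import torch

def peak_memory_mb(block, x):
    torch.cuda.reset_peak_memory_stats()
    block(x).sum().backward()  # include the backward pass, as in training
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**20

# x = torch.randn(10, 15, 128, device="cuda", requires_grad=True)
# print(peak_memory_mb(old_block, x), peak_memory_mb(na_block, x))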

I hope this clarifies things a bit, but if it doesn't, feel free to continue the discussion.

commented

Hello, thank you for your reply.
I want to use your theory to solve a neighborhood attention problem. For example, suppose the original Q, K, V have shape (B=10, head_num=1, token_num=15, token_dim=128). The 15 tokens form a one-dimensional sequence, and we want to compute attention with window_size=5 (q[0] attends to tokens [0, 1, 2, 3, 4]; q[1] and q[2] also to [0, 1, 2, 3, 4]; q[3] to [1, 2, 3, 4, 5]; and so on). This gives a corresponding attn_index tensor of shape (15, 5):
tensor([[ 0,  1,  2,  3,  4],
        [ 0,  1,  2,  3,  4],
        [ 0,  1,  2,  3,  4],
        [ 1,  2,  3,  4,  5],
        [ 2,  3,  4,  5,  6],
        [ 3,  4,  5,  6,  7],
        [ 4,  5,  6,  7,  8],
        [ 5,  6,  7,  8,  9],
        [ 6,  7,  8,  9, 10],
        [ 7,  8,  9, 10, 11],
        [ 8,  9, 10, 11, 12],
        [ 9, 10, 11, 12, 13],
        [10, 11, 12, 13, 14],
        [10, 11, 12, 13, 14],
        [10, 11, 12, 13, 14]])
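This index can be reproduced with a short sketch (assuming, as in the listing, a window of 5 centered on each token and clamped at the sequence borders):

import torch

token_num, window_size = 15, 5
centers = torch.arange(token_num)
starts = (centers - window_size // 2).clamp(0, token_num - window_size)  # clamp windows at the edges
attn_index = starts.unsqueeze(1) + torch.arange(window_size)             # shape (15, 5), matches the listing above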
Modified original Transformer:
attn = q.unsqueeze(3) @ k[:, :, attn_index].transpose(-2, -1)  # A = Q K^T over each size-5 window
attn = attn * self.rescale
attn = attn.softmax(dim=-1)
x = (attn @ v[:, :, attn_index]).squeeze(3)  # weighted sum of the gathered values
I don't know if there is a problem with the code and theory I changed, especially with respect to the indices.

Hi,
I'm not sure I understand, but the indices you expect appear to be with respect to a 1D neighborhood, not a 2D one.
For that, you can use NeighborhoodAttention1d from natten, which we just added.
This class works on 1D data (Batch, Heads, Length, Dim) and would probably be more suitable for your case, if I understand it correctly.
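A hedged usage sketch for the shapes in this thread (B=10, 1 head, 15 tokens, dim 128, window 5); the constructor arguments (dim, num_heads, kernel_size) and whether the module expects (batch, length, dim) input or the (Batch, Heads, Length, Dim) layout mentioned above are assumptions, so check the class definition in your natten version:

import torch
from natten import NeighborhoodAttention1d

layer = NeighborhoodAttention1d(dim=128, num_heads=1, kernel_size=5)  # assumed signature
x = torch.randn(10, 15, 128)  # assumed (batch, length, dim) input
out = layer(x)                # neighborhood attention over a size-5 window per token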

commented

Thank you very much for your help!

commented

gradcheck.py runs successfully:
(screenshot attached)
But python3 natten/gradcheck1d.py (the 1D NA check) gives no error and no response:
(screenshot attached)

Hello, I have a problem while running 'python3 natten/gradcheck1d.py # 1D NA': it gives no error and no response. But it worked when I ran 'python3 natten/gradcheck.py'. I'm not sure if there's something wrong with the 1D CUDA extension.

Well, you would have to wait for the 1D extension to compile as well; if you're using ninja, it will start compiling when gradcheck1d.py is called.
How long have you let gradcheck1d.py run? Have you noticed CUDA processes starting?
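For reference, torch.autograd.gradcheck itself works like the generic illustration below (this is not the contents of gradcheck1d.py, which checks the 1D NA CUDA kernels instead); it compares analytical gradients against finite differences in double precision, and with ninja the first use of the extension triggers the JIT build, which can take a while:

import torch
from torch.autograd import gradcheck

x = torch.randn(2, 3, dtype=torch.float64, requires_grad=True)
print(gradcheck(torch.tanh, (x,), eps=1e-6, atol=1e-4))  # prints True if the gradients match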

commented

Thank you, I have found the solution. There is no problem with your code; very good work!
