pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration

Home Page: https://pytorch.org

Grad strides do not match bucket view strides.

xingxinggui opened this issue · comments

[W reducer.cpp:313] Warning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [1024, 1024, 1, 1], strides() = [1024, 1, 1024, 1024]
bucket_view.sizes() = [1024, 1024, 1, 1], strides() = [1024, 1, 1, 1] (function operator())

This problem impairs performance. What can I do?

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @VitalyFedyunin @jamesr66a @ppwwyyxx

My model is a CNN; it contains this layer:
nn.Conv2d(1024, 1024, 1)

I am using DDP (DistributedDataParallel).
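
For reference, a minimal sketch of that setup (illustrative only: the batch size, spatial size, and torchrun-based launch are assumptions, not details from this report):

import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Assumes launch via torchrun, which sets RANK / WORLD_SIZE / LOCAL_RANK.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Conv2d(1024, 1024, 1).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    x = torch.randn(2, 1024, 8, 8, device=f"cuda:{local_rank}")  # batch size > 1
    model(x).sum().backward()  # the reducer compares grad strides to its bucket views here
    dist.destroy_process_group()

if __name__ == "__main__":
    main()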

cc @mcarilli. In this case the warning seems to be spurious: the strides are nominally different, but the physical layout is the same.
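
To see why the two layouts above coincide (a quick sketch, not from this thread): a dimension of size 1 contributes nothing to the address computation, so its stride value is irrelevant.

import torch

g = torch.randn(1024, 1024, 1, 1)                   # default strides: (1024, 1, 1, 1)
h = g.as_strided(g.size(), (1024, 1, 1024, 1024))   # the strides reported for the grad
print(g.stride(), h.stride())                        # nominally different
print(g.data_ptr() == h.data_ptr(), torch.equal(g, h))  # same storage, same elements
print(h.is_contiguous())                             # True: size-1 dims are skipped in the check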

I ran into this problem in distributed training with batch size > 1; with batch size = 1, or on a single GPU, it does not occur.
I suspected the problem was caused by a transpose or permute. When I removed the transpose/permute, or added .contiguous() right after it, the warning went away.
So I suspect the transpose leaves the gradient with the wrong strides; the tensor needs to be made contiguous.

Before:

 vit_ll = eval('self.vit'+str(i)).forward_wo_cls(x).transpose(1,2).view([x.shape[0],-1,*x.shape[2:]])

After (fixed):

vit_ll = eval('self.vit'+str(i)).forward_wo_cls(x).transpose(1,2).contiguous().view([x.shape[0],-1,*x.shape[2:]])
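
As a side note, .reshape() should behave the same as .contiguous().view() here, since it falls back to a contiguous copy whenever the transposed result cannot be expressed as a view:

vit_ll = eval('self.vit'+str(i)).forward_wo_cls(x).transpose(1,2).reshape([x.shape[0],-1,*x.shape[2:]])  # copies only when needed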

Agree with @starhiking: tensors should be made contiguous once their views have been changed. I solved my problem in a similar way. Looking through the tensor views doc at https://pytorch.org/docs/stable/tensor_view.html might be very helpful.
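
A tiny illustration of that point (a sketch): a transposed view is not contiguous until .contiguous() copies it into fresh memory.

import torch

t = torch.randn(2, 3, 4)
v = t.transpose(1, 2)                   # a view with permuted strides
print(v.is_contiguous())                # False
print(v.contiguous().is_contiguous())   # True: backed by a newly copied buffer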

@starhiking, it also happens when I use a 1x1 convolution kernel. But why?

@MRI000000, I am also hitting this issue. Have you resolved it?

It may be caused by distributed training.

I get this issue when using channels_last training, and the optimizer was defined before I switched the model over to channels_last.
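
A sketch of the ordering that avoids that (assumptions: a single Conv2d stands in for the real model, and the DDP wrapping is left out so the snippet runs without a process group): convert to channels_last first, then construct the optimizer (and wrap in DDP), so the parameter layout never changes afterwards.

import torch
import torch.nn as nn

model = nn.Conv2d(1024, 1024, 1)
model = model.to(memory_format=torch.channels_last)       # convert the parameters first
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # then build the optimizer
x = torch.randn(8, 1024, 16, 16).to(memory_format=torch.channels_last)
model(x).sum().backward()
optimizer.step()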

I get this issue when using a U-Net, trying to use BatchNorm2d together with TransposeConv.

Is there an update on this @rohan-varma (tagged you since you removed the triaged tag)? I am facing the same issue, but only with DDP; otherwise the code runs through without any issues.

I'm seeing this warning too, though the model seems to be running/converging okay.