pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration

Home Page: https://pytorch.org

Assertion `pos >= 0 && pos < buffer.size()` failed

kexul opened this issue

My code runs well for a while, then this error occurs randomly:

Traceback (most recent call last):
  File "train.py", line 236, in <module>
    train_vgg("/mnt/new_disk/subset/models/accu_distance_metric_243.pth",dset='luna16')
  File "train.py", line 54, in train_vgg
    loss_value.backward()
  File "/home/uih/anaconda2/lib/python2.7/site-packages/torch/autograd/variable.py", line 169, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
  File "/home/uih/anaconda2/lib/python2.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    variables, grad_variables, retain_graph)
RuntimeError: torch/csrc/autograd/input_buffer.cpp:14: add: Assertion "pos >= 0 && pos < buffer.size()" failed.
Any idea what's happened? Thanks in advance!

Yes, it looks like we have a bug somewhere in the code. Can you please try to isolate a small snippet that still triggers the same error so we can look into it? Thank you!

@apaszke It's hard for me to isolate. The error occurs when I run training and validation in turn; if I just do either training or validation, everything runs well.

The problem is that we won't be able to fix it until we can actually see under what conditions the error happens.

@soumith I got the same error in v0.3.0 today. When I trained my model with a single GPU, it worked perfectly; with multiple GPUs the error occurred:
RuntimeError: torch/csrc/autograd/input_buffer.cpp:14: add: Assertion pos >= 0 && pos < buffer.size() failed.

@soumith We've also been struggling with this issue for a long time. Unfortunately, finding which part of our code specifically triggers the assert (which used to be a segfault before #3466) has proven extremely hard and, as of now, unsuccessful. In fact, we never submitted a bug report because we couldn't create a self-contained piece of code to trigger it.

I'll try to summarize all the information we've gathered:

  • We know for a fact that the issue appeared sometime after this commit: cd9b272. All our code works perfectly fine on cd9b272 or v0.2.
  • The assert is always triggered during the backward pass, but the exact moment when it happens seems to be completely random: when training on ImageNet we've seen it after 50 iterations as well as after 5000. This makes it super-hard to tell whether a particular piece of code is triggering the assert or not, as you can never be sure how many iterations are enough to rule out that it will ever trigger!
  • The issue seems to be somehow related to the use of multiple GPUs (via nn.DataParallel): we've never observed it while training with a single GPU.
  • The issue seems also to be related to the use of user-defined autograd.Function classes: we've never seen it when training networks composed of "standard" Pytorch functions / modules only...
  • ... however, simply running forward / backward passes with our custom functions doesn't seem to be enough to trigger the assertion. In our experience, working with "complex" computational graphs increases the likelihood of encountering this issue.

Recently we've been able to re-create the issue using https://github.com/pytorch/examples/tree/master/imagenet and a modified version of https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py, where we replace some of the layers with our custom-made ones. Unfortunately I can't share the code for now (our paper describing the new layers is still not out), but I'll come back to you as soon as I can!
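
To make the conditions above a bit more concrete, here is a minimal, made-up sketch of the kind of setup we are describing: a custom autograd.Function used inside a model wrapped in nn.DataParallel. The ScaleFn / ScaleLayer names are placeholders for illustration only (they are not our real layers, and the snippet is written against the current Function API rather than the exact 0.3-era code):

import torch
import torch.nn as nn
from torch.autograd import Function

# Toy custom function standing in for our real layers (placeholder only):
# y = x scaled by a per-channel weight.
class ScaleFn(Function):
    @staticmethod
    def forward(ctx, x, weight):
        ctx.save_for_backward(x, weight)
        return x * weight.view(1, -1, 1, 1)

    @staticmethod
    def backward(ctx, grad_out):
        x, weight = ctx.saved_tensors
        grad_x = grad_out * weight.view(1, -1, 1, 1)
        grad_w = (grad_out * x).sum(0).sum(-1).sum(-1)
        return grad_x, grad_w

# nn.Module wrapper so the custom function can be used inside a network.
class ScaleLayer(nn.Module):
    def __init__(self, channels):
        super(ScaleLayer, self).__init__()
        self.weight = nn.Parameter(torch.ones(channels))

    def forward(self, x):
        return ScaleFn.apply(x, self.weight)

# The configuration under which we see the assert: custom layers inside a
# model replicated over multiple GPUs via nn.DataParallel.
model = nn.DataParallel(nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    ScaleLayer(16),
    nn.ReLU(),
).cuda())

out = model(torch.randn(8, 3, 32, 32).cuda())
out.sum().backward()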

Some additional information: all our experiments are on Ubuntu 16.04 using CUDA 8, CUDNN 7 and TITAN X and Xp GPUs.

@ducksoup Almost exactly the same conditions as ours.

I encountered the same error after several ImageNet epochs on a NASNet after updating to 0.3.0. PyTorch 0.2.0 works well.

I am seeing this too. I have two runs of the same model with different hyperparameters; one is running smoothly, while this one quit with the exact same error:

  avg_loss = self.train_epoch(training_data,validation_data,batch_size,save_prefix,epoch)
  File "TourQue_v1.py", line 759, in train_epoch
    gradients = loss.backward()
  File "/u/dcontrac/anaconda2/lib/python2.7/site-packages/torch/autograd/variable.py", line 167, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
  File "/u/dcontrac/anaconda2/lib/python2.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    variables, grad_variables, retain_graph)
RuntimeError: torch/csrc/autograd/input_buffer.cpp:14: add: Assertion pos >= 0 && pos < buffer.size() failed.

Running PyTorch v0.3 and training on 4 K-80 GPUs. My other run with different parameters is on exactly the same setup and is still running. The two runs differ only in the shuffle of the data and the batch size (the one that failed had a slightly larger batch size).

@ducksoup: I'm not using any user-defined autograd.Function but I see the error too, so maybe that's not the primary place with the bug. I am using nn.DataParallel, though.

@danishcontractor I'll try again to train a network using standard layers only, maybe I'll also encounter the error in the no-custom-functions case if I let it run a little bit longer.
If this is the case, then I guess we have some strong indication that nn.DataParallel (or multi-GPU computation in general) could indeed be the source of the error!

@ducksoup: I don't think the duration has anything to do with it. One model run has been running without a problem for the last 18 hours; the other one quit pretty early on last night (it crashed after the backward pass of the 8th batch). I have just restarted it and am watching it.

@danishcontractor I agree that the bug is probably not related to training length. My guess is that each backward pass has a certain probability of randomly failing with this assert, so running for longer just increases the overall chances of encountering the bug.

Thanks for the information. We'd really like to fix this bug, and so there are a few things that would help us out a lot (in order of difficulty):

  1. Can you run your training process with gdb attached and a catch throw breakpoint set, so that we can get a backtrace when the assertion fails? Also, if you can go one frame up and print the name of the function (run p fn.name() in gdb), that would be helpful. We basically want to know whether it's a C++ function or a Python function.

  2. As soon as you can publish your latest layers, upload a script that reproduces the problem (even if it takes a while to reproduce); we would be happy to run the script to try to reproduce it ourselves, which would be very helpful for us.

We'd love a bisection, but given that it can sometimes take 18hrs+ to repro, it doesn't sound like that would be easy to do.

@ezyang This is the backtrace:

#0  0x00007fffb4eac8bd in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#1  0x00007fffd2af56ab in torch::barf (fmt=fmt@entry=0x7fffd3a77bc0 "%s:%u: %s: Assertion `%s` failed.")
    at torch/csrc/assertions.cpp:18
#2  0x00007fffd2bcddad in torch::autograd::InputBuffer::add (this=this@entry=0x7fff4affcb00, pos=pos@entry=0, var=...)
    at torch/csrc/autograd/input_buffer.cpp:14
#3  0x00007fffd2bbf5ea in torch::autograd::Engine::evaluate_function (this=this@entry=0x7fffd3fc9ec0 <engine>, task=...)
    at torch/csrc/autograd/engine.cpp:268
#4  0x00007fffd2bc0b18 in torch::autograd::Engine::thread_main (this=0x7fffd3fc9ec0 <engine>, graph_task=0x0)
    at torch/csrc/autograd/engine.cpp:144
#5  0x00007fffd2bbd6a2 in torch::autograd::Engine::thread_init (this=this@entry=0x7fffd3fc9ec0 <engine>, device=device@entry=3)
    at torch/csrc/autograd/engine.cpp:121
#6  0x00007fffd2be18ba in torch::autograd::python::PythonEngine::thread_init (this=0x7fffd3fc9ec0 <engine>, device=3)
    at torch/csrc/autograd/python_engine.cpp:28
#7  0x00007fffb4ed7c80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007ffff7bc16ba in start_thread (arg=0x7fff4affd700) at pthread_create.c:333
#9  0x00007ffff78f73dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

For the function name I'll probably have to recompile pytorch in debug mode, since p fn.name() in gdb returns "value has been optimized out".

Edit - I managed to get the name of the function by catching the exception raised by the assert: N5torch8autograd12ConvBackwardE.

Edit 2 - I did a second run and the name of the function is still the same.

Actually I just realized that what we really need is next_fn.name() 😕 Can you please try printing that (might require a debug build as well)?

@apaszke Printing next_fn.name() instead of fn.name() I get InPlaceABNBackward, which is the name of our custom function's backward pass.

More importantly, our code and paper are now public so I can share this with you:
https://github.com/mapillary/pytorch_bugreport

As I mentioned in a previous comment, this code is just the Pytorch ImageNet + ResNet examples, except that standard BN is replaced with our custom layer.
When running this on our machines with 4 GPUs we always encounter the assert within the first epoch.
Please feel free to ask if you need help running the code.
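
If you just want to see the shape of the modification without digging through the repo, the pattern is roughly the following (CustomABN here is only a placeholder module standing in for our InPlaceABN layer, not the actual implementation, which uses a custom autograd.Function):

import torch.nn as nn
import torchvision.models as models

# Placeholder for the custom layer that replaces standard batch norm.
class CustomABN(nn.Module):
    def __init__(self, num_features):
        super(CustomABN, self).__init__()
        self.bn = nn.BatchNorm2d(num_features)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(x))

# Recursively swap every BatchNorm2d in the torchvision ResNet for the custom
# layer, then wrap the model in DataParallel as in the ImageNet example.
def replace_bn(module):
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            setattr(module, name, CustomABN(child.num_features))
        else:
            replace_bn(child)

model = models.resnet50()
replace_bn(model)
model = nn.DataParallel(model.cuda())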

I also encountered this assert.
I have two machines, one with 2 GPUs and the other with 4.
Both machines run Ubuntu 16.04, python 2.7, cuda 9.0, cudnn 7, and pytorch v0.3.0.
I never encountered this assert on my 2-GPU machine.
The assert always appeared on my 4-GPU machine.

Same here. I have python 2.7, cuda 8.0, cudnn 7 and pytorch v0.3.0 installed from binaries. The assertion was triggered when I was training networks on 4 GPUs. It happened at a random time during training.

Same here, with several pixel-wise prediction models built from plain CNNs. It seems to occur only with multiple GPUs; it has not happened so far with a single GPU.

python 2.7, CUDA 8.0, CUDNN 7, PyTorch v0.3.0.

@zou3519 can you take a look at this asap

edit: resolved, I ended up adding a compilation flag for compute capability 3.5 to pytorch_bugreport's build scripts (to work with my GPUs)

@ducksoup I'm getting the following after trying to run pytorch_bugreport. Have you seen this before?
[screenshot of the error]

@zou3519 you have to change https://github.com/mapillary/pytorch_bugreport/blob/master/modules/build.sh#L5-L7 to cover your GPU, which is compute_35,code=sm_35 I think (if it is a K40)

@ducksoup I ran your code for six epochs on a machine with 4 GPUs on pytorch 0.3.0 but haven't been able to trigger the assert. I'll try letting it run for longer and see if anything pops up.

@zou3519 For us the assert almost always triggers during the first epoch. There might be some important difference in our setups. This is the exact sequence of steps we have used to reproduce the error across several machines:

  1. Start from a clean Ubuntu 16.04 installation with CUDA 8.0, CUDNN 7.0, pip, virtualenv and the compilers.
  2. Create and activate a new python2 virtualenv.
  3. Install the necessary python packages: pip install pyyaml numpy cffi torchvision.
  4. Clone the pytorch repository and check out v0.3.0.
  5. Compile and install with python setup.py install
  6. Clone our repository and compile our native module with cd modules && sh build.sh && python build.py
  7. Run the training as described before.

Our machines all have either Titan X or Titan Xp GPUs, so a different architecture from the one you are using; I don't know whether this could have an influence on the error.

My machine, which triggers the error, has 4 GTX 1080 Ti GPUs.

We (finally) found a different repro that triggers this issue internally on one of our machines; a fix will be issued soon.

Quick update: we're still looking into this, but so far I've been able to get this bug to trigger on two machines using python 2.7 but never with python 3. If you're in need of a quick fix, using python 3 might help.

I also encountered this problem, with pytorch 0.3.0, python 2.7, CUDA 8.0 and Titan X GPUs.
My session always crashed after around 8 epochs.
I downgraded pytorch to 0.2.0 and the training now seems to work well. Anyone in a hurry can try this.

Thank you @zou3519 , I'll try to switch to python 3 for now!

@ducksoup let me know if the bug comes back in python 3. As far as I can tell the python version shouldn't matter in the triggering of this bug... but it's never happened to me with python 3 yet.

Crashed in Python 2.7:
RuntimeError: torch/csrc/autograd/input_buffer.cpp:14: add: Assertion pos >= 0 && pos < buffer.size() failed

python 2.7
pytorch: 0.3.post4

Absolutely random crashes when using multiple GPUs.

@ShomyLiu This has been fixed in HEAD but there has not been a release of 0.3 with the fix yet.

@ezyang Oh, thanks for your reply. I will try the HEAD version. Thank you.

I have encountered this error, but changing pytorch to the latest version (0.4) solved it.

I encountered the same error when I added a dropout layer to a ResNet that I had modified, trained on multiple GPUs. Does the latest conda release fix the problem?

@kkjh0723 It is not yet in the latest conda release. To get the fix at the moment you will have to install pytorch from source: https://github.com/pytorch/pytorch#from-source

The same bug!
[screenshot of the error]

@PkuRainBow did you install pytorch from source?

Is there a simple way to avoid this error without changing the version of python or pytorch?

You could disable the python reference cycle garbage collector. This would probably lead to memory leaks though.

It should be safe if you do this once all the graphs are deleted, and there will be no leaks unless you create cycles in your own data structures.
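
Concretely, a minimal sketch of that workaround using the standard gc module (where exactly you call this in your training script is up to you):

import gc

# ... build the model and run at least one forward/backward pass, making sure
# any autograd graphs created so far have been freed ...

gc.collect()   # collect whatever reference cycles already exist
gc.disable()   # turn off the cycle collector for the rest of training

# Objects without reference cycles are still freed immediately by reference
# counting, so this should not leak unless your own code creates cycles.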

After 2 days of struggle, I finally changed back to 0.2.0 and am waiting for the binary release of 0.4.0. Anyway, thank you for your advice. :)

I encountered the same error with version 0.3.0.post4. What was the proper fix here?

@FangMath upgrade to version 0.3.1 or newer.