pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration

Home Page: https://pytorch.org

Assertion `pos >= 0 && pos < buffer.size()` failed

kexul opened this issue

My code runs well for a while, then this error occurs randomly:

Traceback (most recent call last):
  File "train.py", line 236, in <module>
    train_vgg("/mnt/new_disk/subset/models/accu_distance_metric_243.pth",dset='luna16')
  File "train.py", line 54, in train_vgg
    loss_value.backward()
  File "/home/uih/anaconda2/lib/python2.7/site-packages/torch/autograd/variable.py", line 169, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
  File "/home/uih/anaconda2/lib/python2.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    variables, grad_variables, retain_graph)
RuntimeError: torch/csrc/autograd/input_buffer.cpp:14: add: Assertion "pos >= 0 && pos < buffer.size()" failed.
Any idea what's happened? Thanks in advance!

Yes, it looks like we have a bug somewhere in the code. Can you please try to isolate a small snippet that still triggers the same error so we can look into it? Thank you!

@apaszke It's hard for me to isolate. The error occurs when I run training and validation in turn; if I just do either training or validation, everything runs well.

The problem is that we won't be able to fix it until we can actually see under what conditions the error happens.

@soumith I got the same error in v0.3.0 today. When I trained my model with a single GPU, it worked perfectly; with multiple GPUs the error occurred:
RuntimeError: torch/csrc/autograd/input_buffer.cpp:14: add: Assertion pos >= 0 && pos < buffer.size() failed.

@soumith We've also been struggling with this issue for a long time. Unfortunately, finding which part of our code specifically triggers the assert (which used to be a segfault before #3466) has proven extremely hard and, as of now, unsuccessful. In fact, we never submitted a bug report because we couldn't create a self-contained piece of code to trigger it.

I'll try to summarize all the information we've gathered:

  • We know for a fact that the issue appeared sometime after this commit: cd9b272. All our code works perfectly fine on cd9b272 or v0.2.
  • The assert is always triggered during the backward pass, but the exact moment when it happens seems to be completely random: when training on ImageNet we've seen it after 50 iterations as well as after 5000. This makes it super-hard to tell whether a particular piece of code is triggering the assert or not, as you can never be sure how many iterations are enough to rule out that it will ever trigger!
  • The issue seems to be somehow related to the use of multiple GPUs (via nn.DataParallel): we've never observed it while training with a single GPU.
  • The issue seems also to be related to the use of user-defined autograd.Function classes: we've never seen it when training networks composed of "standard" Pytorch functions / modules only...
  • ... however, simply running forward / backward passes with our custom functions doesn't seem to be enough to trigger the assertion. In our experience, working with "complex" computational graphs increases the likelihood of encountering this issue.

Recently we've been able to re-create the issue using https://github.com/pytorch/examples/tree/master/imagenet and a modified version of https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py, where we replace some of the layers with our custom-made ones. Unfortunately I can't share the code for now (our paper describing the new layers is still not out), but I'll come back to you as soon as I can!
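
To make the conditions above a bit more concrete, here is a minimal, made-up sketch of the kind of setup we are describing: a custom autograd.Function used inside a model wrapped in nn.DataParallel. The ScaleFn / ScaleLayer names are placeholders for illustration only (they are not our real layers, and the snippet is written against the current Function API rather than the exact 0.3-era code):

import torch
import torch.nn as nn
from torch.autograd import Function

# Toy custom function standing in for our real layers (placeholder only):
# y = x scaled by a per-channel weight.
class ScaleFn(Function):
    @staticmethod
    def forward(ctx, x, weight):
        ctx.save_for_backward(x, weight)
        return x * weight.view(1, -1, 1, 1)

    @staticmethod
    def backward(ctx, grad_out):
        x, weight = ctx.saved_tensors
        grad_x = grad_out * weight.view(1, -1, 1, 1)
        grad_w = (grad_out * x).sum(0).sum(-1).sum(-1)
        return grad_x, grad_w

# nn.Module wrapper so the custom function can be used inside a network.
class ScaleLayer(nn.Module):
    def __init__(self, channels):
        super(ScaleLayer, self).__init__()
        self.weight = nn.Parameter(torch.ones(channels))

    def forward(self, x):
        return ScaleFn.apply(x, self.weight)

# The configuration under which we see the assert: custom layers inside a
# model replicated over multiple GPUs via nn.DataParallel.
model = nn.DataParallel(nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    ScaleLayer(16),
    nn.ReLU(),
).cuda())

out = model(torch.randn(8, 3, 32, 32).cuda())
out.sum().backward()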

Some additional information: all our experiments are on Ubuntu 16.04 using CUDA 8, CUDNN 7 and TITAN X and Xp GPUs.

@ducksoup Almost exactly the same conditions as ours.

I encountered the same error after several ImageNet epochs on a NASNet after updating to 0.3.0. PyTorch 0.2.0 works well.

I am seeing this too. I have two runs of the same model with different hyperparameters; one is running smoothly, while this one quit with the exact same error:

  avg_loss = self.train_epoch(training_data,validation_data,batch_size,save_prefix,epoch)
  File "TourQue_v1.py", line 759, in train_epoch
    gradients = loss.backward()
  File "/u/dcontrac/anaconda2/lib/python2.7/site-packages/torch/autograd/variable.py", line 167, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
  File "/u/dcontrac/anaconda2/lib/python2.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    variables, grad_variables, retain_graph)
RuntimeError: torch/csrc/autograd/input_buffer.cpp:14: add: Assertion pos >= 0 && pos < buffer.size() failed.

Running PyTorch v0.3 and training on 4 K-80 GPUs. My other run with different parameters is on exactly the same setup and is still running. The two runs differ only in the shuffle of the data and the batch size (the one that failed had a slightly larger batch size).

@ducksoup: I'm not using any user-defined autograd.Function but I see the error too, so maybe that's not the primary place with the bug. I am using nn.DataParallel, though.

@danishcontractor I'll try again to train a network using standard layers only, maybe I'll also encounter the error in the no-custom-functions case if I let it run a little bit longer.
If this is the case, then I guess we have some strong indication that nn.DataParallel (or multi-GPU computation in general) could indeed be the source of the error!

@ducksoup: I don't think the duration has anything to do with it. One model run has been running without a problem for the last 18 hours; the other one quit pretty early on last night (it crashed after the backward pass of the 8th batch). I have just restarted it and am watching it.

@danishcontractor I agree that the bug is probably not related to training length. My guess is that each backward pass has a certain probability of randomly failing with this assert, so running for longer just increases the overall chances of encountering the bug.

Thanks for the information. We'd really like to fix this bug, and so there are a few things that would help us out a lot (in order of difficulty):

  1. Can you run your training process with gdb attached and a catch throw breakpoint set, so that we can get a backtrace when the assertion fails? Also, if you can go one frame up and print the name of the function (run p fn.name() in gdb), that would be helpful. We basically want to know whether it's a C++ function or a Python function.

  2. As soon as you can publish your latest layers, upload a script that reproduces the problem (even if it takes a while to reproduce); we would be happy to run the script to try to reproduce it ourselves, which would be very helpful for us.

We'd love a bisection, but given that it can sometimes take 18hrs+ to repro, it doesn't sound like that would be easy to do.

@ezyang This is the backtrace:

#0  0x00007fffb4eac8bd in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#1  0x00007fffd2af56ab in torch::barf (fmt=fmt@entry=0x7fffd3a77bc0 "%s:%u: %s: Assertion `%s` failed.")
    at torch/csrc/assertions.cpp:18
#2  0x00007fffd2bcddad in torch::autograd::InputBuffer::add (this=this@entry=0x7fff4affcb00, pos=pos@entry=0, var=...)
    at torch/csrc/autograd/input_buffer.cpp:14
#3  0x00007fffd2bbf5ea in torch::autograd::Engine::evaluate_function (this=this@entry=0x7fffd3fc9ec0 <engine>, task=...)
    at torch/csrc/autograd/engine.cpp:268
#4  0x00007fffd2bc0b18 in torch::autograd::Engine::thread_main (this=0x7fffd3fc9ec0 <engine>, graph_task=0x0)
    at torch/csrc/autograd/engine.cpp:144
#5  0x00007fffd2bbd6a2 in torch::autograd::Engine::thread_init (this=this@entry=0x7fffd3fc9ec0 <engine>, device=device@entry=3)
    at torch/csrc/autograd/engine.cpp:121
#6  0x00007fffd2be18ba in torch::autograd::python::PythonEngine::thread_init (this=0x7fffd3fc9ec0 <engine>, device=3)
    at torch/csrc/autograd/python_engine.cpp:28
#7  0x00007fffb4ed7c80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007ffff7bc16ba in start_thread (arg=0x7fff4affd700) at pthread_create.c:333
#9  0x00007ffff78f73dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

For the function name I'll probably have to recompile pytorch in debug mode, since p fn.name() in gdb returns "value has been optimized out".

Edit - I managed to get the name of the function by catching the exception raised by the assert: N5torch8autograd12ConvBackwardE.

Edit 2 - I did a second run and the name of the function is still the same.

Actually I just realized that what we really need is next_fn.name() 😕 Can you please try printing that (might require a debug build as well)?

@apaszke Printing next_fn.name() instead of fn.name() I get InPlaceABNBackward, which is the name of our custom function's backward pass.

More importantly, our code and paper are now public so I can share this with you:
https://github.com/mapillary/pytorch_bugreport

As I mentioned in a previous comment, this code is just the Pytorch ImageNet + ResNet examples, except that standard BN is replaced with our custom layer.
When running this on our machines with 4 GPUs we always encounter the assert within the first epoch.
Please feel free to ask if you need help running the code.
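
If you just want to see the shape of the modification without digging through the repo, the pattern is roughly the following (CustomABN here is only a placeholder module standing in for our InPlaceABN layer, not the actual implementation, which uses a custom autograd.Function):

import torch.nn as nn
import torchvision.models as models

# Placeholder for the custom layer that replaces standard batch norm.
class CustomABN(nn.Module):
    def __init__(self, num_features):
        super(CustomABN, self).__init__()
        self.bn = nn.BatchNorm2d(num_features)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(x))

# Recursively swap every BatchNorm2d in the torchvision ResNet for the custom
# layer, then wrap the model in DataParallel as in the ImageNet example.
def replace_bn(module):
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            setattr(module, name, CustomABN(child.num_features))
        else:
            replace_bn(child)

model = models.resnet50()
replace_bn(model)
model = nn.DataParallel(model.cuda())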

I also encountered this assert.
I have two machines, one with 2 GPUs and the other with 4.
Both machines run Ubuntu 16.04, python 2.7, cuda 9.0, cudnn 7, and pytorch v0.3.0.
I never encountered this assert on my 2-GPU machine.
The assert always appeared on my 4-GPU machine.

Same here. I have python 2.7, cuda 8.0, cudnn 7 and pytorch v0.3.0 installed from binaries. The assertion was triggered when I was training networks on 4 GPUs. It happened at a random time during training.

Same here, with several pixel-wise prediction models built from plain CNNs. It seems to occur only with multiple GPUs; it has not happened so far with a single GPU.

python 2.7, CUDA 8.0, CUDNN 7, PyTorch v0.3.0.

@zou3519 can you take a look at this asap

edit: resolved, I ended up adding a compilation flag for compute capability 3.5 to pytorch_bugreport's build scripts (to work with my GPUs)

@ducksoup I'm getting the following after trying to run pytorch_bugreport. Have you seen this before?
[screenshot of the error]

@zou3519 you have to change https://github.com/mapillary/pytorch_bugreport/blob/master/modules/build.sh#L5-L7 to cover your GPU, which is compute_35,code=sm_35 I think (if it is a K40)

@ducksoup I ran your code for six epochs on a machine with 4 GPUs on pytorch 0.3.0 but haven't been able to trigger the assert. I'll try letting it run for longer and see if anything pops up.

@zou3519 For us the assert almost always triggers during the first epoch. There might be some important difference in our setups. This is the exact sequence of steps we have used to reproduce the error across several machines:

  1. Start from a clean Ubuntu 16.04 installation with CUDA 8.0, CUDNN 7.0, pip, virtualenv and the compilers.
  2. Create and activate a new python2 virtualenv.
  3. Install the necessary python packages: pip install pyyaml numpy cffi torchvision.
  4. Clone the pytorch repository and check out v0.3.0.
  5. Compile and install with python setup.py install
  6. Clone our repository and compile our native module with cd modules && sh build.sh && python build.py
  7. Run the training as described before.

Our machines all have either Titan X or Titan Xp GPUs, so a different architecture from the one you are using; I don't know whether this could have an influence on the error.

My machine, which triggers the error, has 4 GTX 1080 Ti GPUs.

We (finally) found a different repro that triggers this issue internally on one of our machines; a fix will be issued soon.

Quick update: we're still looking into this, but so far I've been able to get this bug to trigger on two machines using python 2.7 but never with python 3. If you're in need of a quick fix, using python 3 might help.

I also encountered this problem, with pytorch 0.3.0, python 2.7, CUDA 8.0 and Titan X GPUs.
My session always crashed after around 8 epochs.
I downgraded pytorch to 0.2.0 and the training now seems to work well. Anyone in a hurry can try this.

Thank you @zou3519 , I'll try to switch to python 3 for now!

@ducksoup let me know if the bug comes back in python 3. As far as I can tell the python version shouldn't matter in the triggering of this bug... but it's never happened to me with python 3 yet.

Crashed in Python 2.7:
RuntimeError: torch/csrc/autograd/input_buffer.cpp:14: add: Assertion pos >= 0 && pos < buffer.size() failed

python 2.7
pytorch: 0.3.post4

Absolutely random crashes when using multiple GPUs.

@ShomyLiu This has been fixed in HEAD but there has not been a release of 0.3 with the fix yet.

@ezyang Oh, thanks for your reply. I will try the HEAD version. Thank you.

I have encountered this error, but changing pytorch to the latest version (0.4) solved it.

I encountered the same error when I added a dropout layer to a ResNet that I had modified, trained on multiple GPUs. Does the latest conda release fix the problem?

@kkjh0723 It is not yet in the latest conda release. To get the fix at the moment you will have to install pytorch from source: https://github.com/pytorch/pytorch#from-source

The same bug!
[screenshot of the error]

@PkuRainBow did you install pytorch from source?

Is there a simple way to avoid this error without changing the version of python or pytorch?

You could disable the python reference cycle garbage collector. This would probably lead to memory leaks though.

It should be safe if you do this once all the graphs are deleted, and there will be no leaks unless you create cycles in your own data structures.
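
Concretely, a minimal sketch of that workaround using the standard gc module (where exactly you call this in your training script is up to you):

import gc

# ... build the model and run at least one forward/backward pass, making sure
# any autograd graphs created so far have been freed ...

gc.collect()   # collect whatever reference cycles already exist
gc.disable()   # turn off the cycle collector for the rest of training

# Objects without reference cycles are still freed immediately by reference
# counting, so this should not leak unless your own code creates cycles.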

After 2 days of struggle, I finally changed back to 0.2.0 and am waiting for the binary release of 0.4.0. Anyway, thank you for your advice. :)

I encountered the same error with version 0.3.0.post4. What was the proper fix here?

@FangMath upgrade to version 0.3.1 or newer.