pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)

Home Page: https://pytorch.org/xla

Race condition on GPU device

ymwangg opened this issue · comments

πŸ› Bug

We found there is likely a race condition on the XLA:GPU device while debugging the ViT model. I suspect it is between the GPU main stream execution and the BFC allocator's asynchronous deallocation.

Previously we reported a race condition between the NCCL async communication stream and the BFC allocator's asynchronous deallocation (tensorflow/tensorflow#58022), but now it seems the main stream has the same issue.
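
To make the suspected interleaving concrete, the sketch below uses plain CUDA runtime calls (not torch_xla or TensorFlow code; all names are made up) to show how a host-side free that does not wait for the main stream lets another stream overwrite memory that enqueued work still has to read:

  // Standalone illustration of the hazard; illustrative names only.
  #include <cuda_runtime.h>

  int main() {
    cudaStream_t main_stream, other_stream;
    cudaStreamCreate(&main_stream);
    cudaStreamCreate(&other_stream);

    const size_t kBytes = 1 << 20;
    float *buf = nullptr, *host_out = nullptr;
    cudaMalloc(&buf, kBytes);
    cudaMallocHost(&host_out, kBytes);

    // 1. Enqueue work on the main stream that READS `buf` (stands in for a
    //    ViT op consuming an activation tensor).
    cudaMemcpyAsync(host_out, buf, kBytes, cudaMemcpyDeviceToHost, main_stream);

    // 2. Host-side "deallocation": an allocator that allows asynchronous
    //    deallocation marks the chunk free immediately, without waiting for
    //    the main stream to drain. (The real allocator would return the
    //    chunk to its free list here.)

    // 3. The recycled chunk is handed out again and WRITTEN from a different
    //    stream before step 1 has actually executed on the GPU, so the
    //    pending read can observe the new bytes: a data race.
    cudaMemsetAsync(buf, 0xFF, kBytes, other_stream);

    cudaStreamSynchronize(main_stream);
    cudaStreamSynchronize(other_stream);
    cudaFreeHost(host_out);
    cudaFree(buf);
    return 0;
  }

In the real workload the second writer would be whatever later computation is handed the recycled chunk; this would also explain why CUDA_LAUNCH_BLOCKING=1 (see the workarounds below), which keeps the host from running ahead of GPU execution, hides the NaNs.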

To Reproduce

Patch the test/test_train_mp_imagenet.py file to support the ViT model:

diff --git a/test/test_train_mp_imagenet.py b/test/test_train_mp_imagenet.py
index 7a7a1300..9cb75252 100644
--- a/test/test_train_mp_imagenet.py
+++ b/test/test_train_mp_imagenet.py
@@ -4,7 +4,7 @@ SUPPORTED_MODELS = [
     'alexnet', 'densenet121', 'densenet161', 'densenet169', 'densenet201',
     'inception_v3', 'resnet101', 'resnet152', 'resnet18', 'resnet34',
     'resnet50', 'squeezenet1_0', 'squeezenet1_1', 'vgg11', 'vgg11_bn', 'vgg13',
-    'vgg13_bn', 'vgg16', 'vgg16_bn', 'vgg19', 'vgg19_bn'
+    'vgg13_bn', 'vgg16', 'vgg16_bn', 'vgg19', 'vgg19_bn', 'vit_b_16',
 ]
 
 MODEL_OPTS = {

Run training with the ImageNet dataset on a single GPU:

 GPU_NUM_DEVICES=1 python test_train_mp_imagenet.py --model vit_b_16 --datadir /dataset/ --batch_size 128

Typically a NaN loss appears within the first 100 steps.

Either of the following changes fixes the NaN loss issue:

  1. Set CUDA_LAUNCH_BLOCKING=1 (a strong hint that a race condition is involved).
  2. Make AllowsAsynchronousDeallocation() return false for the BFC allocator (a stand-in sketch of this change follows the list).
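
For reference, workaround 2 boils down to making the AllowsAsynchronousDeallocation() override quoted under "Additional context" below return false. The sketch here only illustrates the pattern: DeviceAllocatorBase and SyncDeallocAllocator are hypothetical stand-ins defined locally so the example compiles on its own, not TensorFlow's actual classes.

  // Stand-in sketch only; the real method is the TensorFlow override quoted
  // under "Additional context" below.
  #include <cstddef>

  class DeviceAllocatorBase {
   public:
    virtual ~DeviceAllocatorBase() = default;
    virtual void* Allocate(std::size_t bytes) = 0;
    virtual void Deallocate(void* ptr) = 0;
    // true  => callers may free host-side bookkeeping before the GPU work
    //          using the memory has finished (relies on stream ordering).
    // false => deallocation must wait until the GPU is done with the memory.
    virtual bool AllowsAsynchronousDeallocation() const = 0;
  };

  // Workaround 2: wrap the real allocator and report synchronous-deallocation
  // semantics, so freed chunks are not recycled while the main stream still
  // has pending work that touches them.
  class SyncDeallocAllocator : public DeviceAllocatorBase {
   public:
    explicit SyncDeallocAllocator(DeviceAllocatorBase* wrapped)
        : wrapped_(wrapped) {}
    void* Allocate(std::size_t bytes) override {
      return wrapped_->Allocate(bytes);
    }
    void Deallocate(void* ptr) override { wrapped_->Deallocate(ptr); }
    bool AllowsAsynchronousDeallocation() const override { return false; }

   private:
    DeviceAllocatorBase* wrapped_;
  };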

Environment

  • Reproducible on XLA backend [GPU]: tested on NVIDIA V100 and T4.
  • torch_xla version: docker image gcr.io/tpu-pytorch/xla:nightly_3.7_cuda_11.2, image_id = e1d95d077920.

Additional context

It looks like the BFC allocator is designed to support asynchronous deallocation, based on this comment in TensorFlow:

  // The Tensorflow BFC allocator used on GPU allows host-side deallocation
  // before GPU execution takes place. Tensorflow uses the ordering of the main
  // compute stream to enforce a happens-before relationship between a memory
  // allocation and code that reuses the same memory. If Tensorflow adds
  // support for multiple GPU streams or allocators with different ordering
  // requirements, this code may need to change.
  // (This attribute has no effect on CPU.)
  bool AllowsAsynchronousDeallocation() const override { return true; }
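
The comment's reasoning is that ordering everything on the single main compute stream supplies the happens-before: the write that reuses a freed chunk is enqueued behind the work that still reads it. For contrast with the repro sketch above, here is the safe single-stream version of the same toy example (again plain CUDA runtime calls, illustrative names only):

  // Safe counterpart to the earlier sketch: everything is ordered on ONE
  // stream, so the write to the recycled memory cannot run before the read.
  #include <cuda_runtime.h>

  int main() {
    cudaStream_t main_stream;
    cudaStreamCreate(&main_stream);

    const size_t kBytes = 1 << 20;
    float *buf = nullptr, *host_out = nullptr;
    cudaMalloc(&buf, kBytes);
    cudaMallocHost(&host_out, kBytes);

    // Pending read of `buf` on the main stream.
    cudaMemcpyAsync(host_out, buf, kBytes, cudaMemcpyDeviceToHost, main_stream);

    // Host-side free + reuse is fine here because the next write to the
    // recycled memory is also enqueued on the same main stream, behind the
    // read: stream ordering gives the happens-before the comment relies on.
    cudaMemsetAsync(buf, 0xFF, kBytes, main_stream);

    cudaStreamSynchronize(main_stream);
    cudaFreeHost(host_out);
    cudaFree(buf);
    return 0;
  }

That guarantee disappears as soon as the memory is touched from a second stream, e.g. the NCCL communication stream in tensorflow/tensorflow#58022, which is consistent with the comment's caveat about adding support for multiple GPU streams.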