Race condition on GPU device
🐛 Bug
We found what is likely a race condition on the XLA:GPU device while debugging the ViT model. I suspect it is between GPU main-stream execution and the BFC allocator's asynchronous deallocation.
Previously we reported a race condition between the NCCL async communication stream and the BFC allocator's asynchronous deallocation (tensorflow/tensorflow#58022), but now it seems the main stream has the same issue.
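To make the suspected failure mode concrete, here is a minimal CUDA sketch, purely hypothetical and not taken from torch_xla or TensorFlow, of what can go wrong when a buffer that still has pending work on one stream is recycled and written from another stream without any ordering between them. It also shows why `CUDA_LAUNCH_BLOCKING=1` hides the problem: every launch then completes before the next one is issued.

```cuda
// Hypothetical, self-contained sketch (not code from torch_xla or TensorFlow)
// of the suspected failure mode: a buffer still being written by work queued
// on one stream is handed out again and written from another stream, with no
// event or synchronization ordering the two.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fill(float* buf, float value, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) buf[i] = value;
}

int main() {
  const int n = 1 << 20;
  cudaStream_t main_stream, other_stream;
  cudaStreamCreate(&main_stream);
  cudaStreamCreate(&other_stream);

  float* buf = nullptr;
  cudaMalloc((void**)&buf, n * sizeof(float));

  // Work queued on the "main" stream; the launch returns immediately and the
  // kernel has not necessarily executed yet.
  fill<<<(n + 255) / 256, 256, 0, main_stream>>>(buf, 1.0f, n);

  // An allocator that allows asynchronous deallocation would mark this block
  // free here, on the host, without waiting for main_stream. If the recycled
  // block is then written from a different stream, the two kernels race.
  fill<<<(n + 255) / 256, 256, 0, other_stream>>>(buf, 2.0f, n);

  cudaDeviceSynchronize();
  float first = 0.0f;
  cudaMemcpy(&first, buf, sizeof(float), cudaMemcpyDeviceToHost);
  // With CUDA_LAUNCH_BLOCKING=1 each launch completes before the next one is
  // issued, so the result is deterministic; otherwise it depends on scheduling.
  printf("buf[0] = %f\n", first);

  cudaFree(buf);
  return 0;
}
```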
To Reproduce
Patch the test/test_train_mp_imagenet.py file to support the ViT model:
```diff
diff --git a/test/test_train_mp_imagenet.py b/test/test_train_mp_imagenet.py
index 7a7a1300..9cb75252 100644
--- a/test/test_train_mp_imagenet.py
+++ b/test/test_train_mp_imagenet.py
@@ -4,7 +4,7 @@ SUPPORTED_MODELS = [
     'alexnet', 'densenet121', 'densenet161', 'densenet169', 'densenet201',
     'inception_v3', 'resnet101', 'resnet152', 'resnet18', 'resnet34',
     'resnet50', 'squeezenet1_0', 'squeezenet1_1', 'vgg11', 'vgg11_bn', 'vgg13',
-    'vgg13_bn', 'vgg16', 'vgg16_bn', 'vgg19', 'vgg19_bn'
+    'vgg13_bn', 'vgg16', 'vgg16_bn', 'vgg19', 'vgg19_bn', 'vit_b_16',
 ]
 MODEL_OPTS = {
```
Run training with the ImageNet dataset on a single GPU:

```shell
GPU_NUM_DEVICES=1 python test_train_mp_imagenet.py --model vit_b_16 --datadir /dataset/ --batch_size 128
```
Typically a NaN loss appears within the first 100 steps.
The following methods can fix the NaN loss issue:
- Set `CUDA_LAUNCH_BLOCKING=1` (a good indication that there is a race condition); see the example command below.
- Set `AllowsAsynchronousDeallocation()=false` for the BFC allocator.
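For example, the first workaround is just the reproduction command above with the environment variable prepended:

```shell
CUDA_LAUNCH_BLOCKING=1 GPU_NUM_DEVICES=1 python test_train_mp_imagenet.py --model vit_b_16 --datadir /dataset/ --batch_size 128
```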
Environment
- Reproducible on XLA backend [GPU]: tested on NVIDIA V100 and T4.
- torch_xla version: docker image `gcr.io/tpu-pytorch/xla:nightly_3.7_cuda_11.2`, image_id = `e1d95d077920`.
Additional context
It looks like the BFC allocator is designed to support asynchronous deallocation, based on this comment in TensorFlow (a toy sketch of what the flag implies for callers follows the quoted code):
```cpp
// The Tensorflow BFC allocator used on GPU allows host-side deallocation
// before GPU execution takes place. Tensorflow uses the ordering of the main
// compute stream to enforce a happens-before relationship between a memory
// allocation and code that reuses the same memory. If Tensorflow adds
// support for multiple GPU streams or allocators with different ordering
// requirements, this code may need to change.
// (This attribute has no effect on CPU.)
bool AllowsAsynchronousDeallocation() const override { return true; }
```
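As a toy illustration, and not the actual XLA/TensorFlow implementation, the sketch below shows roughly what disabling asynchronous deallocation amounts to on the caller's side: an explicit happens-before edge (here a stream synchronization) before a block can be handed out again, which removes the race at the cost of extra host-device synchronization.

```cuda
// Toy sketch (assumption: not the real BFC allocator) of the difference the
// flag makes. When asynchronous deallocation is NOT allowed, the caller must
// wait for pending work on the stream before the memory can be reused.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void touch(char* p) { p[0] = 1; }

void Deallocate(char* p, cudaStream_t stream, bool allows_async_dealloc) {
  if (!allows_async_dealloc) {
    // Explicit happens-before: everything previously queued on `stream` that
    // may still use `p` finishes before the block becomes available again.
    cudaStreamSynchronize(stream);
  }
  // Otherwise the block is released immediately on the host, relying only on
  // main-stream ordering to keep later users of the same memory safe.
  cudaFree(p);  // stand-in for "return the block to the BFC pool"
}

int main() {
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  char* p = nullptr;
  cudaMalloc((void**)&p, 1);
  touch<<<1, 1, 0, stream>>>(p);

  // Passing /*allows_async_dealloc=*/true here would model the current
  // behavior: the block would be treated as reusable before the kernel above
  // is known to have finished.
  Deallocate(p, stream, /*allows_async_dealloc=*/false);

  printf("done\n");
  return 0;
}
```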