facebookresearch / Detectron

FAIR's research platform for object detection research, implementing popular algorithms like Mask R-CNN and RetinaNet.


multi-GPU training throws an illegal memory access

zdwong opened this issue


When I use one GPU to train, there is no problem. But when I use two or four GPUs, the problem comes up. The log output:

terminate called after throwing an instance of 'caffe2::EnforceNotMet'
what(): [enforce fail at context_gpu.h:170] . Encountered CUDA error: an illegal memory access was encountered Error from operator:
input: "gpu_0/rpn_cls_logits_fpn2_w_grad" input: "gpu_1/rpn_cls_logits_fpn2_w_grad" output: "gpu_0/rpn_cls_logits_fpn2_w_grad" name: "" type: "Add" device_option { device_type: 1 cuda_gpu_id: 0 }
*** Aborted at 1516866180 (unix time) try "date -d @1516866180" if you are using GNU date ***
terminate called recursively
terminate called recursively
terminate called recursively
PC: @ 0x7ff67559f428 gsignal
terminate called recursively
terminate called recursively
E0125 07:43:00.745853 55683 pybind_state.h:422] Exception encountered running PythonOp function: RuntimeError: [enforce fail at context_gpu.h:307] error == cudaSuccess. 77 vs 0. Error at: /mnt/hzhida/project/caffe2/caffe2/core/context_gpu.h:307: an illegal memory access was encountered

At:
/mnt/hzhida/facebook/detectron/lib/ops/generate_proposals.py(101): forward
*** SIGABRT (@0x3e80000d84f) received by PID 55375 (TID 0x7ff453fff700) from PID 55375; stack trace: ***
terminate called recursively
@ 0x7ff675945390 (unknown)
@ 0x7ff67559f428 gsignal
@ 0x7ff6755a102a abort
@ 0x7ff66f37e84d __gnu_cxx::__verbose_terminate_handler()
@ 0x7ff66f37c6b6 (unknown)
@ 0x7ff66f37c701 std::terminate()
@ 0x7ff66f3a7d38 (unknown)
@ 0x7ff67593b6ba start_thread
@ 0x7ff67567141d clone
@ 0x0 (unknown)
Aborted (core dumped)

I got the same error. The difference is that when I use one GPU or two GPUs, there is no problem. But when using 4 GPUs to train Mask R-CNN (mask_rcnn_R-101-FPN) or RetinaNet (retinanet_R-101-FPN), the same problem occurs.


I have the same problem when I train the tutorial_Res50 network with two or more GPUs.

Encountered same issue when specifying GPU ids (i.e. different from lowest ids, e.g. '1,3,5,7' for 4 GPUs). If lowest GPU ids are specified, training goes on fine.

@jwnsu: we're working on a fix so that when CUDA_VISIBLE_DEVICES does not use the lowest ids training still works. Thanks for reporting and diagnosing.

Hi @jwnsu, @coolbrain, @tshizys, @lwher: we are unable to reproduce this issue on our side.

Can you each provide some more information that might reveal a common pattern?

In particular:

  • Operating system: ?
  • Compiler version: ?
  • CUDA version: ?
  • cuDNN version: ?
  • NVIDIA driver version: ?
  • GPU models (for all devices if they are not all the same): ?
  • Anything else that seems relevant: ?

Here's what we see when training, for example, with GPU ids 1,3,5,7:

CUDA_VISIBLE_DEVICES=1,3,5,7 python2 tools/train_net.py --cfg configs/12_2017_baselines/e2e_faster_rcnn_R-50-FPN_1x.yaml OUTPUT_DIR /tmp/dbg-cvd-train TRAIN.DATASETS "('coco_2014_minival',)" NUM_GPUS 4

Every 0.1s: nvidia-smi          Fri Jan 26 09:09:26 2018

Fri Jan 26 09:09:26 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40           On   | 0000:07:00.0     Off |                  Off |
|  0%   42C    P8    17W / 250W |      0MiB / 12209MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M40           On   | 0000:08:00.0     Off |                  Off |
|  0%   51C    P0   144W / 250W |   7214MiB / 12209MiB |     46%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M40           On   | 0000:09:00.0     Off |                  Off |
|  0%   38C    P8    19W / 250W |      0MiB / 12209MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M40           On   | 0000:0A:00.0     Off |                  Off |
|  0%   52C    P0   220W / 250W |   7502MiB / 12209MiB |     38%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla M40           On   | 0000:0B:00.0     Off |                  Off |
|  0%   40C    P8    17W / 250W |      0MiB / 12209MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla M40           On   | 0000:0C:00.0     Off |                  Off |
|  0%   60C    P0    85W / 250W |   7081MiB / 12209MiB |     48%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla M40           On   | 0000:0D:00.0     Off |                  Off |
|  0%   40C    P8    20W / 250W |      0MiB / 12209MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla M40           On   | 0000:0E:00.0     Off |                  Off |
|  0%   56C    P0    81W / 250W |   7494MiB / 12209MiB |     40%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    1   2871837    C   ..............gcc-5-glibc-2.23/bin/python2.7  7210MiB |
|    3   2871837    C   ..............gcc-5-glibc-2.23/bin/python2.7  7498MiB |
|    5   2871837    C   ..............gcc-5-glibc-2.23/bin/python2.7  7077MiB |
|    7   2871837    C   ..............gcc-5-glibc-2.23/bin/python2.7  7490MiB |
+-----------------------------------------------------------------------------+

Operating system: Ubuntu 16.04
Compiler version: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0
CUDA version: 8.0
cuDNN version: v5.1
NVIDIA driver version: 384.111

nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           Off  | 00001543:00:00.0 Off |                  Off |
| N/A   42C    P0    41W / 150W |      0MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           Off  | 00003134:00:00.0 Off |                  Off |
| N/A   42C    P0    39W / 150W |      0MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M60           Off  | 00004975:00:00.0 Off |                  Off |
| N/A   38C    P0    41W / 150W |      0MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M60           Off  | 0000F3E6:00:00.0 Off |                  Off |
| N/A   38C    P0    40W / 150W |      0MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Operating system: CentOS Linux release 7.1.1503
Compiler version: gcc version 4.8.2
CUDA version: CUDA 8.0
cuDNN version: cuDNN 6.0.21
NVIDIA driver version: 375.26
GPU models: 4x GeForce GTX TITAN X (12G)

nvidia-smi:
image

When using 4 GPUs (0,1,2,3) to train Mask RCNN (e2e_mask_rcnn_R-101-FPN) , RetinaNet (retinanet_R-101-FPN) or Faster RCNN (e2e_faster_rcnn_R-50-FPN), the error “context_gpu.h:307: an illegal memory access was encountered” or “context_gpu.h:170. Encountered CUDA error: an illegal memory access was encountered Error from operator: input: "gpu_0/retnet_cls_pred_fpn3_b_grad" input: "gpu_2/retnet_cls_pred_fpn3_b_grad" output: "gpu_0/retnet_cls_pred_fpn3_b_grad" name: "" type: "Add" device_option { device_type: 1 cuda_gpu_id: 0 } ” occurs.

But using one GPU or two GPUs (0,1 or 2,3), it can be trained normally.
Thanks.

@jwnsu: looking at your error more closely ("invalid device ordinal"), it looks like you're trying to train with a config set up for 8 GPUs but restricting the process to have only access to 4 (via CUDA_VISIBLE_DEVICES). The "invalid device ordinal" error is because it's trying to create ops on devices that the process does not have access to.

@coolbrain, @tshizys: thanks for the details. What happens if you use two GPUs using ids {0,2}, {0,3}, {1,2}, or {1,3}?

@rbgirshick you are right, I picked the wrong config file (with the 8-GPU setting) yesterday. I just tried again with the right config file (4 GPUs; the error comes from GPU ids "1,2,4,5", while "0,1,2,3" works fine), and the error is now similar to what others are seeing:

I0127 09:06:48.220716 10872 context_gpu.cu:325] Total: 20748 MB
terminate called after throwing an instance of 'caffe2::EnforceNotMet'
terminate called after throwing an instance of 'caffe2::EnforceNotMet'
  what():  [enforce fail at context_gpu.h:170] . Encountered CUDA error: an illegal memory access was encountered Error from operator: 
input: "gpu_0/retnet_bbox_pred_fpn3_b_grad" input: "gpu_2/retnet_bbox_pred_fpn3_b_grad" output: "gpu_0/retnet_bbox_pred_fpn3_b_grad" name: "" type: "Add" device_option { device_type: 1 cuda_gpu_id: 0 }
  what():  [enforce fail at context_gpu.h:170] . Encountered CUDA error: an illegal memory access was encountered Error from operator: 
input: "gpu_2/retnet_cls_conv_n3_fpn3" input: "gpu_2/__m13_shared" output: "gpu_2/__m13_shared" name: "" type: "ReluGradient" arg { name: "cudnn_exhaustive_search" i: 0 } arg { name: "order" s: "NCHW" } device_option { device_type: 1 cuda_gpu_id: 2 } engine: "CUDNN" is_gradient_op: true
*** Aborted at 1517072808 (unix time) try "date -d @1517072808" if you are using GNU date ***
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
terminate called recursively
PC: @     0x7fd71f6bd428 gsignal
*** SIGABRT (@0x3e900002a18) received by PID 10776 (TID 0x7fd548e3d700) from PID 10776; stack trace: ***
    @     0x7fd71fa63390 (unknown)
    @     0x7fd71f6bd428 gsignal
    @     0x7fd71f6bf02a abort
    @     0x7fd71b51c84d __gnu_cxx::__verbose_terminate_handler()
    @     0x7fd71b51a6b6 (unknown)
    @     0x7fd71b51a701 std::terminate()
    @     0x7fd71b545d38 (unknown)
    @     0x7fd71fa596ba start_thread
    @     0x7fd71f78f41d clone
    @                0x0 (unknown)
./itrain4.sh: line 9: 10776 Aborted                 (core dumped) python2 tools/train_net.py --multi-gpu-testing --cfg configs/iret-rn50-fpn-voc.yaml OUTPUT_DIR ./output

@coolbrain, @tshizys: one shot in the dark is to switch the all-reduce implementation to nccl by passing USE_NCCL True to train_net.py, as in:

python2 tools/train_net.py --multi-gpu-testing \
  --cfg configs/getting_started/tutorial_2gpu_e2e_faster_rcnn_R-50-FPN.yaml \
  OUTPUT_DIR /tmp/output USE_NCCL True

This will require Caffe2 to have been built with nccl ops -- I'm not sure if this is done by default or will require some work to rebuild Caffe2 with nccl support.
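If you're not sure whether your Caffe2 build has the NCCL ops, one quick check (a minimal sketch, assuming workspace.RegisteredOperators() is available in your build) is to look for the NCCLAllreduce operator in the registry:

from caffe2.python import workspace
# True only if the NCCL all-reduce op was compiled into this Caffe2 build.
print('NCCLAllreduce' in workspace.RegisteredOperators())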

@rbgirshick, when using two GPUs, i.e. {0,2}, {0,3}, {1,2}, {1,3}, the error still exists. Here are the details, using {0,3} and training RetinaNet (retinanet_R-101-FPN) as an example:

F0128 12:09:08.461153 4938 context_gpu.cu:387] Error at: /home/yszhu/local/caffe2/caffe2/core/context_gpu.cu:387: an illegal memory access was encountered
*** Check failure stack trace: ***
terminate called recursively
terminate called recursively
*** Aborted at 1517112548 (unix time) try "date -d @1517112548" if you are using GNU date ***
terminate called after throwing an instance of 'caffe2::EnforceNotMet'
what(): [enforce fail at context_gpu.h:170] . Encountered CUDA error: an illegal memory access was encountered Error from operator:
input: "gpu_0/fpn_6_relu" input: "gpu_0/fpn_7_w" input: "gpu_0/__m23_shared" output: "gpu_0/fpn_7_w_grad" output: "gpu_0/fpn_7_b_grad" output: "gpu_0/__m22_shared" name: "" type: "ConvGradient" arg { name: "kernel" i: 3 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 1 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 2 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "CUDNN" is_gradient_op: true
@ 0x7f2bdf712772 google::LogMessage::Fail()
PC: @ 0x0 (unknown)
*** SIGABRT (@0x3e8000012b7) received by PID 4791 (TID 0x7f2a6effd700) from PID 4791; stack trace: ***
@ 0x7f2bdf7126ce google::LogMessage::SendToLog()
@ 0x7f2c2670e130 (unknown)
@ 0x7f2bdf71204c google::LogMessage::Flush()
@ 0x7f2c25c6a5d7 __GI_raise
@ 0x7f2bdf71556d google::LogMessageFatal::~LogMessageFatal()
@ 0x7f2c25c6bcc8 __GI_abort
@ 0x7f2c1b1b1965 __gnu_cxx::__verbose_terminate_handler()
@ 0x7f2bdfdd1180 caffe2::CUDAContext::Delete()
@ 0x7f2c1b1af946 (unknown)
@ 0x7f2be27f42d9 std::_Sp_counted_base<>::_M_release()
@ 0x7f2c1b1af973 std::terminate()
@ 0x7f2c1b2062c5 (unknown)
@ 0x7f2bdfd377d1 caffe2::Tensor<>::ResizeLike<>()
@ 0x7f2c26706df5 start_thread
@ 0x7f2bdfd6e3e2 ZN6caffe210CuDNNState7executeIRZNS_19CudnnConvGradientOp13DoRunWithTypeIffffffffEEbvEUlPS0_E1_EEvP11CUstream_stOT
@ 0x7f2c25d2b1ad __clone
@ 0x7f2bdfd707e1 caffe2::CudnnConvGradientOp::DoRunWithType<>()
@ 0x0 (unknown)

image

The exact form of the error differs from run to run, but it is always "Encountered CUDA error: an illegal memory access was encountered".

I also rebuilt caffe2 with nccl-1.3.5 (following https://caffe2.ai/docs/getting-started.html?platform=centos&configuration=cloud#null__troubleshooting):

image

and switched the all-reduce implementation to nccl by passing USE_NCCL True to train_net.py, as in:

python2 tools/train_net.py --multi-gpu-testing \
  --cfg configs/12_2017_baselines/retinanet_R-101-FPN_1x_4gpus.yaml \
  OUTPUT_DIR results_retinanet_R-101-FPN_1x_4gpus_model USE_NCCL True

The error disappeared ^--^ both when using all four GPUs {0,1,2,3} and when using any two GPUs {0,2}, {0,3}, {1,2}, {1,3}.
@rbgirshick, thanks very much.


Hi, I enabled the NCCL op to train the tutorial network and the error above disappeared. However, the program hangs after loading data and occupies 100% CPU all the time.

.......
I0129 03:25:13.106998 118074 context_gpu.cu:321] GPU 0: 2175 MB
I0129 03:25:13.107028 118074 context_gpu.cu:321] GPU 1: 2078 MB
I0129 03:25:13.107045 118074 context_gpu.cu:321] GPU 2: 2266 MB
I0129 03:25:13.107059 118074 context_gpu.cu:321] GPU 3: 1860 MB
I0129 03:25:13.107072 118074 context_gpu.cu:325] Total: 8381 MB
I0129 03:25:13.122316 118079 context_gpu.cu:321] GPU 0: 2195 MB
I0129 03:25:13.122344 118079 context_gpu.cu:321] GPU 1: 2145 MB
I0129 03:25:13.122361 118079 context_gpu.cu:321] GPU 2: 2267 MB
I0129 03:25:13.122378 118079 context_gpu.cu:321] GPU 3: 1924 MB
I0129 03:25:13.122395 118079 context_gpu.cu:325] Total: 8532 MB
I0129 03:25:13.151623 118079 context_gpu.cu:321] GPU 0: 2245 MB
I0129 03:25:13.151650 118079 context_gpu.cu:321] GPU 1: 2159 MB
I0129 03:25:13.152823 118079 context_gpu.cu:321] GPU 2: 2269 MB
I0129 03:25:13.153623 118079 context_gpu.cu:321] GPU 3: 2020 MB
I0129 03:25:13.154454 118079 context_gpu.cu:325] Total: 8694 MB
I0129 03:25:13.186017 118079 context_gpu.cu:321] GPU 0: 2260 MB
I0129 03:25:13.186053 118079 context_gpu.cu:321] GPU 1: 2214 MB
I0129 03:25:13.186067 118079 context_gpu.cu:321] GPU 2: 2279 MB
I0129 03:25:13.186077 118079 context_gpu.cu:321] GPU 3: 2080 MB
I0129 03:25:13.186089 118079 context_gpu.cu:325] Total: 8835 MB
I0129 03:25:13.215306 118076 context_gpu.cu:321] GPU 0: 2310 MB
I0129 03:25:13.215342 118076 context_gpu.cu:321] GPU 1: 2269 MB
I0129 03:25:13.215351 118076 context_gpu.cu:321] GPU 2: 2308 MB
I0129 03:25:13.215368 118076 context_gpu.cu:321] GPU 3: 2081 MB
I0129 03:25:13.215384 118076 context_gpu.cu:325] Total: 8970 MB
I0129 03:25:13.307595 118084 context_gpu.cu:321] GPU 0: 2310 MB
I0129 03:25:13.307623 118084 context_gpu.cu:321] GPU 1: 2301 MB
I0129 03:25:13.307641 118084 context_gpu.cu:321] GPU 2: 2391 MB
I0129 03:25:13.307652 118084 context_gpu.cu:321] GPU 3: 2104 MB
I0129 03:25:13.307665 118084 context_gpu.cu:325] Total: 9108 MB
I0129 03:25:13.324935 118077 context_gpu.cu:321] GPU 0: 2312 MB
I0129 03:25:13.324965 118077 context_gpu.cu:321] GPU 1: 2313 MB
I0129 03:25:13.324982 118077 context_gpu.cu:321] GPU 2: 2452 MB
I0129 03:25:13.324993 118077 context_gpu.cu:321] GPU 3: 2171 MB
I0129 03:25:13.325011 118077 context_gpu.cu:325] Total: 9250 MB
I0129 03:25:13.343673 118080 context_gpu.cu:321] GPU 0: 2336 MB
I0129 03:25:13.343698 118080 context_gpu.cu:321] GPU 1: 2380 MB
I0129 03:25:13.343715 118080 context_gpu.cu:321] GPU 2: 2468 MB
I0129 03:25:13.343731 118080 context_gpu.cu:321] GPU 3: 2233 MB
I0129 03:25:13.343747 118080 context_gpu.cu:325] Total: 9417 MB
I0129 03:25:13.369802 118085 cuda_nccl_gpu.cc:110] Creating NCCLContext for key: 0:0,1,2,3,
I0129 03:25:13.381914 118076 context_gpu.cu:321] GPU 0: 2361 MB
I0129 03:25:13.381942 118076 context_gpu.cu:321] GPU 1: 2453 MB
I0129 03:25:13.381961 118076 context_gpu.cu:321] GPU 2: 2524 MB
I0129 03:25:13.381978 118076 context_gpu.cu:321] GPU 3: 2247 MB
I0129 03:25:13.381995 118076 context_gpu.cu:325] Total: 9587 MB
I0129 03:25:13.613253 118083 context_gpu.cu:321] GPU 0: 2388 MB
I0129 03:25:13.613292 118083 context_gpu.cu:321] GPU 1: 2525 MB
I0129 03:25:13.613301 118083 context_gpu.cu:321] GPU 2: 2524 MB
I0129 03:25:13.613308 118083 context_gpu.cu:321] GPU 3: 2310 MB
I0129 03:25:13.613315 118083 context_gpu.cu:325] Total: 9748 MB

the program hangs......

my environment:
Operating system: Ubuntu 16.04
Compiler version: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0
CUDA version: 8.0
cuDNN version: v5.1
NVIDIA driver version: 384.111

nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           Off  | 00001543:00:00.0 Off |                  Off |
| N/A   42C    P0    41W / 150W |      0MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           Off  | 00003134:00:00.0 Off |                  Off |
| N/A   42C    P0    39W / 150W |      0MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M60           Off  | 00004975:00:00.0 Off |                  Off |
| N/A   38C    P0    41W / 150W |      0MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M60           Off  | 0000F3E6:00:00.0 Off |                  Off |
| N/A   38C    P0    40W / 150W |      0MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

@lwher: that's unfortunate. The reason we don't use NCCL by default is that it's prone to causing deadlocks, which is what I think you're seeing.


After rebuilding caffe2 with NCCL, I reran the program with this script:
python tools/train_net.py --multi-gpu-testing \
  --cfg configs/getting_started/tutorial_4gpu_e2e_faster_rcnn_R-50-FPN.yaml \
  OUTPUT_DIR ./output USE_NCCL True

It throws this error:

Creating NCCLContext for key: 0:0,1,2,3,
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:

You should always run with libnvidia-ml.so that is installed with your
NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64.
libnvidia-ml.so in GDK package is a stub library that is attached only for
build purposes (e.g. machine that you build your application doesn't have
to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
terminate called after throwing an instance of 'caffe2::EnforceNotMet'
what(): [enforce fail at cuda_nccl_gpu.cc:40] status == ncclSuccess. 2 vs 0. Error at: /mnt/hzhida/project/caffe2/caffe2/contrib/nccl/cuda_nccl_gpu.cc40: system error Error from operator:
input: "gpu_0/rpn_cls_logits_fpn2_w_grad" input: "gpu_1/rpn_cls_logits_fpn2_w_grad" input: "gpu_2/rpn_cls_logits_fpn2_w_grad" input: "gpu_3/rpn_cls_logits_fpn2_w_grad" output: "gpu_0/rpn_cls_logits_fpn2_w_grad" output: "gpu_1/rpn_cls_logits_fpn2_w_grad" output: "gpu_2/rpn_cls_logits_fpn2_w_grad" output: "gpu_3/rpn_cls_logits_fpn2_w_grad" name: "" type: "NCCLAllreduce" device_option { device_type: 1 cuda_gpu_id: 0 }
*** Aborted at 1517210588 (unix time) try "date -d @1517210588" if you are using GNU date ***
PC: @ 0x7ff1e0383428 gsignal
*** SIGABRT (@0x3e800007a46) received by PID 31302 (TID 0x7fefb5ffb700) from PID 31302; stack trace: ***
I0129 07:23:08.187249 31591 cuda_nccl_gpu.cc:110] Creating NCCLContext for key: 0:0,1,2,3,

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:

You should always run with libnvidia-ml.so that is installed with your
NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64.
libnvidia-ml.so in GDK package is a stub library that is attached only for
build purposes (e.g. machine that you build your application doesn't have
to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
terminate called recursively
@ 0x7ff1e0729390 (unknown)
I0129 07:23:08.188051 31592 context_gpu.cu:321] GPU 0: 2466 MB
I0129 07:23:08.188074 31592 context_gpu.cu:321] GPU 1: 2387 MB
I0129 07:23:08.188091 31592 context_gpu.cu:321] GPU 2: 2311 MB
I0129 07:23:08.188099 31592 context_gpu.cu:321] GPU 3: 2382 MB
I0129 07:23:08.188107 31592 context_gpu.cu:325] Total: 9548 MB
@ 0x7ff1e0383428 gsignal
@ 0x7ff1e038502a abort
@ 0x7ff1da16284d __gnu_cxx::__verbose_terminate_handler()
@ 0x7ff1da1606b6 (unknown)
@ 0x7ff1da160701 std::terminate()
@ 0x7ff1da18bd38 (unknown)
@ 0x7ff1e071f6ba start_thread
@ 0x7ff1e045541d clone
@ 0x0 (unknown)
Aborted (core dumped)

Running Environment:
Operating system: Ubuntu 16.04
Compiler version: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0
CUDA version: 8.0
cuDNN version: v5.1
NVIDIA driver version: 384.111

nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           Off  | 00001543:00:00.0 Off |                  Off |
| N/A   42C    P0    41W / 150W |      0MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           Off  | 00003134:00:00.0 Off |                  Off |
| N/A   42C    P0    39W / 150W |      0MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M60           Off  | 00004975:00:00.0 Off |                  Off |
| N/A   38C    P0    41W / 150W |      0MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M60           Off  | 0000F3E6:00:00.0 Off |                  Off |
| N/A   38C    P0    40W / 150W |      0MiB /  8123MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

One additional note about NCCL: Caffe2 builds with NCCL by default so there is no need to rebuild it.

Jumping onto this: since the illegal memory access is from the Add operator, you might want to check whether direct peer access is available between the GPUs that you are using. The current Add op relies on that, and if it is not available we might indeed want to fix the code. Basically, to do so, in Python, do:

from caffe2.python import workspace
print(workspace.GetCudaPeerAccessPattern())

Could you paste the output of that for debugging? (Especially, if you are using CUDA_VISIBLE_DEVICES, make sure you invoke python with that too)
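For example (just a sketch, reusing the same device restriction as the training command above; adjust the ids to your setup):

CUDA_VISIBLE_DEVICES=1,3,5,7 python2 -c "from caffe2.python import workspace; print(workspace.GetCudaPeerAccessPattern())"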

@Yangqing output from your two debug lines:

[[ True  True False False]
[ True  True False False]
[False False  True  True]
[False False  True  True]]

thx for looking into this issue (and ... caffe/caffe2 frameworks!)

@jwnsu thanks! Just to confirm, so the Add operator is adding tensors across gpu {0,1} and {2,3} right? (I assume it is adding stuff together from the 4 gpus).

It's a 4-GPU config, with GPU ids specified as "0,1,2,4" (via CUDA_VISIBLE_DEVICES). If GPU ids are configured as "0,1,2,3" (the lowest GPU ids), it works fine without any error.

@Yangqing
My Linux server has 4 M60 GPUs.
This is my workspace.GetCudaPeerAccessPattern() output:
[[ True False False False]
[False True False False]
[False False True False]
[False False False True]]

I can train the net using 1 GPU fine, but when I train using 2 or 4 GPUs, I run into the same problems as above, even if I set NCCL = True.

Thanks guys. This verifies my assumption that the illegal memory access comes from the Add op not properly handling cross-device communications when peer access is not enabled. Will issue a fix.
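For reference, the failing pattern in the logs boils down to an Add op that runs on one GPU while reading an input that lives on another. A hypothetical minimal sketch (blob names and device ids are chosen only to mirror the logs above; it is expected to fail only when peer access between the two devices is unavailable):

import numpy as np
from caffe2.python import core, workspace
from caffe2.proto import caffe2_pb2

# Feed one small tensor to each GPU, mimicking the per-GPU gradient blobs.
workspace.FeedBlob("gpu_0/x", np.ones(4, dtype=np.float32),
                   device_option=core.DeviceOption(caffe2_pb2.CUDA, 0))
workspace.FeedBlob("gpu_1/x", np.ones(4, dtype=np.float32),
                   device_option=core.DeviceOption(caffe2_pb2.CUDA, 1))

# The Add runs on GPU 0 but reads an input resident on GPU 1, which needs peer access.
op = core.CreateOperator("Add", ["gpu_0/x", "gpu_1/x"], "gpu_0/x",
                         device_option=core.DeviceOption(caffe2_pb2.CUDA, 0))
workspace.RunOperatorOnce(op)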

Same problem in cross-device communication...
This machine can use 4 GPUs [0,1,2,3]:
image
This machine can use [0,1] and [2,3]:
image

BTW, I have used 12 CPUs and 4 Titan X GPUs to train a 3D Faster R-CNN in the PyTorch framework. Why doesn't PyTorch have this problem?


@Yangqing Since I can't train Detectron with multiple GPUs, I would like to know when the cross-GPU communication problem will be fixed. Thanks.

@Yangqing I ran into similar problems to those above. My Linux workstation has 2 GTX 1080 Tis. The error info is as follows:
[enforce fail at context_gpu.h:170] . Encountered CUDA error: an illegal memory access was encountered Error from operator: input: "gpu_0/rpn_cls_logits_fpn2_b_grad" input: "gpu_1/rpn_cls_logits_fpn2_b_grad" output: "gpu_0/rpn_cls_logits_fpn2_b_grad" name: "" type: "Add" device_option { device_type: 1 cuda_gpu_id: 0 }
and my workspace.GetCudaPeerAccessPattern() output is:
[[True False]
[False True]]
Is it a cross-GPU communication problem too? If not, can anyone help me fix it? Thanks.


Yes, it is the same problem. The gradients can't be added across GPUs because the GPUs can't communicate with each other directly. If you want to solve the problem, maybe you could copy the gradients from GPU to CPU, sum them up and average them, and at last copy the averaged gradient from CPU back to each GPU. @blateyang
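A rough sketch of that idea (illustrative only, not Detectron code; the cpu_allreduce helper and the reuse of the gpu_<id>/<blob> naming are assumptions for the example):

import numpy as np
from caffe2.python import core, workspace
from caffe2.proto import caffe2_pb2

def cpu_allreduce(grad_name, gpu_ids):
    # Pull each GPU's copy of the gradient into host memory.
    copies = [workspace.FetchBlob("gpu_%d/%s" % (g, grad_name)) for g in gpu_ids]
    avg = np.mean(copies, axis=0)
    # Push the averaged gradient back to every GPU.
    for g in gpu_ids:
        workspace.FeedBlob("gpu_%d/%s" % (g, grad_name), avg,
                           device_option=core.DeviceOption(caffe2_pb2.CUDA, g))

# e.g. cpu_allreduce("rpn_cls_logits_fpn2_w_grad", [0, 1, 2, 3])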

Thanks for your advice! @coolbrain But I can't understand why some people can successfully train models with two or more GPUs. Haven't they hit the same cross-GPU communication problem?

Training on 4 GPUs with either the lowest GPU ids (0,1,2,3) or the highest GPU ids (4,5,6,7) works here without any error (8 GPUs might work too, but I have not tried it yet). It only has issues with a mix of particular ids, e.g. "0,1,2,4" or "1,3,5,7".

I suspect the caffe2 cross-GPU communication issue may behave differently with individual hardware builds (rbgirshick mentioned earlier that the Facebook M40 server works with a mix of ids too).

I came across the same problem. Is this fixed?

I met the same problem on a workstation with 4 GTX 1080 Ti GPUs. Multi-GPU training works well on other platforms, such as Caffe and TensorFlow.
This is my workspace.GetCudaPeerAccessPattern() output:
[[ True True False False]
[ True True False False]
[False False True True]
[False False True True]]
The two-GPU configs ({0,1} or {2,3}) work well. Three or four GPUs run into the aforementioned problem. However, my error is not on the Add operation; I remember the type is Copy.

Has the issue been fixed?

@rbgirshick Hi, I met the same problem as @lwher. The program seems to get stuck with almost a 50% chance with NCCL on my machine with Ubuntu 14.04 and 4 GPUs. Is there a solution to avoid such behavior of NCCL? Many thanks!

@Yangqing Hi, I met the same issue in the Copy operator.
When I don't add the USE_NCCL True flag, the errors are as follows:

E0325 02:26:02.258566  8284 operator_schema.cc:73] Input index 0 and output idx 0 (gpu_0/res3_0_branch2a_w_grad) are set to be in-place but this is actually not supported by op Copy
Original python traceback for operator 2817 in network `generalized_rcnn` in exception above (most recent call last):
  File "tools/train_net.py", line 358, in <module>
  File "tools/train_net.py", line 196, in main
  File "tools/train_net.py", line 205, in train_model
  File "tools/train_net.py", line 283, in create_model
  File "/home/shuqin/git/RefineNet/lib/modeling/model_builder.py", line 120, in create
  File "/home/shuqin/git/RefineNet/lib/modeling/model_builder.py", line 92, in generalized_rcnn
  File "/home/shuqin/git/RefineNet/lib/modeling/model_builder.py", line 254, in build_generic_detection_model
  File "/home/shuqin/git/RefineNet/lib/modeling/optimizer.py", line 42, in build_data_parallel_model
  File "/home/shuqin/git/RefineNet/lib/modeling/optimizer.py", line 84, in _add_allreduce_graph
  File "/home/shuqin/git/caffe2/build/caffe2/python/muji.py", line 64, in Allreduce
  File "/home/shuqin/git/caffe2/build/caffe2/python/muji.py", line 204, in AllreduceFallback
Traceback (most recent call last):
  File "tools/train_net.py", line 358, in <module>
    main()
  File "tools/train_net.py", line 196, in main
    checkpoints = train_model()
  File "tools/train_net.py", line 210, in train_model
    setup_model_for_training(model, output_dir)
  File "tools/train_net.py", line 316, in setup_model_for_training
    workspace.CreateNet(model.net)
  File "/home/shuqin/git/caffe2/build/caffe2/python/workspace.py", line 166, in CreateNet
    StringifyProto(net), overwrite,
  File "/home/shuqin/git/caffe2/build/caffe2/python/workspace.py", line 192, in CallWithExceptionIntercept
    return func(*args, **kwargs)
RuntimeError: [enforce fail at operator.cc:125] schema->Verify(operator_def). Operator def did not pass schema checking: input: "gpu_0/res3_0_branch2a_w_grad" output: "gpu_0/res3_0_branch2a_w_grad" name: "" type: "Copy" device_option { device_type: 1 cuda_gpu_id: 0 }

If I add the USE_NCCL True flag, the errors then become:

Original python traceback for operator 2928 in network `generalized_rcnn` in exception above (most recent call last):
  File "tools/train_net.py", line 358, in <module>
  File "tools/train_net.py", line 196, in main
  File "tools/train_net.py", line 205, in train_model
  File "tools/train_net.py", line 283, in create_model
  File "/home/shuqin/git/RefineNet/lib/modeling/model_builder.py", line 120, in create
  File "/home/shuqin/git/RefineNet/lib/modeling/model_builder.py", line 92, in generalized_rcnn
  File "/home/shuqin/git/RefineNet/lib/modeling/model_builder.py", line 254, in build_generic_detection_model
  File "/home/shuqin/git/RefineNet/lib/modeling/optimizer.py", line 42, in build_data_parallel_model
  File "/home/shuqin/git/RefineNet/lib/modeling/optimizer.py", line 82, in _add_allreduce_graph
Traceback (most recent call last):
  File "tools/train_net.py", line 358, in <module>
    main()
  File "tools/train_net.py", line 196, in main
    checkpoints = train_model()
  File "tools/train_net.py", line 217, in train_model
    workspace.RunNet(model.net.Proto().name)
  File "/home/shuqin/git/caffe2/build/caffe2/python/workspace.py", line 230, in RunNet
    StringifyNetName(name), num_iter, allow_fail,
  File "/home/shuqin/git/caffe2/build/caffe2/python/workspace.py", line 192, in CallWithExceptionIntercept
    return func(*args, **kwargs)
RuntimeError: [enforce fail at cuda_nccl_gpu.cc:40] status == ncclSuccess. 2 vs 0.  Error at: /home/shuqin/git/caffe2/caffe2/contrib/nccl/cuda_nccl_gpu.cc40: system error Error from operator:
input: "gpu_0/rpn_cls_logits_fpn2_b_grad" input: "gpu_1/rpn_cls_logits_fpn2_b_grad" input: "gpu_2/rpn_cls_logits_fpn2_b_grad" output: "gpu_0/rpn_cls_logits_fpn2_b_grad" output: "gpu_1/rpn_cls_logits_fpn2_b_grad" output: "gpu_2/rpn_cls_logits_fpn2_b_grad" name: "" type: "NCCLAllreduce" device_option { device_type: 1 cuda_gpu_id: 0 }

My system is Ubuntu 14.04, with CUDA 8.0 and cuDNN 5.1. My machine has 8 GPUs but I tested the code only on the last 4, so communication between the GPUs should not be a problem. I use NCCL 2.1.15 for CUDA 8.0.

Hope this issue can be fixed soon. It's pretty annoying.


This problem still exists, right?

By adding 'USE_NCCL True' when running multi-GPU training, I successfully got my training started. Although a deadlock may sometimes happen, you can try to modify some training params such as the learning rate to work around it.


The problem still exists.


@xieshuqin I met the same problem 'status == ncclSuccess. 2 vs 0.' with you when use 'USE_NCCL True'.How do you solve this problem?Thanks

@pkuxwguan My issue has been fixed but I forgot how I fixed it. Sorry about that. But I do remember the problem should be related to a wrong installation of NCCL.

Hi all, I also suffered from this issue, so I finally fixed it by myself. pytorch/pytorch#6896 solved this issue :)

Can anybody tell me whether I can run Mask R-CNN with only one GPU?

@daquexian I tried your PR, it works!!! Thanks very much

@daquexian This PR doesn't appear to work for me. I'm experiencing deadlocks while using a single GPU without NCCL and also while using 2 GPUs with USE_NCCL True. After changing muji.py according to your PR and running with 2 GPUs with USE_NCCL True, I'm still experiencing a deadlock; the training just pauses at random iteration numbers.

Maybe I'm missing something, but if I set USE_NCCL=False, and use your modified muji.py and muji_test.py PR, I get the original error:

I0502 14:35:57.192476 79712 context_gpu.cu:318] Total: 23025 MB
E0502 14:35:58.382604 79711 net_dag.cc:195] Exception from operator chain starting at '' (type 'Add'): caffe2::EnforceNotMet: [enforce fail at context_gpu.h:156] . Encountered CUDA error: an illegal memory access was encountered Error from operator: 
input: "gpu_0/rpn_cls_logits_fpn2_b_grad" input: "gpu_1/rpn_cls_logits_fpn2_b_grad" output: "gpu_0/rpn_cls_logits_fpn2_b_grad" name: "" type: "Add" device_option { device_type: 1 cuda_gpu_id: 0 }
E0502 14:35:58.382622 79712 net_dag.cc:195] Secondary exception from operator chain starting at '' (type 'Add'): caffe2::EnforceNotMet: [enforce fail at context_gpu.h:156] . Encountered CUDA error: an illegal memory access was encountered Error from operator: 
input: "gpu_0/rpn_cls_logits_fpn2_w_grad" input: "gpu_1/rpn_cls_logits_fpn2_w_grad" output: "gpu_0/rpn_cls_logits_fpn2_w_grad" name: "" type: "Add" device_option { device_type: 1 cuda_gpu_id: 0 }
F0502 14:35:58.382670 79711 context_gpu.h:107] Check failed: error == cudaSuccess an illegal memory access was encountered
*** Check failure stack trace: ***
F0502 14:35:58.382683 79712 context_gpu.h:107] Check failed: error == cudaSuccess an illegal memory access was encountered
*** Check failure stack trace: ***
E0502 14:35:58.383510 79709 net_dag.cc:195] Secondary exception from operator chain starting at '' (type 'ConvGradient'): caffe2::EnforceNotMet: [enforce fail at context_gpu.cu:336] error == cudaSuccess. 77 vs 0. Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:336: an illegal memory access was encountered Error from operator: 
input: "gpu_1/fpn_res3_3_sum" input: "gpu_1/conv_rpn_fpn2_w" input: "gpu_1/__m18_shared" output: "_gpu_1/conv_rpn_fpn2_w_grad_autosplit_2" output: "_gpu_1/conv_rpn_fpn2_b_grad_autosplit_2" output: "_gpu_1/fpn_res3_3_sum_grad_autosplit_0" name: "" type: "ConvGradient" arg { name: "kernel" i: 3 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 1 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 1 } device_option { device_type: 1 cuda_gpu_id: 1 } engine: "CUDNN" is_gradient_op: true
E0502 14:35:58.383541 79713 net_dag.cc:195] Secondary exception from operator chain starting at '' (type 'ConvGradient'): caffe2::EnforceNotMet: [enforce fail at conv_op_cudnn.cc:1290] status == CUDNN_STATUS_SUCCESS. 8 vs 0. , Error at: /home/markable-ai/pytorch/caffe2/operators/conv_op_cudnn.cc:1290: CUDNN_STATUS_EXECUTION_FAILED Error from operator: 
input: "gpu_3/conv_rpn_fpn4" input: "gpu_3/rpn_bbox_pred_fpn2_w" input: "gpu_3/rpn_bbox_pred_fpn4_grad" output: "_gpu_3/rpn_bbox_pred_fpn2_w_grad_autosplit_1" output: "_gpu_3/rpn_bbox_pred_fpn2_b_grad_autosplit_1" output: "gpu_3/__m13_shared" name: "" type: "ConvGradient" arg { name: "kernel" i: 1 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 0 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 1 } device_option { device_type: 1 cuda_gpu_id: 3 } engine: "CUDNN" is_gradient_op: true
E0502 14:35:58.383591 79706 net_dag.cc:195] Secondary exception from operator chain starting at '' (type 'ConvGradient'): caffe2::EnforceNotMet: [enforce fail at context_gpu.cu:336] error == cudaSuccess. 77 vs 0. Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:336: an illegal memory access was encountered Error from operator: 
input: "gpu_3/conv_rpn_fpn3" input: "gpu_3/rpn_cls_logits_fpn2_w" input: "gpu_3/rpn_cls_logits_fpn3_grad" output: "_gpu_3/rpn_cls_logits_fpn2_w_grad_autosplit_2" output: "_gpu_3/rpn_cls_logits_fpn2_b_grad_autosplit_2" output: "_gpu_3/conv_rpn_fpn3_grad_autosplit_0" name: "" type: "ConvGradient" arg { name: "kernel" i: 1 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 0 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 1 } device_option { device_type: 1 cuda_gpu_id: 3 } engine: "CUDNN" is_gradient_op: true
F0502 14:35:58.382683 79712 context_gpu.h:107] Check failed: error == cudaSuccess an illegal memory access was encounteredF0502 14:35:58.434631 79709 context_gpu.h:107] FCheck failed: error == cudaSuccess an illegal memory access was encountered0502 14:35:58.434648 79713 c*** Check failure stack trace: ***
E0502 14:35:58.383741 79700 net_dag.cc:195] Secondary exception from operator chain starting at '' (type 'ConvGradient'): caffe2::EnforceNotMet: [enforce fail at context_gpu.cu:336] error == cudaSuccess. 77 vs 0. Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:336: an illegal memory access was encountered Error from operator: 
input: "gpu_3/conv_rpn_fpn2" input: "gpu_3/rpn_cls_logits_fpn2_w" input: "gpu_3/rpn_cls_logits_fpn2_grad" output: "_gpu_3/rpn_cls_logits_fpn2_w_grad_autosplit_3" output: "_gpu_3/rpn_cls_logits_fpn2_b_grad_autosplit_3" output: "_gpu_3/conv_rpn_fpn2_grad_autosplit_0" name: "" type: "ConvGradient" arg { name: "kernel" i: 1 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 0 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 1 } device_option { device_type: 1 cuda_gpu_id: 3 } engine: "CUDNN" is_gradient_op: true
Aborted (core dumped)

I'm using CUDA 9.1 and cuDNN 7.1 with 4 V100s.

@Feynman27 Could you tell me which branch of Allreduce (like Allreduce4, Allreduce4Group2, Allreduce2 or others) in the updated muji.py is entered? You might want to add some print statements in these branches to find out. And what happens if you replace the implementation of Allreduce with a direct call to AllreduceFallback? It would be great if you could also provide your GPU access pattern like in #32 (comment). Thanks!
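One non-invasive way to see which branch runs (a sketch that assumes the training code reaches these helpers through the muji module, so wrapping the module attributes is enough) is to install a logging shim before training starts:

from caffe2.python import muji

def _log_calls(name, fn):
    def wrapped(*args, **kwargs):
        print("muji.%s called" % name)
        return fn(*args, **kwargs)
    return wrapped

# Wrap whichever all-reduce variants exist in this muji version.
for _name in ("Allreduce2", "Allreduce4", "Allreduce4Group2", "AllreduceFallback"):
    if hasattr(muji, _name):
        setattr(muji, _name, _log_calls(_name, getattr(muji, _name)))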

Allreduce4 is being called. The gpu access pattern is:

>>> from caffe2.python import workspace
>>> print(workspace.GetCudaPeerAccessPattern())
[[ True False False False]
 [False  True False False]
 [False False  True False]
 [False False False  True]]

I'll try calling AllreduceFallback.

Calling AllreduceFallback gives a similar error as above:

I0502 17:08:51.294476 88651 context_gpu.cu:318] Total: 22524 MB
E0502 17:08:52.009866 88659 net_dag.cc:195] Exception from operator chain starting at '' (type 'Add'): caffe2::EnforceNotMet: [enforce fail at context_gpu.h:156] . Encountered CUDA error: an illegal memory access was encountered Error from operator: 
input: "gpu_0/rpn_cls_logits_fpn2_w_grad" input: "gpu_1/rpn_cls_logits_fpn2_w_grad" output: "gpu_0/rpn_cls_logits_fpn2_w_grad" name: "" type: "Add" device_option { device_type: 1 cuda_gpu_id: 0 }
F0502 17:08:52.009990 88659 context_gpu.h:107] Check failed: error == cudaSuccess an illegal memory access was encountered
*** Check failure stack trace: ***
E0502 17:08:52.010440 88651 net_dag.cc:195] Secondary exception from operator chain starting at '' (type 'ConvGradient'): caffe2::EnforceNotMet: [enforce fail at context_gpu.cu:336] error == cudaSuccess. 77 vs 0. Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:336: an illegal memory access was encountered Error from operator: 
input: "gpu_2/fpn_res3_3_sum" input: "gpu_2/conv_rpn_fpn2_w" input: "gpu_2/__m15_shared" output: "_gpu_2/conv_rpn_fpn2_w_grad_autosplit_2" output: "_gpu_2/conv_rpn_fpn2_b_grad_autosplit_2" output: "_gpu_2/fpn_res3_3_sum_grad_autosplit_0" name: "" type: "ConvGradient" arg { name: "kernel" i: 3 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 1 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 1 } device_option { device_type: 1 cuda_gpu_id: 2 } engine: "CUDNN" is_gradient_op: true
E0502 17:08:52.010524 88663 net_dag.cc:195] Secondary exception from operator chain starting at '' (type 'ConvGradient'): caffe2::EnforceNotMet: [enforce fail at context_gpu.cu:336] error == cudaSuccess. 77 vs 0. Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:336: an illegal memory access was encountered Error from operator: 
input: "gpu_1/fpn_res2_2_sum" input: "gpu_1/conv_rpn_fpn2_w" input: "gpu_1/__m12_shared" output: "_gpu_1/conv_rpn_fpn2_w_grad_autosplit_3" output: "_gpu_1/conv_rpn_fpn2_b_grad_autosplit_3" output: "_gpu_1/fpn_res2_2_sum_grad_autosplit_0" name: "" type: "ConvGradient" arg { name: "kernel" i: 3 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 1 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 1 } device_option { device_type: 1 cuda_gpu_id: 1 } engine: "CUDNN" is_gradient_op: true
F0502 17:08:52.010545 88660 context_gpu.cu:387] Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:387: an illegal memory access was encountered
*** Check failure stack trace: ***
F0502 17:08:52.010545 88660 context_gpu.cu:387] Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:387: an illegal memory access was encounteredF0502 17:08:52.061641 88651 context_gpu.hF107] 502 17:Ch:ck failed: error == cudaSuccess 52.061651 88663 context_gpu.h:
E0502 17:08:52.010577 88653 net_dag.cc:195] Secondary exception from operator chain starting at '' (type 'ConvGradient'): caffe2::EnforceNotMet: [enforce fail at context_gpu.cu:336] error == cudaSuccess. 77 vs 0. Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:336: an illegal memory access was encountered Error from operator: 
input: "gpu_0/fpn_res4_22_sum" input: "gpu_0/conv_rpn_fpn2_w" input: "gpu_0/__m15_shared" output: "_gpu_0/conv_rpn_fpn2_w_grad_autosplit_1" output: "_gpu_0/conv_rpn_fpn2_b_grad_autosplit_1" output: "_gpu_0/fpn_res4_22_sum_grad_autosplit_0" name: "" type: "ConvGradient" arg { name: "kernel" i: 3 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 1 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 1 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "CUDNN" is_gradient_op: true
*** Check failure stack trace: ***
F0502 17:08:52.010545 88660 context_gpu.cu:387] Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:387: an illegal memory access was encounteredF0502 17:08:52.061641 88651 context_gpu.hF107] 502 17:Ch:ck failed: error == cudaSuccess 52.061651 88663 context_gpu.h:
07] Check failed: error == cudaSuccess an illegal memory access was encountered
*** Check failure stack trace: ***
F0502 17:08:52.010545 88660 context_gpu.cu:387] Error at: /home/markable-ai/pytorch/caffe2/core/context_gpu.cu:387: an illegal memory access was encounteredF0502 17:08:52.061641 88651 context_gpu.hF107] 502 17:Ch:ck failed: error == cudaSuccess 52.061651 88663 context_gpu.h:
07] Check failed: error == cudaSuccess an illegal memory access was encounteredF0502 17:08:52.061749 88653 context_gpu.h:107] Check failed: error == cudaSuccess an illegal memory access was encountered
*** Check failure stack trace: ***
Aborted (core dumped

@Feynman27 That's strange. According to your GPU access pattern, AllreduceFallback, not Allreduce4, should be called. And when you called AllreduceFallback manually, the error message doesn't appear to come from AllreduceFallback. Did you change the muji.py in the right folder? For example, if the Python package of caffe2 is in /usr/lib/python/site-packages/caffe2, then changing the muji.py in caffe2's source folder (like ~/caffe2/python) will not work.

@Feynman27 did you rebuild caffe2?

@daquexian The caffe2 package is installed under pytorch/caffe2, not /usr/lib/python/site-packages/caffe2 or anything else. I've set my $PYTHONPATH to look in this directory. I've also confirmed this by:

Python 2.7.14 |Anaconda, Inc.| (default, Mar 27 2018, 17:29:31) 
[GCC 7.2.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import caffe2
>>> caffe2.__file__
'/home/markable-ai/pytorch/build/caffe2/__init__.pyc'
>>> from caffe2.python import muji
>>> muji.__file__
'/home/markable-ai/pytorch/build/caffe2/python/muji.pyc'
>>> 

I simply modified the muji.py file under pytorch/caffe2/python/muji.py.

@yuzcccc I didn't rebuild caffe2, but why would I have to? I'm only modifying a python file.

@Feynman27 I think you should modify muji.py under /home/markable-ai/pytorch/build/caffe2/python/muji.py

Yep, that was my oversight. Good catch. I was modifying pytorch/caffe2/python/muji.py and should have modified pytorch/build/caffe2/python/muji.py.

@Feynman27 I'm happy to see it working :)
@Yangqing Could you please review my PR pytorch/pytorch#6896? It may help many Detectron users :)

@daquexian Unfortunately, I still seem to be experiencing deadlocks.

@Feynman27 Hmm.. What is the value of USE_NCCL? It should be False

Yes, USE_NCCL was set to false.

@Feynman27 Sorry, I have no idea why it causes a deadlock. It's hard for me to reproduce.

Fair enough. For all I know, the deadlock I'm experiencing could be unrelated to whether or not GPU peer access is enabled. Your PR definitely allowed me to start training with USE_NCCL=False. I'm running on Azure machines, so it could be related to running on their VMs. I've started training on local machines with 2 TitanXs and the training seems to be progressing just fine.

@daquexian Thanks! Your PR worked for me!

Looks like this issue can be closed.

@gadcam thanks for helping to identify issues that can be closed!

For this one, I'd like to leave it open until there's a fix merged into Caffe2.

@rbgirshick Unfortunately no one has reviewed my PR :|

@rbgirshick Thanks! My PR pytorch/pytorch#6896 has been merged. It looks like this issue can be closed :)