Duankaiwen / CenterNet

Code for our paper "CenterNet: Keypoint Triplets for Object Detection".

Train error: RuntimeError: CuDNN error: CUDNN_STATUS_MAPPING_ERROR

guohaoyuan opened this issue

loading all datasets...
using 4 threads
loading from cache file: cache/coco_trainval2014.pkl
No cache file found...
loading annotations into memory...
Done (t=7.66s)
creating index...
index created!
82783it [00:23, 3500.57it/s]
loading annotations into memory...
Done (t=7.17s)
creating index...
index created!
loading from cache file: cache/coco_trainval2014.pkl
loading annotations into memory...
Done (t=7.03s)
creating index...
index created!
loading from cache file: cache/coco_trainval2014.pkl
loading annotations into memory...
Done (t=5.85s)
creating index...
index created!
loading from cache file: cache/coco_trainval2014.pkl
loading annotations into memory...
Done (t=6.67s)
creating index...
index created!
loading from cache file: cache/coco_minival2014.pkl
No cache file found...
loading annotations into memory...
Done (t=2.40s)
creating index...
index created!
40504it [00:11, 3429.47it/s]
loading annotations into memory...
Done (t=5.35s)
creating index...
index created!
system config...
{'batch_size': 2,
'cache_dir': 'cache',
'chunk_sizes': [2],
'config_dir': 'config',
'data_dir': '../data',
'data_rng': <mtrand.RandomState object at 0x7f5488fbfea0>,
'dataset': 'MSCOCO',
'decay_rate': 10,
'display': 5,
'learning_rate': 0.00025,
'max_iter': 480000,
'nnet_rng': <mtrand.RandomState object at 0x7f5488fbff30>,
'opt_algo': 'adam',
'prefetch_size': 6,
'pretrain': None,
'result_dir': 'results',
'sampling_function': 'kp_detection',
'snapshot': 5000,
'snapshot_name': 'CenterNet-52',
'stepsize': 450000,
'test_split': 'testdev',
'train_split': 'trainval',
'val_iter': 500,
'val_split': 'minival',
'weight_decay': False,
'weight_decay_rate': 1e-05,
'weight_decay_type': 'l2'}
db config...
{'ae_threshold': 0.5,
'border': 128,
'categories': 80,
'data_aug': True,
'gaussian_bump': True,
'gaussian_iou': 0.7,
'gaussian_radius': -1,
'input_size': [511, 511],
'kp_categories': 1,
'lighting': True,
'max_per_image': 100,
'merge_bbox': False,
'nms_algorithm': 'exp_soft_nms',
'nms_kernel': 3,
'nms_threshold': 0.5,
'output_sizes': [[128, 128]],
'rand_color': True,
'rand_crop': True,
'rand_pushes': False,
'rand_samples': False,
'rand_scale_max': 1.4,
'rand_scale_min': 0.6,
'rand_scale_step': 0.1,
'rand_scales': array([ 0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2, 1.3]),
'special_crop': False,
'test_scales': [1],
'top_k': 70,
'weight_exp': 8}
len of db: 82783
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
building model...
module_file: models.CenterNet-52
start prefetching data...
shuffling indices...
/home/yangxilab/anaconda3/envs/CenterNet/lib/python3.6/site-packages/torch/cuda/__init__.py:114: UserWarning:
Found GPU0 TITAN RTX which requires CUDA_VERSION >= 9000 for
optimal performance and fast startup time, but your PyTorch was compiled
with CUDA_VERSION 8000. Please install the correct PyTorch binary
using instructions from http://pytorch.org

warnings.warn(incorrect_binary_warn % (d, name, 9000, CUDA_VERSION))
/home/yangxilab/anaconda3/envs/CenterNet/lib/python3.6/site-packages/torch/cuda/__init__.py:114: UserWarning:
Found GPU1 TITAN RTX which requires CUDA_VERSION >= 9000 for
optimal performance and fast startup time, but your PyTorch was compiled
with CUDA_VERSION 8000. Please install the correct PyTorch binary
using instructions from http://pytorch.org

warnings.warn(incorrect_binary_warn % (d, name, 9000, CUDA_VERSION))
total parameters: 104844152
setting learning rate to: 0.00025
training start...
0%| | 0/480000 [00:00<?, ?it/s]Traceback (most recent call last):
File "train.py", line 203, in
train(training_dbs, validation_db, args.start_iter)
File "train.py", line 138, in train
training_loss, focal_loss, pull_loss, push_loss, regr_loss = nnet.train(**training)
File "/home/yangxilab/GHY/GHY/CenterNet/nnet/py_factory.py", line 82, in train
loss_kp = self.network(xs, ys)
File "/home/yangxilab/anaconda3/envs/CenterNet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "/home/yangxilab/GHY/GHY/CenterNet/models/py_utils/data_parallel.py", line 70, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/yangxilab/GHY/GHY/CenterNet/models/py_utils/data_parallel.py", line 80, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/yangxilab/anaconda3/envs/CenterNet/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 77, in parallel_apply
raise output
File "/home/yangxilab/anaconda3/envs/CenterNet/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 53, in _worker
output = module(*input, **kwargs)
File "/home/yangxilab/anaconda3/envs/CenterNet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "/home/yangxilab/GHY/GHY/CenterNet/nnet/py_factory.py", line 20, in forward
preds = self.model(*xs, **kwargs)
File "/home/yangxilab/anaconda3/envs/CenterNet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "/home/yangxilab/GHY/GHY/CenterNet/nnet/py_factory.py", line 32, in forward
return self.module(*xs, **kwargs)
File "/home/yangxilab/anaconda3/envs/CenterNet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "/home/yangxilab/GHY/GHY/CenterNet/models/py_utils/kp.py", line 289, in forward
return self._train(*xs, **kwargs)
File "/home/yangxilab/GHY/GHY/CenterNet/models/py_utils/kp.py", line 193, in _train
inter = self.pre(image)
File "/home/yangxilab/anaconda3/envs/CenterNet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "/home/yangxilab/anaconda3/envs/CenterNet/lib/python3.6/site-packages/torch/nn/modules/container.py", line 91, in forward
input = module(input)
File "/home/yangxilab/anaconda3/envs/CenterNet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "/home/yangxilab/GHY/GHY/CenterNet/models/py_utils/utils.py", line 14, in forward
conv = self.conv(x)
File "/home/yangxilab/anaconda3/envs/CenterNet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "/home/yangxilab/anaconda3/envs/CenterNet/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 301, in forward
self.padding, self.dilation, self.groups)
RuntimeError: CuDNN error: CUDNN_STATUS_MAPPING_ERROR
#####################
Please help this poor boy!

I have the same problem.

Solved it by running the code with pytorch 1.0: https://github.com/UmarSpa/CenterNet
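
If it helps, here is a quick sanity check, just a sketch built on the standard torch API (nothing specific to this repo), to confirm that the installed PyTorch binary actually matches your GPUs. The UserWarning in the log above, about a CUDA_VERSION 8000 build on a TITAN RTX, is exactly this kind of mismatch:

```python
# Rough environment check: the PyTorch build, its CUDA/cuDNN versions, and the
# compute capability of each GPU should all be compatible with one another.
import torch

print("PyTorch version:", torch.__version__)
print("Built with CUDA:", torch.version.cuda)            # None means a CPU-only build
print("cuDNN version:  ", torch.backends.cudnn.version())
print("CUDA available: ", torch.cuda.is_available())

for i in range(torch.cuda.device_count()):
    name = torch.cuda.get_device_name(i)
    cap  = torch.cuda.get_device_capability(i)           # TITAN RTX reports (7, 5)
    print(f"GPU{i}: {name}, compute capability {cap}")
```

If the CUDA version reported here is older than what the GPU needs (as in the warning), the CUDNN_STATUS_MAPPING_ERROR is most likely a symptom of that mismatch rather than of the training code, and reinstalling a matching PyTorch binary is the fix.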

> Solved it by running the code with pytorch 1.0: https://github.com/UmarSpa/CenterNet

loading all datasets...
using 4 threads
loading from cache file: cache/coco_trainval2014.pkl
loading annotations into memory...
Done (t=6.77s)
creating index...
index created!
loading from cache file: cache/coco_trainval2014.pkl
loading annotations into memory...
Done (t=6.04s)
creating index...
index created!
loading from cache file: cache/coco_trainval2014.pkl
loading annotations into memory...
Done (t=7.50s)
creating index...
index created!
loading from cache file: cache/coco_trainval2014.pkl
loading annotations into memory...
Done (t=6.41s)
creating index...
index created!
loading from cache file: cache/coco_minival2014.pkl
loading annotations into memory...
Done (t=1.92s)
creating index...
index created!
system config...
{'batch_size': 8,
'cache_dir': 'cache',
'chunk_sizes': [4, 4],
'config_dir': 'config',
'data_dir': './data',
'data_rng': <mtrand.RandomState object at 0x7f0ec45307e0>,
'dataset': 'MSCOCO',
'decay_rate': 10,
'display': 5,
'learning_rate': 0.00025,
'max_iter': 480000,
'nnet_rng': <mtrand.RandomState object at 0x7f0ec4530828>,
'opt_algo': 'adam',
'prefetch_size': 6,
'pretrain': None,
'result_dir': 'results',
'sampling_function': 'kp_detection',
'snapshot': 5000,
'snapshot_name': 'CenterNet-52',
'stepsize': 450000,
'test_split': 'testdev',
'train_split': 'trainval',
'val_iter': 500,
'val_split': 'minival',
'weight_decay': False,
'weight_decay_rate': 1e-05,
'weight_decay_type': 'l2'}
db config...
{'ae_threshold': 0.5,
'border': 128,
'categories': 80,
'data_aug': True,
'gaussian_bump': True,
'gaussian_iou': 0.7,
'gaussian_radius': -1,
'input_size': [511, 511],
'kp_categories': 1,
'lighting': True,
'max_per_image': 100,
'merge_bbox': False,
'nms_algorithm': 'exp_soft_nms',
'nms_kernel': 3,
'nms_threshold': 0.5,
'output_sizes': [[128, 128]],
'rand_color': True,
'rand_crop': True,
'rand_pushes': False,
'rand_samples': False,
'rand_scale_max': 1.4,
'rand_scale_min': 0.6,
'rand_scale_step': 0.1,
'rand_scales': array([0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2, 1.3]),
'special_crop': False,
'test_scales': [1],
'top_k': 70,
'weight_exp': 8}
len of db: 82783
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
start prefetching data...
shuffling indices...
start prefetching data...
building model...
module_file: models.CenterNet-52
shuffling indices...
total parameters: 104844152
setting learning rate to: 0.00025
training start...
0%| | 0/480000 [00:00<?, ?it/s]THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1544174967633/work/aten/src/THC/THCGeneral.cpp line=405 error=11 : invalid argument

Traceback (most recent call last):
File "train.py", line 203, in
train(training_dbs, validation_db, args.start_iter)
File "train.py", line 138, in train
training_loss, focal_loss, pull_loss, push_loss, regr_loss = nnet.train(**training)
File "/home/yangxilab/GHY/GHY/CenterNet1.0-master/nnet/py_factory.py", line 82, in train
loss_kp = self.network(xs, ys)
File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/yangxilab/GHY/GHY/CenterNet1.0-master/models/py_utils/data_parallel.py", line 70, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/yangxilab/GHY/GHY/CenterNet1.0-master/models/py_utils/data_parallel.py", line 80, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
raise output
File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
output = module(*input, **kwargs)
File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/yangxilab/GHY/GHY/CenterNet1.0-master/nnet/py_factory.py", line 20, in forward
preds = self.model(*xs, **kwargs)
File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/yangxilab/GHY/GHY/CenterNet1.0-master/nnet/py_factory.py", line 32, in forward
return self.module(*xs, **kwargs)
File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/yangxilab/GHY/GHY/CenterNet1.0-master/models/py_utils/kp.py", line 289, in forward
return self._train(*xs, **kwargs)
File "/home/yangxilab/GHY/GHY/CenterNet1.0-master/models/py_utils/kp.py", line 193, in _train
inter = self.pre(image)
File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/yangxilab/GHY/GHY/CenterNet1.0-master/models/py_utils/utils.py", line 14, in forward
conv = self.conv(x)
File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call
result = self.forward(*input, **kwargs)
File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 320, in forward
self.padding, self.dilation, self.groups)
RuntimeError: cuda runtime error (11) : invalid argument at /opt/conda/conda-bld/pytorch_1544174967633/work/aten/src/THC/THCGeneral.cpp:405
Exception in thread Thread-1:
Traceback (most recent call last):
File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "train.py", line 51, in pin_memory
data = data_queue.get()
File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/multiprocessing/queues.py", line 113, in get
return _ForkingPickler.loads(res)
File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 256, in rebuild_storage_fd
fd = df.detach()
File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/multiprocessing/resource_sharer.py", line 87, in get_connection
c = Client(address, authkey=process.current_process().authkey)
File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/multiprocessing/connection.py", line 493, in Client
answer_challenge(c, authkey)
File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/multiprocessing/connection.py", line 737, in answer_challenge
response = connection.recv_bytes(256) # reject large message
File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer

Exception in thread Thread-2:
Traceback (most recent call last):
File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "train.py", line 51, in pin_memory
data = data_queue.get()
File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/multiprocessing/queues.py", line 113, in get
return _ForkingPickler.loads(res)
File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 256, in rebuild_storage_fd
fd = df.detach()
File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/multiprocessing/resource_sharer.py", line 87, in get_connection
c = Client(address, authkey=process.current_process().authkey)
File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/multiprocessing/connection.py", line 493, in Client
answer_challenge(c, authkey)
File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/multiprocessing/connection.py", line 732, in answer_challenge
message = connection.recv_bytes(256) # reject large message
File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer
Fatal Python error: could not acquire lock for <_io.BufferedWriter name=''> at interpreter shutdown, possibly due to daemon threads

Thread 0x00007f0d33ab2700 (most recent call first):
File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/threading.py", line 926 in _bootstrap_inner
File "/home/yangxilab/anaconda3/envs/CenterNet-PT10/lib/python3.6/threading.py", line 884 in _bootstrap

Current thread 0x00007f0f1b1a8700 (most recent call first):
Aborted (core dumped)
#####################################################
Thank you for your code! But I have run into this problem, and it does not look like it is caused by batch_size or chunk_sizes.
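
For reference, a minimal consistency check for those two settings, assuming (my reading of the config, not something stated in this repo's docs) that chunk_sizes is simply the per-GPU split of batch_size; the values are taken from the system config dump above:

```python
# Sanity-check the data-parallel split: chunk_sizes should partition batch_size
# across the visible GPUs. Values below come from the "system config" printout.
import torch

batch_size  = 8
chunk_sizes = [4, 4]   # one chunk per GPU

assert sum(chunk_sizes) == batch_size, "chunk_sizes must sum to batch_size"
assert len(chunk_sizes) <= torch.cuda.device_count(), "more chunks than visible GPUs"
print("batch_size / chunk_sizes are consistent")
```

Both checks pass with these values on two GPUs, which supports the point that the crash is not a batch/chunk configuration issue.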

> Solved it by running the code with pytorch 1.0: https://github.com/UmarSpa/CenterNet
Thank you again for your work! I have solved this problem: the CUDA version was the cause.
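
For anyone who runs into the same errors later: once the PyTorch build matches the installed CUDA version and the GPU architecture, a minimal smoke test like the sketch below (plain PyTorch, not part of CenterNet) should run a small cuDNN convolution on every visible GPU without CUDNN_STATUS_MAPPING_ERROR or the "invalid argument" failure:

```python
# Minimal cuDNN smoke test: one small convolution per visible GPU. If the
# PyTorch / CUDA / driver combination is healthy, this finishes without errors.
import torch
import torch.nn as nn

for i in range(torch.cuda.device_count()):
    with torch.cuda.device(i):
        conv = nn.Conv2d(3, 16, kernel_size=3, padding=1).cuda()
        x = torch.randn(2, 3, 64, 64).cuda()
        y = conv(x)
        torch.cuda.synchronize()
        print("GPU%d (%s): conv output %s OK"
              % (i, torch.cuda.get_device_name(i), tuple(y.shape)))
```

If this script fails with the same errors as train.py, the problem is in the environment (PyTorch/CUDA/cuDNN/driver combination), not in the CenterNet code.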