wutianyiRosun / CGNet

CGNet: A Light-weight Context Guided Network for Semantic Segmentation [IEEE Transactions on Image Processing 2020]

CUDA runtime error when running the Cityscapes training script (cityscapes_train.py)

nithishc829 opened this issue

commented

I am getting an error while training on the Cityscapes dataset.
This is my training configuration:

python cityscapes_train.py --gpus "3,4" --data_dir ~/data/Cityscape_2017/ --dataset cityscapes --train_type ontrainval --train_data_list ~/data/Cityscape_2017/cityscapes_trainval_list.txt --max_epochs 350 --cuda True --scaleIn 1 --batch_size 4

The code ran and printed:
=====> use gpu id: '3,4'
====> Random Seed: 457
=====> current architeture: CGNet
=====> computing network parameters
the number of params: 0.50 M
the number of parameters: 496306
data['classWeights']: [ 1.4705521 9.505282 10.492059 10.492059 10.492059 10.492059
10.492059 10.492059 10.492059 10.492059 10.492059 10.492059
10.492059 10.492059 10.492059 10.492059 10.492059 10.492059
5.131664 ]
=====> Dataset statistics
mean and std: [72.3924 82.90902 73.158325] [45.319206 46.15292 44.91484 ]
torch.cuda.device_count()= 2
Got the GPU count
length of dataset is : 3475
length of dataset: 500
=====> no checkpoint found at './checkpoint/cityscapes/CGNet_M3N21bs16gpu2_ontrainval/model_1.pth'
=====> beginning training
=====> the number of iterations per epoch: 868
torch.Size([4, 3, 680, 680])
torch.Size([4, 680, 680])
/home/nithish/my_install/miniconda3/envs/CGNet/lib/python3.6/site-packages/torch/nn/functional.py:2351: UserWarning: nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.
warnings.warn("nn.functional.upsample is deprecated. Use nn.functional.interpolate instead.")
torch.Size([4, 19, 680, 680])

/opt/conda/conda-bld/pytorch_1544174967633/work/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [10,0,0], thread: [223,0,0] Assertion t >= 0 && t < n_classes failed.
Traceback (most recent call last):
File "cityscapes_train.py", line 291, in <module>
train_model(args)
File "cityscapes_train.py", line 228, in train_model
lossTr, per_class_iu_tr, mIOU_tr, lr = train(args, trainLoader, model, criteria, optimizer, epoch)
File "cityscapes_train.py", line 100, in train
loss.backward()
File "/home/nithish/my_install/miniconda3/envs/CGNet/lib/python3.6/site-packages/torch/tensor.py", line 102, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/nithish/my_install/miniconda3/envs/CGNet/lib/python3.6/site-packages/torch/autograd/__init__.py", line 90, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR

commented

@Karna829 What is your cuDNN version? We use CUDA 8.0 and cuDNN v7 on our system.

commented

I have CUDA 9.0 without cuDNN. I also tried CUDA 10 with cuDNN v7 but still get the same error. I will try the CUDA and cuDNN versions you mentioned and see if that works. I also found some posts saying that the label values may still be 255 rather than zeros and ones; I will check this too. I also read that some CUDA libraries have changed, which may be why I am hitting this error. Could you please check on your end as well?

commented

@Karna829 Note that the ground truth is converted to trainID (not labelID) by our Python script.
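
For readers hitting the same assertion, the labelID-to-trainID conversion can be sketched as follows. This is not the repository's own script; it is a minimal illustration that assumes the standard table published in the official cityscapesScripts `labels.py`. The names `LABEL_ID_TO_TRAIN_ID`, `IGNORE_LABEL`, and `to_train_ids` are hypothetical, and the function works on plain nested lists rather than image arrays to stay dependency-free.

```python
# Assumed mapping, taken from the public cityscapesScripts label table.
# Any labelID not listed here has no training class and is mapped to 255.
LABEL_ID_TO_TRAIN_ID = {
    7: 0, 8: 1, 11: 2, 12: 3, 13: 4, 17: 5, 19: 6, 20: 7, 21: 8, 22: 9,
    23: 10, 24: 11, 25: 12, 26: 13, 27: 14, 28: 15, 31: 16, 32: 17, 33: 18,
}
IGNORE_LABEL = 255


def to_train_ids(label_rows):
    """Convert a 2-D grid (list of rows) of labelIDs to trainIDs."""
    return [
        [LABEL_ID_TO_TRAIN_ID.get(int(v), IGNORE_LABEL) for v in row]
        for row in label_rows
    ]


# road (7) -> 0, car (26) -> 13, unlabeled (0) -> 255, bicycle (33) -> 18
print(to_train_ids([[7, 26], [0, 33]]))  # → [[0, 13], [255, 18]]
```

With a real ground-truth image, the same lookup would normally be applied to the loaded array (for example through a 256-entry lookup table) before the file is written back to disk.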

commented

@Karna829 trainID: 0,1,...,18, ignore_label=255
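
The assertion `t >= 0 && t < n_classes` in the log fires when a target pixel falls outside this range. A quick way to confirm that before training is to list the label values that are neither in 0..18 nor equal to the ignore label. The helper below is a hypothetical sketch, not part of the repository; `find_invalid_labels` and its parameters are made-up names, and it assumes the loss is configured to skip pixels labelled 255.

```python
def find_invalid_labels(values, num_classes=19, ignore_label=255):
    """Return the sorted label values that fall outside the allowed set.

    `values` is any iterable of integer pixel labels, for example the
    unique values of a loaded ground-truth map.
    """
    valid = set(range(num_classes)) | {ignore_label}
    return sorted(set(int(v) for v in values) - valid)


# trainID-encoded map: only 0..18 and 255 appear, so nothing is flagged.
print(find_invalid_labels([0, 1, 7, 18, 255]))  # → []
# labelID-encoded map: raw IDs such as 26 or 33 are out of range.
print(find_invalid_labels([0, 7, 26, 33]))      # → [26, 33]
```

If the second case shows up for your data, the ground truth is most likely still encoded as labelIDs and needs converting to trainIDs first.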

commented

Thank you, I figured it out. The issue was in my Cityscapes data: I was passing more labels than I should. Earlier I thought the labelIDs had 19 classes, but I was wrong. I have changed it to 19 classes and am now able to train.

Hi, I have the same problem, can you tell me how you solved it?