Cannot get Image Segmentation training to work with custom dataset

Question

mattyhatch opened this issue 2 years ago

I can't get training to run all the way through with my custom dataset. Depending on the batch size and input size, it gets through a couple of epochs of data (my best run got through around 300 of the 557 training photos) and then hits an error.

Which batch sizes get through the most iterations is also unintuitive. You would expect a smaller batch size to be easier to run, but with a batch size of 1 (which should be the least compute-intensive) it fails on the first iteration, while with a batch size of 20 it trains until photo 310. One of the comments in this thread explains why that can happen: https://discuss.pytorch.org/t/runtimeerror-cudnn-error-cudnn-status-execution-failed-when-calling-backward/39022. The error code also differs depending on the batch size.

When I looked up these error codes, most of the suggested solutions were to change the batch size or get more computing power. But when I run the model with Task Manager open, the GPU is nowhere near maxed out in either memory or compute, so that seems to be ruled out, and I have tried every batch size from 1 to about 24 to no avail.

I also tried limiting the number of iterations to 250 with a batch size that gets past that many iterations, so that the trained model would be saved. When I run that saved model in validation mode, it gets an accuracy of 0.0000 and just doesn't work. That could be down to my validation setup, since this was the first time I have tried to validate a model and I still need to investigate, but given the errors in training I am assuming those are the cause.

Have you ever run into these kinds of errors and found a solution? Here are the different errors I was getting:

List of errors depending on batch size:

Batch size 14, 128x128 on iter 285:

2023-02-24 14:38:50 [INFO] 285
Traceback (most recent call last):
File "train.py", line 173, in <module>
main()
File "train.py", line 123, in main
loss.backward()
File "", line 2, in backward
File "C:\Users\matty\AppData\Local\Programs\Python\Python38\lib\site-packages\paddle\fluid\wrapped_decorator.py", line 25, in impl
return wrapped_func(*args, **kwargs)
File "C:\Users\matty\AppData\Local\Programs\Python\Python38\lib\site-packages\paddle\fluid\framework.py", line 225, in impl
return func(*args, **kwargs)
File "C:\Users\matty\AppData\Local\Programs\Python\Python38\lib\site-packages\paddle\fluid\dygraph\varbase_patch_methods.py", line 235, in backward
core.dygraph_run_backward([self], [grad_tensor], retain_graph,
OSError: (External) Cublas error, CUBLAS_STATUS_EXECUTION_FAILED. The GPU program failed to execute. This is often caused by a launch failure of the kernel on the GPU, which can be caused by multiple reasons. (at C:\home\workspace\Paddle_release\paddle/fluid/operators/math/blas_impl.cu.h:40)

Batch size 12, 128x128 on iter 47:

2023-02-24 14:42:05 [INFO] 47
Traceback (most recent call last):
File "train.py", line 173, in <module>
main()
File "train.py", line 123, in main
loss.backward()
File "", line 2, in backward
File "C:\Users\matty\AppData\Local\Programs\Python\Python38\lib\site-packages\paddle\fluid\wrapped_decorator.py", line 25, in impl
return wrapped_func(*args, **kwargs)
File "C:\Users\matty\AppData\Local\Programs\Python\Python38\lib\site-packages\paddle\fluid\framework.py", line 225, in impl
return func(*args, **kwargs)
File "C:\Users\matty\AppData\Local\Programs\Python\Python38\lib\site-packages\paddle\fluid\dygraph\varbase_patch_methods.py", line 235, in backward
core.dygraph_run_backward([self], [grad_tensor], retain_graph,
OSError: (External) Cublas error, CUBLAS_STATUS_INTERNAL_ERROR. An internal cuBLAS operation failed. This error is usually caused by a cudaMemcpyAsync() failure. (at C:\home\workspace\Paddle_release\paddle/fluid/operators/math/blas_impl.cu.h:35)

Batch size 8, 128x128 on iter 3:

2023-02-24 14:43:43 [INFO] 3
Traceback (most recent call last):
File "train.py", line 173, in <module>
main()
File "train.py", line 123, in main
loss.backward()
File "", line 2, in backward
File "C:\Users\matty\AppData\Local\Programs\Python\Python38\lib\site-packages\paddle\fluid\wrapped_decorator.py", line 25, in impl
return wrapped_func(*args, **kwargs)
File "C:\Users\matty\AppData\Local\Programs\Python\Python38\lib\site-packages\paddle\fluid\framework.py", line 225, in impl
return func(*args, **kwargs)
File "C:\Users\matty\AppData\Local\Programs\Python\Python38\lib\site-packages\paddle\fluid\dygraph\varbase_patch_methods.py", line 235, in backward
core.dygraph_run_backward([self], [grad_tensor], retain_graph,
OSError: (External) Cublas error, CUBLAS_STATUS_INTERNAL_ERROR. An internal cuBLAS operation failed. This error is usually caused by a cudaMemcpyAsync() failure. (at C:\home\workspace\Paddle_release\paddle/fluid/operators/math/blas_impl.cu.h:35)

Batch size 16, 128x128 on iter 11:

2023-02-24 14:45:05 [INFO] 11
Traceback (most recent call last):
File "train.py", line 173, in <module>
main()
File "train.py", line 123, in main
loss.backward()
File "", line 2, in backward
File "C:\Users\matty\AppData\Local\Programs\Python\Python38\lib\site-packages\paddle\fluid\wrapped_decorator.py", line 25, in impl
return wrapped_func(*args, **kwargs)
File "C:\Users\matty\AppData\Local\Programs\Python\Python38\lib\site-packages\paddle\fluid\framework.py", line 225, in impl
return func(*args, **kwargs)
File "C:\Users\matty\AppData\Local\Programs\Python\Python38\lib\site-packages\paddle\fluid\dygraph\varbase_patch_methods.py", line 235, in backward
core.dygraph_run_backward([self], [grad_tensor], retain_graph,
OSError: (External) Cublas error, CUBLAS_STATUS_EXECUTION_FAILED. The GPU program failed to execute. This is often caused by a launch failure of the kernel on the GPU, which can be caused by multiple reasons. (at C:\home\workspace\Paddle_release\paddle/fluid/operators/math/blas_impl.cu.h:35)

Batch size 1, 128x128 on iter 1:

2023-02-24 15:11:25 [INFO] 1
Traceback (most recent call last):
File "train.py", line 173, in <module>
main()
File "train.py", line 123, in main
loss.backward()
File "", line 2, in backward
File "C:\Users\matty\AppData\Local\Programs\Python\Python38\lib\site-packages\paddle\fluid\wrapped_decorator.py", line 25, in impl
return wrapped_func(*args, **kwargs)
File "C:\Users\matty\AppData\Local\Programs\Python\Python38\lib\site-packages\paddle\fluid\framework.py", line 225, in impl
return func(*args, **kwargs)
File "C:\Users\matty\AppData\Local\Programs\Python\Python38\lib\site-packages\paddle\fluid\dygraph\varbase_patch_methods.py", line 235, in backward
core.dygraph_run_backward([self], [grad_tensor], retain_graph,
OSError: (External) Cudnn error, CUDNN_STATUS_EXECUTION_FAILED (at C:/home/workspace/Paddle_release/paddle/fluid/operators/conv_cudnn_op.cu:790)

Batch size 20, 128x128 on iter 310:

2023-02-24 15:15:19 [INFO] 310
Traceback (most recent call last):
File "train.py", line 173, in <module>
main()
File "train.py", line 123, in main
loss.backward()
File "", line 2, in backward
File "C:\Users\matty\AppData\Local\Programs\Python\Python38\lib\site-packages\paddle\fluid\wrapped_decorator.py", line 25, in impl
return wrapped_func(*args, **kwargs)
File "C:\Users\matty\AppData\Local\Programs\Python\Python38\lib\site-packages\paddle\fluid\framework.py", line 225, in impl
return func(*args, **kwargs)
File "C:\Users\matty\AppData\Local\Programs\Python\Python38\lib\site-packages\paddle\fluid\dygraph\varbase_patch_methods.py", line 235, in backward
core.dygraph_run_backward([self], [grad_tensor], retain_graph,
OSError: (External) Cuda error(700), an illegal memory access was encountered.
[Advise: Please search for the error code(700) on website( https://docs.nvidia.com/cuda/archive/10.0/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038 ) to get Nvidia's official solution about CUDA Error.] (at C:\home\workspace\Paddle_release\paddle\fluid\platform\gpu_info.cc:382)
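
All six runs fail inside loss.backward(), but the library that reports the failure changes (cuBLAS, cuDNN, raw CUDA error 700). That makes the data worth ruling out alongside the GPU. A common cause of illegal memory accesses in segmentation training generally is a label mask holding values the loss cannot index, for example a binary mask saved as 0/255 instead of 0/1. Nothing in the logs above confirms that this is what is happening here, so treat it as an assumption. Below is a minimal sketch of such a check, assuming the masks are single-channel images whose pixel values are class indices. The function name, `num_classes`, and `ignore_index` are hypothetical and not taken from train.py, and loading the mask into a flat sequence of integers is left to whatever image library is already in use.

```python
from collections import Counter
from typing import Dict, Iterable


def out_of_range_labels(pixels: Iterable[int], num_classes: int,
                        ignore_index: int = 255) -> Dict[int, int]:
    """Return {label_value: pixel_count} for every value outside the valid set.

    A value is treated as valid if it lies in [0, num_classes) or equals
    ignore_index. Both parameters are assumptions about the training config.
    """
    counts = Counter(int(p) for p in pixels)
    return {
        value: n
        for value, n in sorted(counts.items())
        if not (0 <= value < num_classes or value == ignore_index)
    }


# Hypothetical flattened mask for a 2-class problem: 128 is not a valid class.
flat_mask = [0, 0, 128, 128, 128, 255, 1]
bad = out_of_range_labels(flat_mask, num_classes=2)
print(bad)  # {128: 3}
```

If every mask in the dataset returns an empty result, the labels are probably not the problem. Any non-empty result would be worth fixing before trying further batch sizes.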