When training on my own dataset, an error occurs after I change num_classes to match my categories; leaving it at the default also reports an error.
hx358031364 opened this issue
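The assertion in the log below reports an index that is out of range for a scatter/gather kernel, which in a classification setup usually means some target label falls outside [0, num_classes). A minimal sketch for checking that, assuming a standard PyTorch dataloader (`loader` and `num_classes` are placeholders, not names from the training script):

```python
import torch

def check_label_range(loader, num_classes):
    """Scan the dataloader and report the min/max label seen."""
    lo, hi = None, None
    for _, target in loader:
        t = torch.as_tensor(target)
        lo = t.min().item() if lo is None else min(lo, t.min().item())
        hi = t.max().item() if hi is None else max(hi, t.max().item())
    print(f"labels span [{lo}, {hi}], model head expects 0..{num_classes - 1}")
    assert lo is not None and lo >= 0 and hi < num_classes, \
        "labels out of range for num_classes"
```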
AMP not enabled. Training in float32.
Using native Torch DistributedDataParallel.
Scheduled epochs: 310
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [15,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
Traceback (most recent call last):
File "main.py", line 948, in
main()
File "main.py", line 664, in main
optimizers=optimizers)
File "main.py", line 782, in train_one_epoch
output = model(input)
File "/root/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/root/miniconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 610, in forward
self._sync_params()
File "/root/miniconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 1048, in _sync_params
authoritative_rank,
File "/root/miniconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 979, in _distributed_broadcast_coalesced
self.process_group, tensors, buffer_size, authoritative_rank
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL error in: /pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:136, unhandled cuda error, NCCL version 2.7.8
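Device-side asserts are reported asynchronously, so the traceback above lands in DDP's parameter sync rather than at the op that actually failed; re-running with CUDA_LAUNCH_BLOCKING=1 makes the failing call appear in the traceback. The same ScatterGatherKernel assertion can be reproduced in isolation by one-hot encoding a label that is >= num_classes (a hypothetical repro, not code from this repository):

```python
import torch

num_classes = 10                       # placeholder for the configured value
targets = torch.tensor([1, 3, 9, 12])  # 12 is out of range for 10 classes

# One-hot encoding via scatter_, as mixup / label-smoothing code commonly does.
one_hot = torch.zeros(len(targets), num_classes)
one_hot.scatter_(1, targets.unsqueeze(1), 1.0)
# On CPU this raises: RuntimeError: index 12 is out of bounds for dimension 1 with size 10
# On CUDA tensors, the same scatter triggers the device-side assert shown above.
```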