training error (training on Market-1501)
toshalpatel opened this issue · comments
Hi, I am facing the following error while training. Do you know what this might be about?
I have checked all the package versions. Also, I think the error occurs on one particular batch during training.
/opt/conda/conda-bld/pytorch_1595629403081/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [15,0,0] Assertion `t >= 0 && t < n_classes` failed.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1595629403081/work/aten/src/THCUNN/generic/ClassNLLCriterion.cu line=118 error=710 : device-side assert triggered
Traceback (most recent call last):
File "mtmct_reid/main.py", line 56, in <module>
main(args)
File "mtmct_reid/main.py", line 42, in main
trainer.fit(model, data_module)
File "/NAS/home01/toshal/anaconda3/envs/myenv/lib/python3.7/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
result = fn(self, *args, **kwargs)
File "/NAS/home01/toshal/anaconda3/envs/myenv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1073, in fit
results = self.accelerator_backend.train(model)
File "/NAS/home01/toshal/anaconda3/envs/myenv/lib/python3.7/site-packages/pytorch_lightning/accelerators/gpu_backend.py", line 51, in train
results = self.trainer.run_pretrain_routine(model)
File "/NAS/home01/toshal/anaconda3/envs/myenv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1239, in run_pretrain_routine
self.train()
File "/NAS/home01/toshal/anaconda3/envs/myenv/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 394, in train
self.run_training_epoch()
File "/NAS/home01/toshal/anaconda3/envs/myenv/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 516, in run_training_epoch
self.run_evaluation(test_mode=False)
File "/NAS/home01/toshal/anaconda3/envs/myenv/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 582, in run_evaluation
eval_results = self._evaluate(self.model, dataloaders, max_batches, test_mode)
File "/NAS/home01/toshal/anaconda3/envs/myenv/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 333, in _evaluate
output = self.evaluation_forward(model, batch, batch_idx, dataloader_idx, test_mode)
File "/NAS/home01/toshal/anaconda3/envs/myenv/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 687, in evaluation_forward
output = model.validation_step(*args)
File "/NAS/project01/rzimmerm_substitles/rzm_s_toshal/MTMCT-Person-Re-Identification/mtmct_reid/engine.py", line 162, in validation_step
loss, acc = self.eval_shared_step(batch, batch_idx, dataloader_idx)
File "/NAS/project01/rzimmerm_substitles/rzm_s_toshal/MTMCT-Person-Re-Identification/mtmct_reid/engine.py", line 113, in eval_shared_step
loss, acc = self.shared_step(batch, batch_idx)
File "/NAS/project01/rzimmerm_substitles/rzm_s_toshal/MTMCT-Person-Re-Identification/mtmct_reid/engine.py", line 78, in shared_step
loss += self.criterion(part, y)
File "/NAS/home01/toshal/anaconda3/envs/myenv/lib/python3.7/site-packages/torch/nn/functional.py", line 2422, in cross_entropy
return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
File "/NAS/home01/toshal/anaconda3/envs/myenv/lib/python3.7/site-packages/torch/nn/functional.py", line 2218, in nll_loss
ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: cuda runtime error (710) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1595629403081/work/aten/src/THCUNN/generic/ClassNLLCriterion.cu:118
You can solve it by replacing `num_classes` with `num_classes + 1`.
I guess the labels span [0, 751] (752 classes in total). I am wondering if you faced any similar issues while training @SurajDonthi
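For context, the assertion `t >= 0 && t < n_classes` can be reproduced in isolation (a toy sketch, not the project's code). On CPU an out-of-range target raises a Python exception immediately; on GPU the same check surfaces as the device-side assert in the traceback above:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 751)  # 4 samples, 751 classes

# Valid targets in [0, 750] work fine.
loss = F.cross_entropy(logits, torch.tensor([0, 1, 2, 750]))

# A target of -1 (or >= 751) violates `t >= 0 && t < n_classes`.
# On CPU this raises an IndexError; on CUDA it shows up as the
# opaque "device-side assert triggered" error instead.
try:
    F.cross_entropy(logits, torch.tensor([0, 1, 2, -1]))
except (IndexError, RuntimeError) as e:
    print("out-of-range target:", e)
```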
@toshalpatel Actually there are only 751 classes in the Market data. While parsing the data, all images are scanned to determine the number of classes and to assign new class indexes, so the process is dynamic.
I'm not sure why the issue occurred; I haven't faced it myself.
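That dynamic re-indexing can be sketched roughly like this (the helper name and filenames are illustrative, not the repo's actual code): raw Market-1501 person IDs are sparse, so they get remapped to contiguous indexes `0..num_classes-1` during parsing.

```python
import os

def build_class_index(image_paths):
    """Map raw person IDs (the leading field of filenames such as
    '0002_c1s1_000451_03.jpg') to contiguous class indexes.
    Hypothetical sketch of the dynamic re-indexing described above."""
    raw_ids = sorted({os.path.basename(p).split("_")[0] for p in image_paths})
    return {pid: idx for idx, pid in enumerate(raw_ids)}

# Toy filenames; the real Market-1501 train split yields 751 identities.
paths = [
    "0002_c1s1_000451_03.jpg",
    "0007_c2s3_000100_01.jpg",
    "0002_c1s1_000500_02.jpg",
]
index = build_class_index(paths)
num_classes = len(index)  # 2 in this toy example
```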
@toshalpatel The above error occurs because there is an extra label, `-1`, in the `bounding_box_test` data. These labels are deliberately left for the model to guess, so comparing them against a true label in the loss function does not make sense. Currently, my code doesn't account for this. However, I'll be adding a much more robust testing function very soon.
So setting the number of classes to 751 is the right thing to do (this will break the current testing code, not the training code). For the time being you can train your model without validation/testing; if you do want to test, feel free to write a testing function of your own.
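As a stopgap until a robust testing function lands, one option (a sketch under my assumptions, not the repo's code) is to exclude the `-1` distractor labels from the loss. PyTorch's `F.cross_entropy` already supports this via its `ignore_index` parameter, which drops those targets from the loss average:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(5, 751)
targets = torch.tensor([3, -1, 10, -1, 750])  # -1 = Market-1501 junk labels

# Option 1: built-in ignore_index skips -1 targets entirely.
loss = F.cross_entropy(logits, targets, ignore_index=-1)

# Option 2: equivalent explicit masking, if you also need the mask
# elsewhere (e.g. for accuracy computation).
keep = targets >= 0
loss_masked = F.cross_entropy(logits[keep], targets[keep])
```

Both variants give the same mean loss over the valid samples, so validation no longer trips the `t >= 0` assert.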