training error (training on Market-1501)
toshalpatel opened this issue · comments
Hi, I am facing the following error while training. Do you know what this might be about?
I have checked all the package versions. Also, I think the error occurs on one particular batch during training.
/opt/conda/conda-bld/pytorch_1595629403081/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [15,0,0] Assertion `t >= 0 && t < n_classes` failed.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1595629403081/work/aten/src/THCUNN/generic/ClassNLLCriterion.cu line=118 error=710 : device-side assert triggered
Traceback (most recent call last):
File "mtmct_reid/main.py", line 56, in <module>
main(args)
File "mtmct_reid/main.py", line 42, in main
trainer.fit(model, data_module)
File "/NAS/home01/toshal/anaconda3/envs/myenv/lib/python3.7/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
result = fn(self, *args, **kwargs)
File "/NAS/home01/toshal/anaconda3/envs/myenv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1073, in fit
results = self.accelerator_backend.train(model)
File "/NAS/home01/toshal/anaconda3/envs/myenv/lib/python3.7/site-packages/pytorch_lightning/accelerators/gpu_backend.py", line 51, in train
results = self.trainer.run_pretrain_routine(model)
File "/NAS/home01/toshal/anaconda3/envs/myenv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1239, in run_pretrain_routine
self.train()
File "/NAS/home01/toshal/anaconda3/envs/myenv/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 394, in train
self.run_training_epoch()
File "/NAS/home01/toshal/anaconda3/envs/myenv/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 516, in run_training_epoch
self.run_evaluation(test_mode=False)
File "/NAS/home01/toshal/anaconda3/envs/myenv/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 582, in run_evaluation
eval_results = self._evaluate(self.model, dataloaders, max_batches, test_mode)
File "/NAS/home01/toshal/anaconda3/envs/myenv/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 333, in _evaluate
output = self.evaluation_forward(model, batch, batch_idx, dataloader_idx, test_mode)
File "/NAS/home01/toshal/anaconda3/envs/myenv/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 687, in evaluation_forward
output = model.validation_step(*args)
File "/NAS/project01/rzimmerm_substitles/rzm_s_toshal/MTMCT-Person-Re-Identification/mtmct_reid/engine.py", line 162, in validation_step
loss, acc = self.eval_shared_step(batch, batch_idx, dataloader_idx)
File "/NAS/project01/rzimmerm_substitles/rzm_s_toshal/MTMCT-Person-Re-Identification/mtmct_reid/engine.py", line 113, in eval_shared_step
loss, acc = self.shared_step(batch, batch_idx)
File "/NAS/project01/rzimmerm_substitles/rzm_s_toshal/MTMCT-Person-Re-Identification/mtmct_reid/engine.py", line 78, in shared_step
loss += self.criterion(part, y)
File "/NAS/home01/toshal/anaconda3/envs/myenv/lib/python3.7/site-packages/torch/nn/functional.py", line 2422, in cross_entropy
return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
File "/NAS/home01/toshal/anaconda3/envs/myenv/lib/python3.7/site-packages/torch/nn/functional.py", line 2218, in nll_loss
ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: cuda runtime error (710) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1595629403081/work/aten/src/THCUNN/generic/ClassNLLCriterion.cu:118
You can solve it by replacing `num_classes` with `num_classes + 1`.
I guess the labels span [0, 751] (752 classes in total). I am wondering if you faced any similar issues while training @SurajDonthi
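For context, the assertion `t >= 0 && t < n_classes` can be reproduced in isolation (a toy sketch, not the project's code). On CPU an out-of-range target raises a Python exception immediately; on GPU the same check surfaces as the device-side assert in the traceback above:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 751)  # 4 samples, 751 classes

# Valid targets in [0, 750] work fine.
loss = F.cross_entropy(logits, torch.tensor([0, 1, 2, 750]))

# A target of -1 (or >= 751) violates `t >= 0 && t < n_classes`.
# On CPU this raises an IndexError; on CUDA it shows up as the
# opaque "device-side assert triggered" error instead.
try:
    F.cross_entropy(logits, torch.tensor([0, 1, 2, -1]))
except (IndexError, RuntimeError) as e:
    print("out-of-range target:", e)
```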
@toshalpatel Actually there are only 751 classes in the Market data. While parsing the data, all images are scanned to determine the number of classes and to assign new class indexes, so the process is dynamic.
I'm not sure why the issue occurred; I haven't faced it myself.
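That dynamic re-indexing can be sketched roughly like this (the helper name and filenames are illustrative, not the repo's actual code): raw Market-1501 person IDs are sparse, so they get remapped to contiguous indexes `0..num_classes-1` during parsing.

```python
import os

def build_class_index(image_paths):
    """Map raw person IDs (the leading field of filenames such as
    '0002_c1s1_000451_03.jpg') to contiguous class indexes.
    Hypothetical sketch of the dynamic re-indexing described above."""
    raw_ids = sorted({os.path.basename(p).split("_")[0] for p in image_paths})
    return {pid: idx for idx, pid in enumerate(raw_ids)}

# Toy filenames; the real Market-1501 train split yields 751 identities.
paths = [
    "0002_c1s1_000451_03.jpg",
    "0007_c2s3_000100_01.jpg",
    "0002_c1s1_000500_02.jpg",
]
index = build_class_index(paths)
num_classes = len(index)  # 2 in this toy example
```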
@toshalpatel The above error occurs because there is an extra label, `-1`, in the `bounding_box_test` data. These labels are deliberately left for the model to guess, so comparing them against a true label in the loss function does not make sense. Currently, my code doesn't account for this. However, I'll be adding a much more robust testing function very soon.
So setting the number of classes to 751 is the right thing to do (this will break the current testing code, not the training code). For the time being you can train your model without validation/testing; if you do want to test, feel free to write a testing function of your own.
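As a stopgap until a robust testing function lands, one option (a sketch under my assumptions, not the repo's code) is to exclude the `-1` distractor labels from the loss. PyTorch's `F.cross_entropy` already supports this via its `ignore_index` parameter, which drops those targets from the loss average:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(5, 751)
targets = torch.tensor([3, -1, 10, -1, 750])  # -1 = Market-1501 junk labels

# Option 1: built-in ignore_index skips -1 targets entirely.
loss = F.cross_entropy(logits, targets, ignore_index=-1)

# Option 2: equivalent explicit masking, if you also need the mask
# elsewhere (e.g. for accuracy computation).
keep = targets >= 0
loss_masked = F.cross_entropy(logits[keep], targets[keep])
```

Both variants give the same mean loss over the valid samples, so validation no longer trips the `t >= 0` assert.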