SurajDonthi / Multi-Camera-Person-Re-Identification

State-of-the-art model for person re-identification in Multi-camera Multi-Target Tracking. Benchmarked on Market-1501 and DukeMTMC-reID datasets.

training error (training on Market-1501)

toshalpatel opened this issue

Hi, I am facing the following error while training. Do you know what might be causing it?
I have checked all the package versions. I also think it is triggered by one particular batch passed during training.

/opt/conda/conda-bld/pytorch_1595629403081/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [15,0,0] Assertion `t >= 0 && t < n_classes` failed.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1595629403081/work/aten/src/THCUNN/generic/ClassNLLCriterion.cu line=118 error=710 : device-side assert triggered
Traceback (most recent call last):
  File "mtmct_reid/main.py", line 56, in <module>
    main(args)
  File "mtmct_reid/main.py", line 42, in main
    trainer.fit(model, data_module) 
  File "/NAS/home01/toshal/anaconda3/envs/myenv/lib/python3.7/site-packages/pytorch_lightning/trainer/states.py", line 48, in wrapped_fn
    result = fn(self, *args, **kwargs)
  File "/NAS/home01/toshal/anaconda3/envs/myenv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1073, in fit
    results = self.accelerator_backend.train(model)
  File "/NAS/home01/toshal/anaconda3/envs/myenv/lib/python3.7/site-packages/pytorch_lightning/accelerators/gpu_backend.py", line 51, in train
    results = self.trainer.run_pretrain_routine(model)
  File "/NAS/home01/toshal/anaconda3/envs/myenv/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1239, in run_pretrain_routine
    self.train()
  File "/NAS/home01/toshal/anaconda3/envs/myenv/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 394, in train
    self.run_training_epoch()
  File "/NAS/home01/toshal/anaconda3/envs/myenv/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 516, in run_training_epoch
    self.run_evaluation(test_mode=False)
  File "/NAS/home01/toshal/anaconda3/envs/myenv/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 582, in run_evaluation
    eval_results = self._evaluate(self.model, dataloaders, max_batches, test_mode)
  File "/NAS/home01/toshal/anaconda3/envs/myenv/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 333, in _evaluate
    output = self.evaluation_forward(model, batch, batch_idx, dataloader_idx, test_mode)
  File "/NAS/home01/toshal/anaconda3/envs/myenv/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 687, in evaluation_forward
    output = model.validation_step(*args)
  File "/NAS/project01/rzimmerm_substitles/rzm_s_toshal/MTMCT-Person-Re-Identification/mtmct_reid/engine.py", line 162, in validation_step
    loss, acc = self.eval_shared_step(batch, batch_idx, dataloader_idx)
  File "/NAS/project01/rzimmerm_substitles/rzm_s_toshal/MTMCT-Person-Re-Identification/mtmct_reid/engine.py", line 113, in eval_shared_step
    loss, acc = self.shared_step(batch, batch_idx)
  File "/NAS/project01/rzimmerm_substitles/rzm_s_toshal/MTMCT-Person-Re-Identification/mtmct_reid/engine.py", line 78, in shared_step
    loss += self.criterion(part, y)
  File "/NAS/home01/toshal/anaconda3/envs/myenv/lib/python3.7/site-packages/torch/nn/functional.py", line 2422, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/NAS/home01/toshal/anaconda3/envs/myenv/lib/python3.7/site-packages/torch/nn/functional.py", line 2218, in nll_loss
    ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: cuda runtime error (710) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1595629403081/work/aten/src/THCUNN/generic/ClassNLLCriterion.cu:118
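
The assertion `t >= 0 && t < n_classes` in the log above says that at least one target label passed to the loss falls outside the range the classifier was built for. A minimal sketch of a check that could be run on a batch's labels before the loss is computed follows; `find_invalid_targets` is a hypothetical helper written for illustration, not part of the repository.

```python
from typing import Iterable, List, Tuple


def find_invalid_targets(targets: Iterable[int], n_classes: int) -> List[Tuple[int, int]]:
    """Return (position, label) pairs that violate 0 <= label < n_classes."""
    return [
        (i, int(t))
        for i, t in enumerate(targets)
        if not 0 <= int(t) < n_classes
    ]


# Example batch: one negative label and one label equal to n_classes.
batch_labels = [0, 17, -1, 750, 751]
bad = find_invalid_targets(batch_labels, n_classes=751)
print(bad)  # [(2, -1), (4, 751)]
```

With a label tensor `y`, the same check can be fed `y.tolist()`. Running the failing step on CPU, or with the environment variable `CUDA_LAUNCH_BLOCKING=1`, usually yields a more precise error location than the asynchronous device-side assert.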

You can work around it by replacing num_classes with num_classes+1.
I guess the labels span [0, 751], i.e. 752 classes in total. Did you face any similar issue while training, @SurajDonthi?

@toshalpatel Actually, there are only 751 classes for the Market data. While parsing the data, all images are scanned to find the number of classes and to assign new class indexes, so the process is dynamic.
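
The dynamic label assignment described here can be sketched roughly as follows. This is an illustration only, not the repository's actual parsing code: it assumes Market-1501 style filenames that begin with the person ID followed by an underscore, and the helper names `person_id_from_filename` and `build_class_index` are hypothetical.

```python
from typing import Dict, Iterable


def person_id_from_filename(filename: str) -> int:
    """Read the leading person ID, e.g. '0002_c1s1_000451_03.jpg' -> 2."""
    return int(filename.split("_")[0])


def build_class_index(filenames: Iterable[str]) -> Dict[int, int]:
    """Map each distinct person ID to a contiguous class index starting at 0."""
    person_ids = sorted({person_id_from_filename(f) for f in filenames})
    return {pid: idx for idx, pid in enumerate(person_ids)}


# Made-up filenames in the assumed format.
train_files = [
    "0002_c1s1_000451_03.jpg",
    "0002_c2s1_000301_01.jpg",
    "0007_c3s3_077419_03.jpg",
    "1500_c6s3_086567_01.jpg",
]
class_index = build_class_index(train_files)
num_classes = len(class_index)
labels = [class_index[person_id_from_filename(f)] for f in train_files]
print(num_classes, labels)  # 3 [0, 0, 1, 2]
```

A mapping built this way only covers the IDs present in the scanned folder, so any ID that appears in another split has no valid class index unless it is handled separately.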

Not sure why the issue occurred; I haven't run into it myself.

@toshalpatel The above error occurs because there is an extra label, -1, in the bounding_box_test data. These labels are left for the model to guess, so comparing them with predictions in the loss function does not make sense. Currently, my code doesn't account for this. However, I'll be adding a much more robust testing function very soon.
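
Until such a testing function exists, one possible workaround is to drop the -1 entries before they reach the loss. The sketch below assumes Market-1501 style filenames whose first underscore-separated field is the person ID; `drop_unlabeled` is a hypothetical helper, not something the repository provides.

```python
from typing import List

UNLABELED_ID = -1  # the extra label reported in bounding_box_test


def drop_unlabeled(filenames: List[str]) -> List[str]:
    """Keep only files whose leading person ID is not -1."""
    return [f for f in filenames if int(f.split("_")[0]) != UNLABELED_ID]


# Made-up filenames in the assumed format.
test_files = [
    "-1_c1s1_000401_03.jpg",
    "0001_c1s1_001051_00.jpg",
    "0003_c4s6_015641_02.jpg",
]
kept = drop_unlabeled(test_files)
print(kept)  # ['0001_c1s1_001051_00.jpg', '0003_c4s6_015641_02.jpg']
```

Another option, if the raw -1 value reaches the criterion unchanged, might be to construct the loss with `ignore_index=-1`, which `torch.nn.CrossEntropyLoss` accepts to skip targets with that value. Whether that fits depends on how the labels are remapped during parsing.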

So setting the number of classes to 751 is the right thing to do; it breaks the current testing code but not the training code. For the time being, you can generate your model by training without validation/testing. If you want to test, feel free to write a function of your own in the meantime.