VlSomers / bpbreid

A strong baseline for body part-based person re-identification (check out our WACV23 paper)

A mismatch between the number of classes in the preds tensor and the num_classes parameter in the Accuracy

matrxsoftware opened this issue

I hope you are doing well. I am trying to train your BPBreID model on Market1501, and in the __init__() function I set the Accuracy metric as follows:

self.pred_accuracy = Accuracy(top_k=1, task='multiclass', num_classes=751)
However, I encountered the following error message:
=> Start training
pixels_cls_scores shape: torch.Size([98304, 6])
pixels_cls_score_targets shape: torch.Size([98304])
Traceback (most recent call last):
File "scripts/main.py", line 278, in
main()
File "scripts/main.py", line 183, in main
engine.run(**engine_run_kwargs(cfg))
File "/reid/bpbreid-main/torchreid/engine/engine.py", line 206, in run
open_layers=open_layers
File "/reid/bpbreid-main/torchreid/engine/engine.py", line 268, in train
loss, loss_summary = self.forward_backward(data)
File "/reid/bpbreid-main/torchreid/engine/image/part_based_engine.py", line 96, in forward_backward
bpa_weight=self.losses_weights[PIXELS]['ce'])
File "/reid/bpbreid-main/torchreid/engine/image/part_based_engine.py", line 127, in combine_losses
bpa_loss, bpa_loss_summary = self.body_part_attention_loss(pixels_cls_scores, pixels_cls_score_targets)
File "/anaconda3/envs/torchreid/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/reid/bpbreid-main/torchreid/losses/body_part_attention_loss.py", line 40, in forward
pixels_cls_loss, pixels_cls_accuracy = self.compute_pixels_cls_loss(pixels_cls_scores, targets)
File "/reid/bpbreid-main/torchreid/losses/body_part_attention_loss.py", line 54, in compute_pixels_cls_loss
accuracy = self.pred_accuracy(pixels_cls_scores, pixels_cls_score_targets)
File "/anaconda3/envs/torchreid/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/anaconda3/envs/torchreid/lib/python3.7/site-packages/torchmetrics/metric.py", line 236, in forward
self._forward_cache = self._forward_reduce_state_update(*args, **kwargs)
File "/anaconda3/envs/torchreid/lib/python3.7/site-packages/torchmetrics/metric.py", line 302, in _forward_reduce_state_update
self.update(*args, **kwargs)
File "/anaconda3/envs/torchreid/lib/python3.7/site-packages/torchmetrics/metric.py", line 390, in wrapped_func
update(*args, **kwargs)
File "/anaconda3/envs/torchreid/lib/python3.7/site-packages/torchmetrics/classification/stat_scores.py", line 316, in update
preds, target, self.num_classes, self.multidim_average, self.ignore_index
File "/anaconda3/envs/torchreid/lib/python3.7/site-packages/torchmetrics/functional/classification/stat_scores.py", line 272, in _multiclass_stat_scores_tensor_validation
"If preds have one dimension more than target, preds.shape[1] should be"
ValueError: If preds have one dimension more than target, preds.shape[1] should be equal to the number of classes.

So, could you please tell me how to overcome this mismatch between the number of classes in the preds tensor and the num_classes parameter of the Accuracy metric? Isn't num_classes for Market1501 equal to 751?
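For reference, here is a minimal standalone snippet that reproduces the same ValueError (the shapes match the printout above; everything else is made up for illustration):

import torch
from torchmetrics import Accuracy

pixels_cls_scores = torch.randn(98304, 6)                  # preds: [num_pixels, 6]
pixels_cls_score_targets = torch.randint(0, 6, (98304,))   # target: [num_pixels]

metric = Accuracy(top_k=1, task='multiclass', num_classes=751)
# raises the ValueError because preds.shape[1] (6) != num_classes (751)
metric(pixels_cls_scores, pixels_cls_score_targets)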

Many thanks,
GH

Hi, I see the stack trace is related to the BodyPartAttentionLoss (in body_part_attention_loss.py). In this loss, the classes/targets for the Accuracy metric are the body parts (5 in your case, 6 with the background) and not the 751 training identities from Market1501. A classifier is applied on top of each pixel in the spatial feature map output by the backbone, and each pixel is classified into one of the body parts (or background), producing the "pixels_cls_scores"; see the sketch below. Here your num_classes should be 6, not 751.
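To illustrate where those pixel scores come from, here is a small sketch (not the actual bpbreid code; the shapes and names are illustrative) of a 1x1 conv classifier applied to every pixel of the spatial feature map:

import torch
import torch.nn as nn

batch, channels, height, width = 32, 512, 64, 48   # hypothetical feature map size
num_parts = 5                                       # + 1 for the background class

feature_map = torch.randn(batch, channels, height, width)
pixel_classifier = nn.Conv2d(channels, num_parts + 1, kernel_size=1)

scores = pixel_classifier(feature_map)              # [32, 6, 64, 48]
# flatten to one row per pixel: 32 * 64 * 48 = 98304 pixels, matching your printout
pixels_cls_scores = scores.permute(0, 2, 3, 1).reshape(-1, num_parts + 1)
print(pixels_cls_scores.shape)                      # torch.Size([98304, 6])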

Hello Vladimir,

Thank you for your response...
I tried num_classes=6 before, but I got the same error every time!
Model complexity: params=39,847,494 flops=8,005,197,312
Building part_based-engine for image-reid
Starting experiment d96d184a-9106-4806-a121-b110b39fe7b1 with job id 114434068 and creation date 2023_05_06_16_48_08_48S
=> Start training
Traceback (most recent call last):
File "scripts/main.py", line 272, in
main()
File "scripts/main.py", line 183, in main
engine.run(**engine_run_kwargs(cfg))
File "/torchreid/engine/engine.py", line 206, in run
open_layers=open_layers
File "/torchreid/engine/engine.py", line 268, in train
loss, loss_summary = self.forward_backward(data)
File "/torchreid/engine/image/part_based_engine.py", line 95, in forward_backward
bpa_weight=self.losses_weights[PIXELS]['ce'])
File "/torchreid/engine/image/part_based_engine.py", line 110, in combine_losses
loss, loss_summary = self.GiLt(embeddings_dict, visibility_scores_dict, id_cls_scores_dict, pids)
File "/torchreid/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/torchreid/losses/GiLt_loss.py", line 67, in forward
visibility_scores_dict[key], pids)
File "/torchreid/losses/GiLt_loss.py", line 118, in compute_id_cls_loss
accuracy = self.pred_accuracy(id_cls_scores, pids)
File "/anaconda3/envs/torchreid/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/anaconda3/envs/torchreid/lib/python3.7/site-packages/torchmetrics/metric.py", line 236, in forward
self._forward_cache = self._forward_reduce_state_update(*args, **kwargs)
File "/anaconda3/envs/torchreid/lib/python3.7/site-packages/torchmetrics/metric.py", line 302, in _forward_reduce_state_update
self.update(*args, **kwargs)
File "/anaconda3/envs/torchreid/lib/python3.7/site-packages/torchmetrics/metric.py", line 390, in wrapped_func
update(*args, **kwargs)
File "/anaconda3/envs/torchreid/lib/python3.7/site-packages/torchmetrics/classification/stat_scores.py", line 316, in update
preds, target, self.num_classes, self.multidim_average, self.ignore_index
File "/anaconda3/envs/torchreid/lib/python3.7/site-packages/torchmetrics/functional/classification/stat_scores.py", line 272, in _multiclass_stat_scores_tensor_validation
"If preds have one dimension more than target, preds.shape[1] should be"
ValueError: If preds have one dimension more than target, preds.shape[1] should be equal to number of classes.

I also added print statements in different places to try to determine the cause of these errors:
Model complexity: params=39,847,494 flops=8,005,197,312
Building part_based-engine for image-reid
datamanager.num_train_pids: 751
datamanager.num_train_pids type: <class 'int'>
num_classes before build_engine: 751
num_classes type before build_engine: <class 'int'>
num_classes inside build_engine: 751
num_classes type inside build_engine: <class 'int'>
Traceback (most recent call last):
That's why I set num_classes=751 in the first place. However, I have run out of ideas!

I appreciate your help!

Thx,
GH

Hi, first of all, this is not the same error as before: this stack trace indicates the error occurs in GiLt_loss.py, while it occurred in body_part_attention_loss.py in the previous stack trace. For the Accuracy metric in GiLt_loss.py, you should use the number of training identities (i.e. 751), and for the Accuracy metric in body_part_attention_loss.py, you should use the number of body parts + 1 (i.e. 6).
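To make the distinction concrete, here is a small illustrative snippet (the tensors and shapes are made up; this is not the framework code):

import torch
from torchmetrics import Accuracy

# identity classification (GiLt_loss.py): one class per training identity
id_accuracy = Accuracy(top_k=1, task='multiclass', num_classes=751)
id_cls_scores = torch.randn(32, 751)                 # [batch, 751]
pids = torch.randint(0, 751, (32,))
print(id_accuracy(id_cls_scores, pids))

# pixel-wise part classification (body_part_attention_loss.py): 5 parts + background
pixel_accuracy = Accuracy(top_k=1, task='multiclass', num_classes=6)
pixels_cls_scores = torch.randn(98304, 6)            # [num_pixels, 6]
pixels_cls_score_targets = torch.randint(0, 6, (98304,))
print(pixel_accuracy(pixels_cls_scores, pixels_cls_score_targets))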

You shouldn't have to set these numbers manually; they are handled automatically by the framework, so I guess you are working on some custom code and cannot make it work? Sorry, but if you are working on custom code, I cannot help further without access to it.

It was finally solved by setting the num_classes parameter of the Accuracy metric instance equal to the number of classes in the scores tensor. I updated the num_classes parameter of the self.pred_accuracy instance right before the line where accuracy = self.pred_accuracy(id_cls_scores, pids) is called in the compute_id_cls_loss function, like this:
self.pred_accuracy.num_classes = id_cls_scores.shape[1]
With that change the error went away and the training started!
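For anyone else hitting this, here is roughly where I placed the fix (a simplified sketch, not the actual GiLt_loss.py code):

def compute_id_cls_loss(self, id_cls_scores, pids):
    # ... identity classification loss computed above (omitted) ...
    # align the metric's expected class count with the scores tensor actually passed in
    self.pred_accuracy.num_classes = id_cls_scores.shape[1]
    accuracy = self.pred_accuracy(id_cls_scores, pids)
    # ... rest of the function unchanged ...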
Thank you very much, and I appreciate your response!
GH