Multi-class scenario for metrics
hfaghihi15 opened this issue · comments
@auszok how do you compute the metrics for the multi-class scenario?
@auszok please add the example runs that you had from the CoNLL model here for the multiclass scenario.
Examples that need to be checked for multiclass:
DomiKnowS\examples\AnimalAndFlower\main_shared_resnet_multiclass.py
DomiKnowS\examples\POS_tagging\main.py
Has anyone tested whether this metric works correctly now? @iamdanialkamali @AdmiralDarius @auszok
I checked it multiple times, but I still get:
ValueError: Incompatible lengths for category between inferred results 0 and labels {batch_size}
The code is at HLR/DomiKnowS/examples/AnimalAndFlower/main_shared_resnet_multiclass.py
It worked a few weeks ago. @auszok could you take a look at it?
I ran the code again, and it was working! However, the metric results were not correct.
Ignoring the training phase, these are the results I get when I run program.test(val_ds):
INFO:regr.program.program: - metric:
INFO:regr.program.program: - - ILP
INFO:regr.program.program:{'category': {'P': tensor(0.7448), 'R': tensor(0.4886), 'F1': tensor(0.5901)}, 'tag': {'P': 0.10543840177580466, 'R': 0.10543840177580466, 'F1': 0.10543840177580466}}
INFO:regr.program.program: - - softmax
INFO:regr.program.program:{'category': {'P': tensor(0.7341), 'R': tensor(0.5084), 'F1': tensor(0.6007)}, 'tag': {'P': 0.11542730299667037, 'R': 0.11542730299667037, 'F1': 0.11542730299667037}}
INFO:regr.program.program: - - ILP_softmax_delta
INFO:regr.program.program:{'category': {'P': tensor(0.0107), 'R': tensor(-0.0198), 'F1': tensor(-0.0106)}, 'tag': {'P': -0.009988901220865709, 'R': -0.009988901220865709, 'F1': -0.009988901220865709}}
And these are the results I get when I use sklearn metrics on the predictions and labels extracted from program.populate(val_ds):
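(Roughly how these numbers were produced; a minimal sketch, assuming y_true and y_pred are the label and prediction arrays already extracted from the datanodes. The extraction step is omitted, and the helper name report is just for illustration:)

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             precision_recall_fscore_support)

def report(name, y_true, y_pred):
    # micro / macro / weighted P, R, F1, as printed in the results below
    for avg in ('micro', 'macro', 'weighted'):
        p, r, f1, _ = precision_recall_fscore_support(
            y_true, y_pred, average=avg, zero_division=0)
        print(name, avg, {'f1': f1, 'P': p, 'R': r})
    print(name, 'accuracy_score', accuracy_score(y_true, y_pred))
    # per-class breakdown, as in the classification reports below
    print(classification_report(y_true, y_pred, digits=3, zero_division=0))
```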
"local/softmax"
```
tags micro     {'f1': 0.1076581576026637, 'P': 0.1076581576026637, 'R': 0.1076581576026637}
tags macro     {'f1': 0.07233874444852077, 'P': 0.11014443368133868, 'R': 0.10632568087160382}
tags weighted  {'f1': 0.0804753361670939, 'P': 0.14151848419999646, 'R': 0.1076581576026637}
tags accuracy_score 0.1076581576026637

              precision  recall  f1-score  support
           0      0.057   0.344     0.098       61
           1      0.079   0.085     0.082       59
           2      0.064   0.060     0.062       50
           3      0.000   0.000     0.000       73
           4      0.109   0.041     0.060      122
           5      0.429   0.038     0.070      158
           6      0.073   0.027     0.039      113
           7      0.000   0.000     0.000      116
           8      0.181   0.362     0.241      149
    accuracy                        0.108      901
   macro avg      0.110   0.106     0.072      901
weighted avg      0.142   0.108     0.080      901

category micro     {'f1': 0.5149833518312985, 'P': 0.5149833518312985, 'R': 0.5149833518312985}
category weighted  {'f1': 0.5422392949155195, 'P': 0.6145658250795761, 'R': 0.5149833518312985}
category accuracy_score 0.5149833518312985

              precision  recall  f1-score  support
           0      0.279   0.502     0.358      243
           1      0.739   0.520     0.610      658
    accuracy                        0.515      901
   macro avg      0.509   0.511     0.484      901
weighted avg      0.615   0.515     0.542      901
```
"ILP"
```
tags micro     {'f1': 0.12763596004439512, 'P': 0.12763596004439512, 'R': 0.12763596004439512}
tags macro     {'f1': 0.081618966147602, 'P': 0.11026226835335035, 'R': 0.12251350681604561}
tags weighted  {'f1': 0.09101099218220682, 'P': 0.13077914950922068, 'R': 0.12763596004439512}
tags accuracy_score 0.12763596004439512

              precision  recall  f1-score  support
           0      0.079   0.410     0.133       61
           1      0.074   0.068     0.071       59
           2      0.023   0.020     0.022       50
           3      0.087   0.027     0.042       73
           4      0.143   0.074     0.097      122
           5      0.214   0.019     0.035      158
           6      0.053   0.027     0.035      113
           7      0.111   0.009     0.016      116
           8      0.208   0.450     0.285      149
    accuracy                        0.128      901
   macro avg      0.110   0.123     0.082      901
weighted avg      0.131   0.128     0.091      901

category micro     {'f1': 0.5216426193118757, 'P': 0.5216426193118757, 'R': 0.5216426193118757}
category weighted  {'f1': 0.5485216270362456, 'P': 0.6201095381727602, 'R': 0.5216426193118757}
category accuracy_score 0.5216426193118757

              precision  recall  f1-score  support
           0      0.284   0.510     0.365      243
           1      0.744   0.526     0.616      658
    accuracy                        0.522      901
   macro avg      0.514   0.518     0.491      901
weighted avg      0.620   0.522     0.549      901
```
@iamdanialkamali Inside the metrics I am using sklearn too. By default I calculate macro averages. Have you checked the datanode.log to see if the predictions and labels retrieved from the datanode are the same as yours?
2021-11-23 07:48:59,638 - INFO - dataNode:getInferMetrics - Calling ILP metrics with conceptsRelations - (EnumConcept(name='category', fullname='AnimalAndFlower/category'),)
2021-11-23 07:48:59,639 - INFO - dataNode:getInferMetrics - Calculating metrics for concept category
2021-11-23 07:48:59,657 - INFO - dataNode:getInferMetrics - Concept category predictions from DataNode tensor([[1., 0.],
[1., 0.],
[0., 1.],
[1., 0.],
[1., 0.],
[0., 1.],
[0., 1.],
[0., 1.],
[0., 1.],
[1., 0.],
[1., 0.],
[0., 1.],
[1., 0.],
[1., 0.],
[0., 1.],
[0., 1.],
[1., 0.],
[0., 1.],
[1., 0.],
[1., 0.],
[0., 1.],
[1., 0.],
[1., 0.],
[1., 0.],
[0., 1.],
[1., 0.],
[0., 1.],
[1., 0.],
[0., 1.],
[0., 1.],
[0., 1.],
[0., 1.]])
2021-11-23 07:48:59,657 - INFO - dataNode:getInferMetrics - Concept category labels from DataNode tensor([1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0,
0, 1, 0, 1, 0, 1, 0, 0])
2021-11-23 07:48:59,659 - INFO - dataNode:getInferMetrics - Concept category is Multiclass
2021-11-23 07:48:59,659 - INFO - dataNode:getInferMetrics - Using average macro for Multiclass metrics calculation
2021-11-23 07:48:59,660 - INFO - dataNode:getInferMetrics - Calculating metrics for all class Labels of category
2021-11-23 07:48:59,660 - INFO - dataNode:getInferMetrics - Calling ILP metrics with conceptsRelations - ('animal', 'flower')
2021-11-23 07:48:59,660 - INFO - dataNode:getInferMetrics - Calculating metrics for concept animal
2021-11-23 07:48:59,665 - INFO - dataNode:getInferMetrics - Concept animal predictions from DataNode tensor([1., 1., 0., 1., 1., 0., 0., 0., 0., 1., 1., 0., 1., 1., 0., 0., 1., 0.,
1., 1., 0., 1., 1., 1., 0., 1., 0., 1., 0., 0., 0., 0.])
2021-11-23 07:48:59,665 - INFO - dataNode:getInferMetrics - Concept animal labels from DataNode tensor([1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0,
0, 1, 0, 1, 0, 1, 0, 0])
2021-11-23 07:48:59,665 - INFO - dataNode:getInferMetrics - Index of class Labels animal is 0
2021-11-23 07:48:59,669 - INFO - dataNode:getInferMetrics - Concept animal - labels used for metrics calculation [0 1 1 0 0 1 1 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 1 0 1 0 1 1]
2021-11-23 07:48:59,669 - INFO - dataNode:getInferMetrics - Concept animal - Predictions used for metrics calculation [1 1 0 1 1 0 0 0 0 1 1 0 1 1 0 0 1 0 1 1 0 1 1 1 0 1 0 1 0 0 0 0]
2021-11-23 07:48:59,674 - INFO - dataNode:getInferMetrics - Concept animal confusion matrix [[ 6 13]
[10 3]]
2021-11-23 07:48:59,677 - INFO - dataNode:getInferMetrics - Concept animal precision 0.28125
2021-11-23 07:48:59,678 - INFO - dataNode:getInferMetrics - Concept animal recall 0.2732793522267206
2021-11-23 07:48:59,679 - INFO - dataNode:getInferMetrics - Concept animal f1 0.2748768472906404
2021-11-23 07:48:59,679 - INFO - dataNode:getInferMetrics - Calculating metrics for concept flower
2021-11-23 07:48:59,684 - INFO - dataNode:getInferMetrics - Concept flower predictions from DataNode tensor([0., 0., 1., 0., 0., 1., 1., 1., 1., 0., 0., 1., 0., 0., 1., 1., 0., 1.,
0., 0., 1., 0., 0., 0., 1., 0., 1., 0., 1., 1., 1., 1.])
2021-11-23 07:48:59,684 - INFO - dataNode:getInferMetrics - Concept flower labels from DataNode tensor([1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0,
0, 1, 0, 1, 0, 1, 0, 0])
2021-11-23 07:48:59,685 - INFO - dataNode:getInferMetrics - Index of class Labels flower is 1
2021-11-23 07:48:59,685 - INFO - dataNode:getInferMetrics - Concept flower - labels used for metrics calculation [1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 1 0 1 0 1 0 0]
2021-11-23 07:48:59,686 - INFO - dataNode:getInferMetrics - Concept flower - Predictions used for metrics calculation [0 0 1 0 0 1 1 1 1 0 0 1 0 0 1 1 0 1 0 0 1 0 0 0 1 0 1 0 1 1 1 1]
2021-11-23 07:48:59,686 - INFO - dataNode:getInferMetrics - Concept flower confusion matrix [[ 3 10]
[13 6]]
2021-11-23 07:48:59,688 - INFO - dataNode:getInferMetrics - Concept flower precision 0.28125
2021-11-23 07:48:59,688 - INFO - dataNode:getInferMetrics - Concept flower recall 0.2732793522267206
2021-11-23 07:48:59,689 - INFO - dataNode:getInferMetrics - Concept flower f1 0.2748768472906404
2021-11-23 07:48:59,690 - INFO - dataNode:getInferMetrics - Total precision is tensor(0.2812)
2021-11-23 07:48:59,691 - INFO - dataNode:getInferMetrics - Total recall is tensor(0.2812)
2021-11-23 07:48:59,692 - INFO - dataNode:getInferMetrics - Total F1 is tensor(0.2812)
2021-11-23 07:48:59,693 - INFO - dataNode:getInferMetrics - Concept category - labels used for metrics calculation [1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 1 0 1 0 1 0 0]
2021-11-23 07:48:59,693 - INFO - dataNode:getInferMetrics - Concept category - Predictions used for metrics calculation [0 0 1 0 0 1 1 1 1 0 0 1 0 0 1 1 0 1 0 0 1 0 0 0 1 0 1 0 1 1 1 1]
2021-11-23 07:48:59,694 - INFO - dataNode:getInferMetrics - Concept category confusion matrix [[ 3 10]
[13 6]]
2021-11-23 07:48:59,695 - INFO - dataNode:getInferMetrics - Concept category precision 0.28125
2021-11-23 07:48:59,696 - INFO - dataNode:getInferMetrics - Concept category recall 0.2732793522267206
2021-11-23 07:48:59,697 - INFO - dataNode:getInferMetrics - Concept category f1 0.2748768472906404
2021-11-23 07:48:59,697 - INFO - dataNode:getInferMetrics - Total precision is tensor(0.3750)
2021-11-23 07:48:59,698 - INFO - dataNode:getInferMetrics - Total recall is tensor(0.3158)
2021-11-23 07:48:59,698 - INFO - dataNode:getInferMetrics - Total F1 is tensor(0.3429)
2021-11-23 07:48:59,698 - INFO - dataNodeBuilder:getDataNode - Returning dataNode with id 0 of type image_group
2021-11-23 07:48:59,699 - INFO - dataNode:getInferMetrics - Calling local/argmax metrics with conceptsRelations - (EnumConcept(name='category', fullname='AnimalAndFlower/category'),)
2021-11-23 07:48:59,699 - INFO - dataNode:getInferMetrics - Calculating metrics for concept category
2021-11-23 07:48:59,728 - INFO - dataNode:getInferMetrics - Concept category predictions from DataNode tensor([[1., 0.],
[0., 1.],
[0., 1.],
[1., 0.],
[1., 0.],
[0., 1.],
[0., 1.],
[0., 1.],
[0., 1.],
[1., 0.],
[0., 1.],
[0., 1.],
[1., 0.],
[1., 0.],
[1., 0.],
[0., 1.],
[0., 1.],
[0., 1.],
[1., 0.],
[1., 0.],
[0., 1.],
[1., 0.],
[1., 0.],
[1., 0.],
[0., 1.],
[0., 1.],
[0., 1.],
[1., 0.],
[0., 1.],
[0., 1.],
[0., 1.],
[0., 1.]], device='cuda:0', grad_fn=<StackBackward>)
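As a sanity check, the per-concept numbers in this log are consistent with sklearn's macro averaging. A minimal sketch recomputing the animal metrics from the logged labels and predictions (assuming average='macro', as the log states):

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

# Labels and predictions for concept "animal", copied from the log above.
y_true = np.array([0,1,1,0,0,1,1,0,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,1,1,0,1,0,1,0,1,1])
y_pred = np.array([1,1,0,1,1,0,0,0,0,1,1,0,1,1,0,0,1,0,1,1,0,1,1,1,0,1,0,1,0,0,0,0])

print(confusion_matrix(y_true, y_pred))                  # [[ 6 13] [10  3]]
print(precision_score(y_true, y_pred, average='macro'))  # 0.28125
print(recall_score(y_true, y_pred, average='macro'))     # 0.2732793522267206
print(f1_score(y_true, y_pred, average='macro'))         # 0.2748768472906404
```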
I checked it again. Strangely, the predictions from .populate() and .test() were different; I don't know why. Is it due to the fact that the model wasn't trained yet?
Regarding the metrics: the results are fine for Category (the superclass), which is a binary classification, but the results for Tag (the sub-classes) are not correct. I wrote out all the predictions and labels for each of these cases (generated in dataNode.py) and checked the macro avg using classification_report. I think the number that gets generated for Tag is accuracy instead of the F1 score; in that case, the numbers are correct!
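For what it's worth, in single-label multiclass classification the micro-averaged precision, recall, and F1 all collapse to the same number, which equals accuracy; that is exactly the P = R = F1 pattern of the tag results above. A minimal sketch with synthetic data (the 9 classes and 901 samples just mirror the report sizes):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

rng = np.random.default_rng(0)
y_true = rng.integers(0, 9, size=901)  # 9 tag classes, 901 samples
y_pred = rng.integers(0, 9, size=901)

# Every wrong prediction is simultaneously a FP for the predicted class
# and a FN for the true class, so pooled (micro) P = R = F1 = accuracy.
p = precision_score(y_true, y_pred, average='micro')
r = recall_score(y_true, y_pred, average='micro')
f1 = f1_score(y_true, y_pred, average='micro')
acc = accuracy_score(y_true, y_pred)
assert p == r == f1 == acc
```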
INFO:regr.program.program: - loss:
INFO:regr.program.program:{'category': tensor(0.7737), 'tag': tensor(2.3944)}
INFO:regr.program.program: - metric:
INFO:regr.program.program: - - ILP
INFO:regr.program.program:{'category': {'P': tensor(0.7448), 'R': tensor(0.4886), 'F1': tensor(0.5901)}, 'tag': {'P': 0.10543840177580466, 'R': 0.10543840177580466, 'F1': 0.10543840177580466}}
INFO:regr.program.program: - - softmax
INFO:regr.program.program:{'category': {'P': tensor(0.7341), 'R': tensor(0.5084), 'F1': tensor(0.6007)}, 'tag': {'P': 0.11542730299667037, 'R': 0.11542730299667037, 'F1': 0.11542730299667037}}
INFO:regr.program.program: - - ILP_softmax_delta
INFO:regr.program.program:{'category': {'P': tensor(0.0107), 'R': tensor(-0.0198), 'F1': tensor(-0.0106)}, 'tag': {'P': -0.009988901220865709, 'R': -0.009988901220865709, 'F1': -0.009988901220865709}}
category_ilp.txt
```
              precision  recall  f1-score  support
           0      0.284   0.547     0.374      243
           1      0.745   0.489     0.591      658
    accuracy                        0.505      901
   macro avg      0.514   0.518     0.482      901
weighted avg      0.621   0.505     0.532      901
```
category_softmax.txt
```
              precision  recall  f1-score  support
           0      0.274   0.502     0.355      243
           1      0.735   0.509     0.601      658
    accuracy                        0.507      901
   macro avg      0.504   0.506     0.478      901
weighted avg      0.610   0.507     0.535      901
```
tag_ilp.txt
```
              precision  recall  f1-score  support
           0      0.081   0.459     0.138       61
           1      0.026   0.017     0.020       59
           2      0.018   0.020     0.019       50
           3      0.133   0.055     0.078       73
           4      0.127   0.066     0.086      122
           5      0.083   0.006     0.012      158
           6      0.043   0.018     0.025      113
           7      0.000   0.000     0.000      116
           8      0.167   0.336     0.223      149
    accuracy                        0.105      901
   macro avg      0.075   0.108     0.067      901
weighted avg      0.084   0.105     0.072      901
```
tag_softmax.txt
```
              precision  recall  f1-score  support
           0      0.080   0.492     0.137       61
           1      0.042   0.034     0.037       59
           2      0.059   0.060     0.059       50
           3      0.111   0.055     0.073       73
           4      0.176   0.074     0.104      122
           5      0.214   0.019     0.035      158
           6      0.026   0.009     0.013      113
           7      0.250   0.009     0.017      116
           8      0.181   0.342     0.237      149
    accuracy                        0.115      901
   macro avg      0.127   0.121     0.079      901
weighted avg      0.147   0.115     0.084      901
```
@iamdanialkamali It looks like the printed metrics are not calculated by my code but by this:
```python
import torch
# MetricTracker, CMWithLogitsMetric and wrap_batch come from the
# DomiKnowS regr package.

class PRF1Tracker(MetricTracker):
    def __init__(self, metric=CMWithLogitsMetric()):
        super().__init__(metric)

    def forward(self, values):
        # Pool the confusion-matrix counts over the whole batch.
        CM = wrap_batch(values)
        if isinstance(CM['TP'], list):
            tp = sum(CM['TP'])
        else:
            tp = CM['TP'].sum().float()
        if isinstance(CM['FP'], list):
            fp = sum(CM['FP'])
        else:
            fp = CM['FP'].sum().float()
        if isinstance(CM['FN'], list):
            fn = sum(CM['FN'])
        else:
            fn = CM['FN'].sum().float()
        # P/R/F1 from a single pooled TP/FP/FN triple: fine for binary
        # concepts, but for a multiclass concept this is micro averaging.
        if tp:
            p = tp / (tp + fp)
            r = tp / (tp + fn)
            f1 = 2 * p * r / (p + r)
        else:
            p = torch.zeros_like(torch.tensor(tp))
            r = torch.zeros_like(torch.tensor(tp))
            f1 = torch.zeros_like(torch.tensor(tp))
        return {'P': p, 'R': r, 'F1': f1}
```
I guess we have to fix this for multiclass.
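For context, the tracker above pools everything into a single TP/FP/FN triple, which is only meaningful for binary concepts; on a multiclass concept it degenerates into micro averaging (i.e. accuracy). A minimal sketch of a macro-averaged multiclass alternative (an illustration only, not the actual fix that landed; it assumes integer class labels and argmax predictions as inputs):

```python
import torch

def macro_prf1(y_true: torch.Tensor, y_pred: torch.Tensor, num_classes: int):
    """Macro-averaged P/R/F1 for single-label multiclass predictions."""
    ps, rs, f1s = [], [], []
    for c in range(num_classes):
        tp = ((y_pred == c) & (y_true == c)).sum().float()
        fp = ((y_pred == c) & (y_true != c)).sum().float()
        fn = ((y_pred != c) & (y_true == c)).sum().float()
        p = tp / (tp + fp) if (tp + fp) > 0 else torch.tensor(0.)
        r = tp / (tp + fn) if (tp + fn) > 0 else torch.tensor(0.)
        f1 = 2 * p * r / (p + r) if (p + r) > 0 else torch.tensor(0.)
        ps.append(p); rs.append(r); f1s.append(f1)
    # Averaging per-class scores keeps minority tag classes visible,
    # unlike the pooled counts in PRF1Tracker above.
    return {'P': torch.stack(ps).mean(),
            'R': torch.stack(rs).mean(),
            'F1': torch.stack(f1s).mean()}

# e.g. macro_prf1(torch.tensor([0, 1, 2, 2]), torch.tensor([0, 2, 2, 2]), 3)
```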
@iamdanialkamali The metrics should be fixed now. Please check their correctness whenever you have time.
Please don't close the issue until we get confirmation that the fix works correctly.
@AdmiralDarius Well done. I checked the results for both binary and multiclass; both work correctly.