HLR / DomiKnowS


Multi-class scenario for metrics

hfaghihi15 opened this issue

@auszok how do you compute the metrics for the multi-class scenario?

@auszok please add the example runs that you had from the Conll model here for the multiclass scenario.

Examples that need to be checked for multiclass:

DomiKnowS\examples\AnimalAndFlower\main_shared_resnet_multiclass.py
DomiKnowS\examples\POS_tagging\main.py

Has anyone tested whether this metric works correctly now? @iamdanialkamali @AdmiralDarius @auszok

I checked it multiple times but I still get

ValueError: Incompatible lengths for category between inferred results 0 and labels {batch_size}

The code is at HLR/DomiKnowS/examples/AnimalAndFlower/main_shared_resnet_multiclass.py.
It worked a few weeks ago. @auszok could you take a look at it?

I ran the code again; it was working. However, the metric results were not correct.

Ignoring the training phase, these are the results I get when I run program.test(val_ds):

INFO:regr.program.program: - metric:
INFO:regr.program.program: - - ILP
INFO:regr.program.program:{'category': {'P': tensor(0.7448), 'R': tensor(0.4886), 'F1': tensor(0.5901)}, 'tag': {'P': 0.10543840177580466, 'R': 0.10543840177580466, 'F1': 0.10543840177580466}}
INFO:regr.program.program: - - softmax
INFO:regr.program.program:{'category': {'P': tensor(0.7341), 'R': tensor(0.5084), 'F1': tensor(0.6007)}, 'tag': {'P': 0.11542730299667037, 'R': 0.11542730299667037, 'F1': 0.11542730299667037}}
INFO:regr.program.program: - - ILP_softmax_delta
INFO:regr.program.program:{'category': {'P': tensor(0.0107), 'R': tensor(-0.0198), 'F1': tensor(-0.0106)}, 'tag': {'P': -0.009988901220865709, 'R': -0.009988901220865709, 'F1': -0.009988901220865709}}

These are the results I get when I use sklearn metrics on the predictions and labels extracted from program.populate(val_ds); the check is roughly the sketch below.
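(A minimal sketch of the sklearn check; the variable names and the way predictions and labels are pulled out of the populated datanodes are assumptions, not the exact script.)

```python
# Sketch of the sklearn-based comparison (assumed workflow, not the exact script).
# tag_labels/tag_preds and category_labels/category_preds are plain integer class
# indices collected from the datanodes returned by program.populate(val_ds);
# how exactly they are extracted is assumed here.
from sklearn.metrics import (accuracy_score, classification_report,
                             precision_recall_fscore_support)

def report(name, labels, preds):
    for avg in ('micro', 'macro', 'weighted'):
        p, r, f1, _ = precision_recall_fscore_support(
            labels, preds, average=avg, zero_division=0)
        print(name, avg, {'f1': f1, 'P': p, 'R': r})
    print(name, 'accuracy_score', accuracy_score(labels, preds))
    # Per-class breakdown, matching the tables below
    print(classification_report(labels, preds, digits=3, zero_division=0))

report('tags', tag_labels, tag_preds)                  # 9-class Tag labels
report('category', category_labels, category_preds)    # binary Animal/Flower category
```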

"local/softmax"


tags micro {'f1': 0.1076581576026637, 'P': 0.1076581576026637, 'R': 0.1076581576026637}
tags macro {'f1': 0.07233874444852077, 'P': 0.11014443368133868, 'R': 0.10632568087160382}
tags weighted {'f1': 0.0804753361670939, 'P': 0.14151848419999646, 'R': 0.1076581576026637}
tags accuracy_score 0.1076581576026637
              precision    recall  f1-score   support

           0      0.057     0.344     0.098        61
           1      0.079     0.085     0.082        59
           2      0.064     0.060     0.062        50
           3      0.000     0.000     0.000        73
           4      0.109     0.041     0.060       122
           5      0.429     0.038     0.070       158
           6      0.073     0.027     0.039       113
           7      0.000     0.000     0.000       116
           8      0.181     0.362     0.241       149

    accuracy                          0.108       901
   macro avg      0.110     0.106     0.072       901
weighted avg      0.142     0.108     0.080       901

category micro {'f1': 0.5149833518312985, 'P': 0.5149833518312985, 'R': 0.5149833518312985}
category weighted {'f1': 0.5422392949155195, 'P': 0.6145658250795761, 'R': 0.5149833518312985}
category accuracy_score 0.5149833518312985
              precision    recall  f1-score   support

           0      0.279     0.502     0.358       243
           1      0.739     0.520     0.610       658

    accuracy                          0.515       901
   macro avg      0.509     0.511     0.484       901
weighted avg      0.615     0.515     0.542       901


"ILP"


tags micro {'f1': 0.12763596004439512, 'P': 0.12763596004439512, 'R': 0.12763596004439512}
tags macro {'f1': 0.081618966147602, 'P': 0.11026226835335035, 'R': 0.12251350681604561}
tags weighted {'f1': 0.09101099218220682, 'P': 0.13077914950922068, 'R': 0.12763596004439512}
tags accuracy_score 0.12763596004439512
              precision    recall  f1-score   support

           0      0.079     0.410     0.133        61
           1      0.074     0.068     0.071        59
           2      0.023     0.020     0.022        50
           3      0.087     0.027     0.042        73
           4      0.143     0.074     0.097       122
           5      0.214     0.019     0.035       158
           6      0.053     0.027     0.035       113
           7      0.111     0.009     0.016       116
           8      0.208     0.450     0.285       149

    accuracy                          0.128       901
   macro avg      0.110     0.123     0.082       901
weighted avg      0.131     0.128     0.091       901

category micro {'f1': 0.5216426193118757, 'P': 0.5216426193118757, 'R': 0.5216426193118757}
category weighted {'f1': 0.5485216270362456, 'P': 0.6201095381727602, 'R': 0.5216426193118757}
category accuracy_score 0.5216426193118757
              precision    recall  f1-score   support

           0      0.284     0.510     0.365       243
           1      0.744     0.526     0.616       658

    accuracy                          0.522       901
   macro avg      0.514     0.518     0.491       901
weighted avg      0.620     0.522     0.549       901


@iamdanialkamali Inside the metrics I am using sklearn too; by default I calculate macro averages. Have you checked datanode.log to see whether the predictions and labels retrieved from the datanode are the same as yours? The relevant excerpt is below, after a short sketch of the macro averaging.
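(Illustrative sketch of the macro averaging for one binary concept such as animal; this is not the actual getInferMetrics code, just the same calculation done with sklearn directly. labels and preds stand for the 0/1 vectors shown in the log excerpt.)

```python
# Macro-averaged precision/recall/F1 for one concept (e.g. "animal").
# Illustrative only; the framework's implementation is in dataNode.getInferMetrics.
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

cm = confusion_matrix(labels, preds)
p  = precision_score(labels, preds, average='macro', zero_division=0)
r  = recall_score(labels, preds, average='macro', zero_division=0)
f1 = f1_score(labels, preds, average='macro', zero_division=0)
print(cm)          # e.g. [[ 6 13] [10  3]] for "animal" in the log below
print(p, r, f1)    # ~0.2812, 0.2733, 0.2749 -- matching the logged values
```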

2021-11-23 07:48:59,638 - INFO - dataNode:getInferMetrics - Calling ILP metrics with conceptsRelations - (EnumConcept(name='category', fullname='AnimalAndFlower/category'),)
2021-11-23 07:48:59,639 - INFO - dataNode:getInferMetrics - Calculating metrics for concept category
2021-11-23 07:48:59,657 - INFO - dataNode:getInferMetrics - Concept category predictions from DataNode tensor([[1., 0.],
        [1., 0.],
        [0., 1.],
        [1., 0.],
        [1., 0.],
        [0., 1.],
        [0., 1.],
        [0., 1.],
        [0., 1.],
        [1., 0.],
        [1., 0.],
        [0., 1.],
        [1., 0.],
        [1., 0.],
        [0., 1.],
        [0., 1.],
        [1., 0.],
        [0., 1.],
        [1., 0.],
        [1., 0.],
        [0., 1.],
        [1., 0.],
        [1., 0.],
        [1., 0.],
        [0., 1.],
        [1., 0.],
        [0., 1.],
        [1., 0.],
        [0., 1.],
        [0., 1.],
        [0., 1.],
        [0., 1.]])
2021-11-23 07:48:59,657 - INFO - dataNode:getInferMetrics - Concept category labels from DataNode tensor([1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0,
        0, 1, 0, 1, 0, 1, 0, 0])
2021-11-23 07:48:59,659 - INFO - dataNode:getInferMetrics - Concept category is Multiclass 
2021-11-23 07:48:59,659 - INFO - dataNode:getInferMetrics - Using average macro for Multiclass metrics calculation
2021-11-23 07:48:59,660 - INFO - dataNode:getInferMetrics - Calculating metrics for all class Labels of  category 
2021-11-23 07:48:59,660 - INFO - dataNode:getInferMetrics - Calling ILP metrics with conceptsRelations - ('animal', 'flower')
2021-11-23 07:48:59,660 - INFO - dataNode:getInferMetrics - Calculating metrics for concept animal
2021-11-23 07:48:59,665 - INFO - dataNode:getInferMetrics - Concept animal predictions from DataNode tensor([1., 1., 0., 1., 1., 0., 0., 0., 0., 1., 1., 0., 1., 1., 0., 0., 1., 0.,
        1., 1., 0., 1., 1., 1., 0., 1., 0., 1., 0., 0., 0., 0.])
2021-11-23 07:48:59,665 - INFO - dataNode:getInferMetrics - Concept animal labels from DataNode tensor([1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0,
        0, 1, 0, 1, 0, 1, 0, 0])
2021-11-23 07:48:59,665 - INFO - dataNode:getInferMetrics - Index of class Labels animal is 0
2021-11-23 07:48:59,669 - INFO - dataNode:getInferMetrics - Concept animal - labels used for metrics calculation [0 1 1 0 0 1 1 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 1 0 1 0 1 1]
2021-11-23 07:48:59,669 - INFO - dataNode:getInferMetrics - Concept animal - Predictions used for metrics calculation [1 1 0 1 1 0 0 0 0 1 1 0 1 1 0 0 1 0 1 1 0 1 1 1 0 1 0 1 0 0 0 0]
2021-11-23 07:48:59,674 - INFO - dataNode:getInferMetrics - Concept animal confusion matrix [[ 6 13]
 [10  3]]
2021-11-23 07:48:59,677 - INFO - dataNode:getInferMetrics - Concept animal precision 0.28125
2021-11-23 07:48:59,678 - INFO - dataNode:getInferMetrics - Concept animal recall 0.2732793522267206
2021-11-23 07:48:59,679 - INFO - dataNode:getInferMetrics - Concept animal f1 0.2748768472906404
2021-11-23 07:48:59,679 - INFO - dataNode:getInferMetrics - Calculating metrics for concept flower
2021-11-23 07:48:59,684 - INFO - dataNode:getInferMetrics - Concept flower predictions from DataNode tensor([0., 0., 1., 0., 0., 1., 1., 1., 1., 0., 0., 1., 0., 0., 1., 1., 0., 1.,
        0., 0., 1., 0., 0., 0., 1., 0., 1., 0., 1., 1., 1., 1.])
2021-11-23 07:48:59,684 - INFO - dataNode:getInferMetrics - Concept flower labels from DataNode tensor([1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0,
        0, 1, 0, 1, 0, 1, 0, 0])
2021-11-23 07:48:59,685 - INFO - dataNode:getInferMetrics - Index of class Labels flower is 1
2021-11-23 07:48:59,685 - INFO - dataNode:getInferMetrics - Concept flower - labels used for metrics calculation [1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 1 0 1 0 1 0 0]
2021-11-23 07:48:59,686 - INFO - dataNode:getInferMetrics - Concept flower - Predictions used for metrics calculation [0 0 1 0 0 1 1 1 1 0 0 1 0 0 1 1 0 1 0 0 1 0 0 0 1 0 1 0 1 1 1 1]
2021-11-23 07:48:59,686 - INFO - dataNode:getInferMetrics - Concept flower confusion matrix [[ 3 10]
 [13  6]]
2021-11-23 07:48:59,688 - INFO - dataNode:getInferMetrics - Concept flower precision 0.28125
2021-11-23 07:48:59,688 - INFO - dataNode:getInferMetrics - Concept flower recall 0.2732793522267206
2021-11-23 07:48:59,689 - INFO - dataNode:getInferMetrics - Concept flower f1 0.2748768472906404
2021-11-23 07:48:59,690 - INFO - dataNode:getInferMetrics - Total precision is tensor(0.2812)
2021-11-23 07:48:59,691 - INFO - dataNode:getInferMetrics - Total recall is tensor(0.2812)
2021-11-23 07:48:59,692 - INFO - dataNode:getInferMetrics - Total F1 is tensor(0.2812)
2021-11-23 07:48:59,693 - INFO - dataNode:getInferMetrics - Concept category - labels used for metrics calculation [1 0 0 1 1 0 0 1 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 1 0 1 0 1 0 0]
2021-11-23 07:48:59,693 - INFO - dataNode:getInferMetrics - Concept category - Predictions used for metrics calculation [0 0 1 0 0 1 1 1 1 0 0 1 0 0 1 1 0 1 0 0 1 0 0 0 1 0 1 0 1 1 1 1]
2021-11-23 07:48:59,694 - INFO - dataNode:getInferMetrics - Concept category confusion matrix [[ 3 10]
 [13  6]]
2021-11-23 07:48:59,695 - INFO - dataNode:getInferMetrics - Concept category precision 0.28125
2021-11-23 07:48:59,696 - INFO - dataNode:getInferMetrics - Concept category recall 0.2732793522267206
2021-11-23 07:48:59,697 - INFO - dataNode:getInferMetrics - Concept category f1 0.2748768472906404
2021-11-23 07:48:59,697 - INFO - dataNode:getInferMetrics - Total precision is tensor(0.3750)
2021-11-23 07:48:59,698 - INFO - dataNode:getInferMetrics - Total recall is tensor(0.3158)
2021-11-23 07:48:59,698 - INFO - dataNode:getInferMetrics - Total F1 is tensor(0.3429)
2021-11-23 07:48:59,698 - INFO - dataNodeBuilder:getDataNode - Returning dataNode with id 0 of type image_group
2021-11-23 07:48:59,699 - INFO - dataNode:getInferMetrics - Calling local/argmax metrics with conceptsRelations - (EnumConcept(name='category', fullname='AnimalAndFlower/category'),)
2021-11-23 07:48:59,699 - INFO - dataNode:getInferMetrics - Calculating metrics for concept category
2021-11-23 07:48:59,728 - INFO - dataNode:getInferMetrics - Concept category predictions from DataNode tensor([[1., 0.],
        [0., 1.],
        [0., 1.],
        [1., 0.],
        [1., 0.],
        [0., 1.],
        [0., 1.],
        [0., 1.],
        [0., 1.],
        [1., 0.],
        [0., 1.],
        [0., 1.],
        [1., 0.],
        [1., 0.],
        [1., 0.],
        [0., 1.],
        [0., 1.],
        [0., 1.],
        [1., 0.],
        [1., 0.],
        [0., 1.],
        [1., 0.],
        [1., 0.],
        [1., 0.],
        [0., 1.],
        [0., 1.],
        [0., 1.],
        [1., 0.],
        [0., 1.],
        [0., 1.],
        [0., 1.],
        [0., 1.]], device='cuda:0', grad_fn=<StackBackward>)

I checked it again. Strangely, the predictions from .populate() and .test() were different; I don't know why. Is it due to the fact that the model wasn't trained yet?
Regarding the metrics: the results are fine for Category (the superclass), which is a binary classification, but the results for Tag (the sub-classes) are not correct. I wrote out all the predictions and labels for each of these cases (generated in dataNode.py) and checked the macro average using classification_report. I think the number reported for Tag is accuracy instead of the F1 score; in that case, the numbers are correct!
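(If the Tag number is a micro average, that would explain it: for single-label multiclass predictions, micro-averaged precision, recall, and F1 all reduce to plain accuracy, which is consistent with the three identical 'tag' values in the logs. A quick way to see this, as an illustrative sketch rather than the framework's code:)

```python
# For single-label multiclass data, micro P = micro R = micro F1 = accuracy,
# so identical P/R/F1 values for 'tag' point to an accuracy-like (micro) average.
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

rng = np.random.default_rng(0)
labels = rng.integers(0, 9, size=901)   # 9 tag classes, 901 samples as in the reports
preds  = rng.integers(0, 9, size=901)

p, r, f1, _ = precision_recall_fscore_support(labels, preds, average='micro')
print(p, r, f1, accuracy_score(labels, preds))  # all four agree (up to float rounding)
```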

INFO:regr.program.program: - loss:
INFO:regr.program.program:{'category': tensor(0.7737), 'tag': tensor(2.3944)}
INFO:regr.program.program: - metric:
INFO:regr.program.program: - - ILP
INFO:regr.program.program:{'category': {'P': tensor(0.7448), 'R': tensor(0.4886), 'F1': tensor(0.5901)}, 'tag': {'P': 0.10543840177580466, 'R': 0.10543840177580466, 'F1': 0.10543840177580466}}
INFO:regr.program.program: - - softmax
INFO:regr.program.program:{'category': {'P': tensor(0.7341), 'R': tensor(0.5084), 'F1': tensor(0.6007)}, 'tag': {'P': 0.11542730299667037, 'R': 0.11542730299667037, 'F1': 0.11542730299667037}}
INFO:regr.program.program: - - ILP_softmax_delta
INFO:regr.program.program:{'category': {'P': tensor(0.0107), 'R': tensor(-0.0198), 'F1': tensor(-0.0106)}, 'tag': {'P': -0.009988901220865709, 'R': -0.009988901220865709, 'F1': -0.009988901220865709}}

category_ilp.txt
              precision    recall  f1-score   support

           0      0.284     0.547     0.374       243
           1      0.745     0.489     0.591       658

    accuracy                          0.505       901
   macro avg      0.514     0.518     0.482       901
weighted avg      0.621     0.505     0.532       901

category_softmax.txt
              precision    recall  f1-score   support

           0      0.274     0.502     0.355       243
           1      0.735     0.509     0.601       658

    accuracy                          0.507       901
   macro avg      0.504     0.506     0.478       901
weighted avg      0.610     0.507     0.535       901

tag_ilp.txt
              precision    recall  f1-score   support

           0      0.081     0.459     0.138        61
           1      0.026     0.017     0.020        59
           2      0.018     0.020     0.019        50
           3      0.133     0.055     0.078        73
           4      0.127     0.066     0.086       122
           5      0.083     0.006     0.012       158
           6      0.043     0.018     0.025       113
           7      0.000     0.000     0.000       116
           8      0.167     0.336     0.223       149

    accuracy                          0.105       901
   macro avg      0.075     0.108     0.067       901
weighted avg      0.084     0.105     0.072       901

tag_softmax.txt
              precision    recall  f1-score   support

           0      0.080     0.492     0.137        61
           1      0.042     0.034     0.037        59
           2      0.059     0.060     0.059        50
           3      0.111     0.055     0.073        73
           4      0.176     0.074     0.104       122
           5      0.214     0.019     0.035       158
           6      0.026     0.009     0.013       113
           7      0.250     0.009     0.017       116
           8      0.181     0.342     0.237       149

    accuracy                          0.115       901
   macro avg      0.127     0.121     0.079       901
weighted avg      0.147     0.115     0.084       901

@iamdanialkamali It looks like the printed metric is not calculated by my code but by this class:

```python
# PRF1Tracker as defined in the framework's metric module (MetricTracker,
# CMWithLogitsMetric and wrap_batch come from the same module).
import torch

class PRF1Tracker(MetricTracker):
    def __init__(self, metric=CMWithLogitsMetric()):
        super().__init__(metric)

    def forward(self, values):
        # Aggregate per-batch confusion-matrix counts into single TP/FP/FN totals
        CM = wrap_batch(values)

        if isinstance(CM['TP'], list):
            tp = sum(CM['TP'])
        else:
            tp = CM['TP'].sum().float()

        if isinstance(CM['FP'], list):
            fp = sum(CM['FP'])
        else:
            fp = CM['FP'].sum().float()

        if isinstance(CM['FN'], list):
            fn = sum(CM['FN'])
        else:
            fn = CM['FN'].sum().float()

        if tp:
            p = tp / (tp + fp)
            r = tp / (tp + fn)
            f1 = 2 * p * r / (p + r)
        else:
            p = torch.zeros_like(torch.tensor(tp))
            r = torch.zeros_like(torch.tensor(tp))
            f1 = torch.zeros_like(torch.tensor(tp))
        return {'P': p, 'R': r, 'F1': f1}
```

I guess we have to fix this for multiclass.
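One possible direction for the multiclass case (a sketch only, with a hypothetical class name; not necessarily the fix that ended up in the repository) is to accumulate a full confusion matrix and macro-average the per-class precision/recall/F1:

```python
# Sketch of a macro-averaged multiclass tracker. The name MacroPRF1 and its
# interface are hypothetical; this only illustrates the arithmetic, not how it
# would be wired into MetricTracker/forward above.
import torch

class MacroPRF1:
    def __init__(self, num_classes):
        self.cm = torch.zeros(num_classes, num_classes, dtype=torch.long)

    def update(self, logits, labels):
        # logits: (N, C) scores; labels: (N,) integer class indices
        preds = logits.argmax(dim=-1)
        for t, p in zip(labels.view(-1).tolist(), preds.view(-1).tolist()):
            self.cm[t, p] += 1

    def value(self):
        cm = self.cm.float()
        tp = cm.diag()
        fp = cm.sum(dim=0) - tp            # predicted as class c, but wrong
        fn = cm.sum(dim=1) - tp            # true class c, but missed
        p = tp / (tp + fp).clamp(min=1)    # per-class precision (0 if never predicted)
        r = tp / (tp + fn).clamp(min=1)    # per-class recall (0 if no true samples)
        f1 = 2 * p * r / (p + r).clamp(min=1e-12)
        # macro average over classes, like sklearn's average='macro'
        return {'P': p.mean(), 'R': r.mean(), 'F1': f1.mean()}
```

The part that actually needs the fix is PRF1Tracker.forward above, which currently assumes the binary TP/FP/FN counts coming from CMWithLogitsMetric.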

@iamdanialkamali The metrics should be fixed now. Please check their correctness whenever you have time.

Please don't close the issue until we get the confirmation about the fix working correctly.

@AdmiralDarius Well done. I checked the results for both binary and multiclass; both work correctly.