LayneH / SAT-selective-cls

Self-Adaptive Training for Selective Classification.

Criteria for training and test are mixed up

DingQiang2018 opened this issue · comments

SAT-selective-cls/train.py

Lines 200 to 201 in dc55593

train_loss, train_acc = train(trainloader, model, criterion, optimizer, epoch, use_cuda)
test_loss, test_acc = test(testloader, model, criterion, epoch, use_cuda)

It might be a mistake to use the same criterion in the train and test functions, which mixes up the model's prediction history on the training set with that on the test set.
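A minimal sketch of the kind of fix being described, assuming test() accepts any standard loss with the usual (output, target) interface; the eval_criterion name is hypothetical, while the other names follow the snippet above:

# keep the stateful SAT criterion for training only; evaluate with a stateless
# cross-entropy loss so the test set never touches the SAT prediction history
import torch.nn as nn

eval_criterion = nn.CrossEntropyLoss()

train_loss, train_acc = train(trainloader, model, criterion, optimizer, epoch, use_cuda)
test_loss, test_acc = test(testloader, model, eval_criterion, epoch, use_cuda)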

commented

Thank you for pointing it out.
This is indeed a bug in our code; we should not pass the SAT criterion to the test() function.

I have rerun the experiments after fixing this bug and found that the performance is slightly improved.

Could you push your updated code to this repository? I did not get better performance after I fixed the bug and reran the experiments.

commented

Hi,

Please refer to the latest commit. The scripts should produce slightly better results (if the difference is noticeable at all) than the reported ones.

Hi,
I find that even though I use the updated code, I cannot reproduce the CIFAR10 results reported in your paper. My results are the following:

coverage	mean	std dev
100	6.008	0.138
95	3.724	0.028
90	2.064	0.045
85	1.187	0.031
80	0.656	0.002
75	0.406	0.051
70	0.298	0.055

As the table shows, the selective error rate at 95% coverage is 3.72%, which is far from the reported (3.37±0.05)%. Could you help me solve this problem?

I am sorry for not explaining mean and standard deviation in the last comment. In that table, mean and std dev refer to the mean and standard deviation of the selective error rate, computed over 3 trials.
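For reference, a minimal sketch of how the selective error rate at a given coverage is typically computed (sort test samples by confidence, keep the most confident fraction, and measure the error on that subset); selective_error is a hypothetical helper, not code from this repository:

import numpy as np

def selective_error(confidence, correct, coverage):
    # confidence: (N,) confidence scores, e.g. max softmax probability
    # correct:    (N,) booleans, True where the prediction is correct
    # coverage:   fraction in (0, 1], e.g. 0.95 for 95% coverage
    order = np.argsort(-confidence)             # most confident first
    n_kept = int(round(coverage * len(order)))  # number of accepted samples
    kept = order[:n_kept]
    return 1.0 - correct[kept].mean()           # error on the accepted samples

# e.g. the 95% coverage entry corresponds to selective_error(conf, correct, 0.95)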

commented

Hi,

It seems that most entries are pretty close to or better than the reported ones in the paper except the case of 95% coverage.

I have checked the experiment logs and found that some of the CIFAR10 experiments (but none of the experiments on other datasets) are based on an earlier implementation of SAT, which differs slightly from the current implementation in this line:

# current implementation
soft_label[torch.arange(y.shape[0]), y] = prob[torch.arange(y.shape[0]), y]
# earlier implementation
soft_label[torch.arange(y.shape[0]), y] = 1

You can try this variant to see how it performs.
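For context, a rough sketch of where that line sits in a SAT-style soft-label update; the momentum form and the variable names here are assumptions based on the snippet above, not the repository's exact code:

import torch

def update_soft_label(soft_label, prob, y, momentum=0.9, earlier_variant=False):
    # soft_label: (N, C) running soft targets, prob: (N, C) current softmax
    # outputs, y: (N,) ground-truth labels
    soft_label = momentum * soft_label + (1 - momentum) * prob
    idx = torch.arange(y.shape[0])
    if earlier_variant:
        soft_label[idx, y] = 1                  # earlier implementation
    else:
        soft_label[idx, y] = prob[idx, y]       # current implementation
    return soft_label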

Hi,
I reran the experiments with the earlier implementation of SAT and got the following results. In this table, mean and std dev refer to the mean and standard deviation of the selective error rate.

coverage	mean	std dev
100	5.854	0.216
95	3.603	0.133
90	1.978	0.117
85	1.109	0.046
80	0.683	0.070
75	0.433	0.044
70	0.303	0.031

The performance is better than that of the current implementation of SAT. However, the selective error rate at 95% coverage, 3.603%, is still not as good as the reported (3.37±0.05)% in your paper. Perhaps there is a clerical mistake in the paper?

Interesting reproduction analysis, did this eventually get resolved?
Should one use the main branch for reproductions?

No, I gave up. This repository does not provide the random seed manualSeed, making it challenging to reproduce the results.

Might I ask you if you know of any other selective classification methods that 'actually work'?
I was looking into self-adaptive training as well, which seems related.

As far as I know, Deep Ensembles [1] really work and might be the most powerful method. However, considering the heavy computational overhead of ensembles, recent work in selective classification focuses on individual models. These methods (e.g., [2][3]) show only marginal improvement over Softmax Response [4]. The advances in this line of work seem neither significant nor exciting. Nevertheless, my survey might not be comprehensive; more comprehensive overviews can be found in [5][6].
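As a point of reference, Softmax Response [4] amounts to thresholding the maximum softmax probability; a minimal sketch (hypothetical names, PyTorch assumed):

import torch

def softmax_response(logits, threshold=0.9):
    # logits: (N, C). Predict as usual, but abstain whenever the maximum
    # softmax probability falls below the threshold.
    prob = torch.softmax(logits, dim=1)
    confidence, prediction = prob.max(dim=1)
    accept = confidence >= threshold            # False means abstain
    return prediction, accept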

[1] Lakshminarayanan et al. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. In NIPS, 2017.
[2] Liu et al. Deep Gamblers: Learning to Abstain with Portfolio Theory. In NeurIPS, 2019.
[3] Feng et al. Towards Better Selective Classification. In ICLR, 2023.
[4] Geifman and El-Yaniv. Selective Classification for Deep Neural Networks. In NIPS, 2017.
[5] Gawlikowski et al. A Survey of Uncertainty in Deep Neural Networks. arXiv:2107.03342.
[6] Galil et al. What Can We Learn From the Selective Prediction and Uncertainty Estimation Performance of 523 ImageNet Classifiers? In ICLR, 2023.