Benchmark results
nurlanov-zh opened this issue
Hello,
Thanks for your efforts in benchmarking the existing verification methods.
I have a question regarding the results in Table 3 of the Appendix of your paper (https://arxiv.org/pdf/2009.04131.pdf). Would it be possible to also provide numbers that differentiate between loose bounds and slow verification? That is, what percentage of examples were counted as "non-robust" because the time limit was reached?
I think this would help estimate how the methods would perform under different time limits, and how tight the bounds of the efficient methods are.
Thanks,
Zhakshylyk Nurlanov
Thanks for your interest in our benchmarking results and our paper!
In our benchmark, it is almost always the case that, within one setting, the evaluated approach either terminates on every instance or times out on every instance (one network + training method + epsilon combination is one setting). To distinguish these two outcomes, you can look at our page here: https://sokcertifiedrobustness.github.io/benchmark/. For deterministic verification approaches (probabilistic ones do not time out), the "Full Results" tab records both the certified accuracy/radius and the average running time. If an approach always reports a running time of "60.00s", then, since the time limit is exactly 60s, it always times out in that setting; otherwise, it almost never times out. This gives a qualitative estimate of the efficiency and tightness of the different approaches.
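For illustration, here is a minimal Python sketch of that heuristic applied to per-setting average running times. The settings and numbers below are hypothetical, not actual benchmark results:

```python
# Heuristic: if a setting's average running time equals the 60 s limit,
# the approach timed out on every instance in that setting.
TIME_LIMIT = 60.0  # seconds

avg_times = {
    # (network, training method, epsilon) -> avg running time in seconds
    # (made-up values for illustration only)
    ("CNN-7layer", "CROWN-IBP", 0.1): 60.00,
    ("CNN-7layer", "adversarial", 0.1): 12.37,
}

for setting, avg in avg_times.items():
    status = "always times out" if avg >= TIME_LIMIT else "terminates"
    print(f"{setting}: avg {avg:.2f}s -> {status}")
```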
To regenerate the tables with a clear distinction between "timeout" and "loose bound", you can also write a script following https://github.com/sokcertifiedrobustness/certified-robustness-benchmark/blob/master/experiments/data_analyzer.py to analyze the raw evaluation data stored at https://github.com/sokcertifiedrobustness/certified-robustness-benchmark/tree/master/experiments/data_old. Thanks!
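As a starting point, a sketch of such a script is below. The per-instance record format (a JSON list with `time` and `verified` fields) is an assumption made for illustration; please check data_analyzer.py for the actual schema used in experiments/data_old:

```python
import json
from pathlib import Path

TIME_LIMIT = 60.0  # the benchmark's per-instance time limit in seconds

def breakdown(result_file: str) -> None:
    # Assumed format: a JSON list with one record per test instance,
    # each carrying a running time and a verification outcome.
    records = json.loads(Path(result_file).read_text())
    verified = timeout = loose = 0
    for rec in records:
        if rec["verified"]:
            verified += 1
        elif rec["time"] >= TIME_LIMIT:
            timeout += 1  # non-robust only because the limit was hit
        else:
            loose += 1    # terminated but could not certify: loose bound
    total = len(records)
    print(f"verified {verified / total:.1%}, "
          f"timeout {timeout / total:.1%}, "
          f"loose bound {loose / total:.1%}")

if __name__ == "__main__":
    breakdown("results.json")  # hypothetical file name
```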
Best,
Linyi