Performance evaluation may hint at bug?
gdalle opened this issue
In the paper, you state:

> For the multiclass Iris classification and the Boston Housing regression datasets, the performance was worse than the other models. It could be that this is caused by a bug in the implementation or because this is a fundamental issue in the algorithm.
Isn't it possible to compare with the numerical results of the original paper to validate the implementation?
See also #50
> Isn't it possible to compare with the numerical results of the original paper to validate the implementation?
Sorry for taking so long to respond. It's a good point that I hadn't considered, and it took me a while to find time to dig into it.
First, I'll summarize the results here from what we've reported in our paper and what the original paper reported for the Diabetes, Haberman, and Titanic datasets:
Dataset | SIRUS (AUC) | SIRUS.jl (AUC) | Difference |
---|---|---|---|
Diabetes | 1 - 0.19 = 0.81 | 0.75 ± 0.05 | -7% |
Haberman | 1 - 0.35 = 0.65 | 0.67 ± 0.06 | 5% |
Titanic | 1 - 0.17 = 0.83 | 0.83 ± 0.02 | 0% |
Here, the SIRUS scores come from the Project Euclid paper by Benard et al. (2021, Table 4); I've converted the 1 - AUC scores from that paper back to AUC scores. The SIRUS.jl scores come, again, from the CI run for version 1.3.2 with max_rules = 10.
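For anyone who wants to reproduce the SIRUS.jl side of the table locally, below is a minimal sketch of such a cross-validated AUC run. It assumes the MLJ interface of SIRUS.jl (`StableRulesClassifier` with `max_rules` and `rng`) and MLJ's `auc` measure; `load_haberman` is only a placeholder for however the data is loaded in the CI benchmark, and the resampling setup there may differ.

```julia
using MLJ, SIRUS, StableRNGs

# Placeholder: load the Haberman data however the CI benchmark does
# (e.g. via CSV.jl or OpenML); `load_haberman` is not a real function here.
X, y = load_haberman()

# max_rules = 10 matches the setting mentioned above; other hyperparameters
# are left at their defaults.
model = StableRulesClassifier(; max_rules=10, rng=StableRNG(1))

# 10-fold cross-validated AUC; the exact resampling in the CI run may differ.
result = evaluate(model, X, y;
    resampling=CV(; nfolds=10, shuffle=true, rng=StableRNG(1)),
    measure=auc)

println(result)
```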
Given that the scores are reasonably similar while the cross-validation splits differ, I see no reason to believe that SIRUS.jl performs worse or better than SIRUS on these three classification datasets.
For regression, I'll try to compare against the datasets from their PMLR paper at http://proceedings.mlr.press/v130/benard21a, where they report unexplained variance in Table 3.
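To make that comparison easier: if unexplained variance is defined as the mean squared error divided by the variance of the targets (equivalently 1 - R²), which is how I read their Table 3, then converting between the two metrics is a one-liner. A small sketch, with `y` and `y_pred` as placeholders:

```julia
using Statistics

# Assumption: "unexplained variance" = MSE / Var(y) = 1 - R²
# (my reading of Table 3, not something stated explicitly here).
unexplained_variance(y, y_pred) =
    mean(abs2, y .- y_pred) / var(y; corrected=false)

# Convert a reported unexplained variance back to R² for comparison.
r_squared_from_unexplained(unexplained) = 1 - unexplained
```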
I'll get to this at the end of this week or at the beginning of next week. Apologies for the delay; PhD-related things keep popping up.