naszilla / tabzilla


reproducing the Table 5 result of the paper

abvesa opened this issue · comments

Hi, I'm the author of the issue: "Is the average ranking meaningful since each algorithm is tested on a different number of datasets?"

First, thanks for the reply, and sorry for not mentioning that the question is about the paper.
I'm now trying to reproduce the Table 5 results in the paper, using the metadataset_clean and metafeature_clean results downloaded from Google Drive and the provided scripts 1-aggregate-results and 2-performance-rankings.

Since Table 5 focuses on only the 36 Tabular Benchmark Suite datasets, I subset agg_df_with_default and agg_df to the datasets listed in /scripts/HARD_DATASETS_BENCHMARK.sh before calculating ranks and saving the results.
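For reference, the subsetting step I mean looks roughly like this (a minimal sketch with made-up dataset and column names; in practice the benchmark names would be parsed out of /scripts/HARD_DATASETS_BENCHMARK.sh):

```python
import pandas as pd

# Hypothetical mini version of agg_df with one row per (algorithm, dataset).
agg_df = pd.DataFrame({
    "dataset_name": ["benchmark_ds_1", "benchmark_ds_2", "other_ds"],
    "alg_name": ["CatBoost", "SAINT", "NODE"],
    "Log Loss__test_mean": [0.30, 0.55, 0.41],
})

# The 36 Tabular Benchmark Suite dataset names (two shown for illustration).
benchmark_datasets = {"benchmark_ds_1", "benchmark_ds_2"}

# Keep only rows for datasets in the benchmark suite, before ranking.
subset_df = agg_df[agg_df["dataset_name"].isin(benchmark_datasets)]
print(len(subset_df))  # 2
```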

I added a column called dataset_count to see how many datasets were used for each algorithm when calculating its statistics across all results; below is the result I got. Some of the numbers differ from the paper and some do not. More importantly, CatBoost, SAINT, and NODE have exactly the same time/1000 inst. and nearly the same logloss mean and logloss std as the paper, yet the results of these three algorithms appear to be calculated over different numbers of datasets.

I'm wondering whether I'm using the code incorrectly. Could you provide some advice on how to fully reproduce the results of Table 5? Thank you!

[Screenshot 2023-12-23 103559]
[Screenshot 2023-12-23 103700]

================================================================================
I first added a column called dataset_count and modified the get_rank_table function to calculate the total dataset_count by adding a single line:
[Screenshot 2023-12-23 110030]
[Screenshot 2023-12-23 105136]
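In text form, the extra line is essentially a per-algorithm count of distinct datasets (a sketch with invented column names; the actual get_rank_table uses the repo's own column names):

```python
import pandas as pd

# Hypothetical per-(algorithm, dataset) results, mimicking agg_df.
agg_df = pd.DataFrame({
    "alg_name": ["CatBoost", "CatBoost", "SAINT", "NODE", "NODE"],
    "dataset_name": ["ds1", "ds2", "ds1", "ds1", "ds3"],
    "Log Loss__test_mean": [0.3, 0.4, 0.5, 0.6, 0.7],
})

# One extra aggregation line: how many distinct datasets each algorithm
# contributes results for.
dataset_count = agg_df.groupby("alg_name")["dataset_name"].nunique()
print(dataset_count.to_dict())
# {'CatBoost': 2, 'NODE': 2, 'SAINT': 1}
```

If the counts differ across algorithms, their averaged statistics are computed over different dataset sets, which is exactly the discrepancy reported above.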

May I ask how did you get 1-aggregate-results running?

A key file named metadataset.csv is missing for me. I then tried to generate the file using tabzilla/TabZilla/tabzilla_results_aggregator.py, but I ran into a permission issue from Google Cloud:

```
google.api_core.exceptions.Forbidden: 403 GET https://storage.googleapis.com/storage/v1/b/tabzilla-results/o?projection=noAcl&prefix=results&prettyPrint=false: myemail@gmail.com does not have storage.objects.list access to the Google Cloud Storage bucket. Permission 'storage.objects.list' denied on resource (or it may not exist).
```

The Google Drive link to the results is provided in this Jupyter notebook.

Also, you may need to rename the column `training_time` to `time__train` to fit the code.
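The rename is a one-liner in pandas (sketch with a made-up frame; only the two column names come from the thread):

```python
import pandas as pd

# Hypothetical slice of the downloaded results.
df = pd.DataFrame({
    "alg_name": ["CatBoost", "SAINT"],
    "training_time": [12.3, 45.6],
})

# Rename so the downstream ranking scripts find the column they expect.
df = df.rename(columns={"training_time": "time__train"})
print(list(df.columns))  # ['alg_name', 'time__train']
```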

This looks like a serious issue.
How can the benchmark results be scientifically solid if each model was evaluated on a different number of datasets? :/

Hello, thank you for pointing this out, we very much appreciate the feedback! We are updating the paper to fix this. Note that all other tables and figures in the main text that directly compare algorithms against each other (e.g., Tables 1 and 2, Figures 2 and 3) use the same number of datasets. For completeness, we had also included some tables in the appendix where algorithms didn't have the same number of datasets; in that case we gave a caveat about it (Section D.2.1 on page 24).
Thank you for this discussion!

@duncanmcelfresh can you say which datasets are used for Tables 1 and 2? I would like to reproduce these tables. It looks like the data you made available only contains TabPFN results for 63 datasets, so I'm not sure whether you have updated results, or whether you used a different number of datasets for it.