reproducing the Table 5 result of the paper #96
May I ask how you got a key file named
The Google Drive link to the results is provided in this Jupyter notebook. Also, you may need to rename the column
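For example, a column rename in pandas (the names here are placeholders, since the exact columns depend on your download):

```python
import pandas as pd

# Placeholder names: rename a downloaded results column to
# whatever name the notebook expects.
df = pd.DataFrame({"old_column": [0.1, 0.2]})
df = df.rename(columns={"old_column": "expected_column"})
```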
This looks like a serious issue.
Hello, thank you for pointing this out; we very much appreciate the feedback! We are updating the paper to fix this. Note that all other tables and figures in the main text that directly compare algorithms against each other (e.g., Tables 1 and 2, Figures 2 and 3) use the same number of datasets. For completeness, we also included some tables in the appendix where algorithms didn't have the same number of datasets, and in that case we gave a caveat about it (Section D.2.1 on page 24).
@duncanmcelfresh, can you say which datasets are used for Tables 1 and 2? I would like to reproduce these tables. It looks like the data you made available only contains TabPFN results for 63 datasets, so I'm not sure whether you have updated results or whether you used a different number of datasets for it.
Hi, I'm the author of the issue: Is the average ranking meaningful, given that each algorithm is tested on a different number of datasets?
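For context, by "average ranking" I mean ranking the algorithms within each dataset and then averaging per algorithm; a minimal pandas sketch (the column names here are my assumptions, not necessarily the repo's actual schema):

```python
import pandas as pd

# Toy results: one row per (algorithm, dataset) pair; "node" is
# missing from dataset d2, mimicking unequal dataset coverage.
df = pd.DataFrame({
    "alg_name": ["catboost", "saint", "node", "catboost", "saint"],
    "dataset_name": ["d1", "d1", "d1", "d2", "d2"],
    "logloss": [0.30, 0.35, 0.40, 0.25, 0.28],
})

# Rank within each dataset (lower logloss -> rank 1), then average
# per algorithm. An algorithm evaluated on fewer datasets gets an
# average over a different set of datasets, which is exactly the
# comparability concern raised here.
df["rank"] = df.groupby("dataset_name")["logloss"].rank()
print(df.groupby("alg_name")["rank"].mean())
```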
First, thanks for the reply, and sorry for not mentioning that the question is about the paper.
I'm now trying to reproduce the Table 5 result in the paper, with the results `metadataset_clean` and `metafeature_clean` downloaded from Google Drive and the provided scripts `1-aggregate-results` and `2-performance-rankings`. Since Table 5 focuses on only the 36 Tabular Benchmark Suite datasets, I then subset `agg_df_with_default` and `agg_df` using the datasets mentioned in `/scripts/HARD_DATASETS_BENCHMARK.sh`, before calculating ranks and saving the result. I added a column called `dataset_count` to see how many datasets were used for each algorithm when calculating its statistics across all results; below is the result I got. We can see that some of the numbers differ from the paper and some do not. More importantly, `catboost`, `saint`, and `node` have exactly the same `time/1000 inst.` and nearly the same `logloss mean` and `logloss std` as the paper, yet the results of these three algorithms appear to be calculated over different numbers of datasets. I'm wondering whether I am using the code wrong; can you provide some advice on how to fully reproduce the results of Table 5? Thank you!
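For reference, a minimal sketch of the subsetting and counting steps above, assuming `agg_df` is a pandas DataFrame with `alg_name` and `dataset_name` columns (the real schema may differ) and that `hard_datasets` holds the 36 names from `/scripts/HARD_DATASETS_BENCHMARK.sh`:

```python
import pandas as pd

# Stub standing in for the 36 dataset names listed in
# /scripts/HARD_DATASETS_BENCHMARK.sh.
hard_datasets = {"dataset_a", "dataset_b"}

# Toy aggregate-results table; the real agg_df has more columns.
agg_df = pd.DataFrame({
    "alg_name": ["catboost", "catboost", "saint"],
    "dataset_name": ["dataset_a", "dataset_c", "dataset_a"],
    "logloss": [0.30, 0.40, 0.35],
})

# Keep only rows for the benchmark suite datasets.
agg_df = agg_df[agg_df["dataset_name"].isin(hard_datasets)]

# dataset_count: how many datasets back each algorithm's statistics.
dataset_count = agg_df.groupby("alg_name")["dataset_name"].nunique()
print(dataset_count)
```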
================================================================================
I first added a column called `dataset_count` and modified the `get_rank_table` function to calculate the total `dataset_count` by adding a simple line:
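Roughly the kind of change I mean, sketched with illustrative names (the actual internals of `get_rank_table` may differ), assuming the ranks come from a per-(algorithm, dataset) DataFrame:

```python
import pandas as pd

def get_rank_table_with_counts(rank_df: pd.DataFrame) -> pd.DataFrame:
    # Aggregate per-dataset ranks per algorithm, plus the extra
    # "simple line": count the distinct datasets per algorithm.
    return rank_df.groupby("alg_name").agg(
        rank_mean=("rank", "mean"),
        rank_std=("rank", "std"),
        dataset_count=("dataset_name", "nunique"),
    )
```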