reproducing the Table 5 result of the paper #96

Open
abvesa opened this issue Dec 23, 2023 · 5 comments

@abvesa

abvesa commented Dec 23, 2023

Hi, I'm the author of the issue "Is the average ranking meaningful since each algorithm is test on different number of datasets?"

First, thanks for the reply, and sorry for not mentioning that the question is about the paper.
I'm now trying to reproduce the Table 5 results in the paper, using the metadataset_clean and metafeature_clean results downloaded from Google Drive and the provided scripts 1-aggregate-results and 2-performance-rankings.

Since Table 5 focuses on only the 36 Tabular Benchmark Suite datasets, I subset agg_df_with_default and agg_df to the datasets listed in /scripts/HARD_DATASETS_BENCHMARK.sh before calculating the ranks and saving the results.
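For anyone following along, the subsetting step can be sketched roughly like this. This is a minimal illustration, not the repository's code: the column names `dataset_name`, `alg_name` and `logloss`, the toy rows, and the helper `subset_to_benchmark` are all assumptions, and the real dataset list has to be copied from HARD_DATASETS_BENCHMARK.sh.

```python
import pandas as pd

# Hypothetical stand-in for agg_df; the real frame comes from 1-aggregate-results.
agg_df = pd.DataFrame(
    {
        "dataset_name": ["d1", "d1", "d2", "d3"],  # assumed column name
        "alg_name": ["catboost", "saint", "catboost", "saint"],  # assumed column name
        "logloss": [0.30, 0.35, 0.50, 0.45],  # made-up values
    }
)

# Hypothetical benchmark list; in practice, take the names from
# scripts/HARD_DATASETS_BENCHMARK.sh.
benchmark_datasets = ["d1", "d2"]


def subset_to_benchmark(df, datasets, dataset_col="dataset_name"):
    """Keep only the rows whose dataset is in the benchmark list."""
    return df[df[dataset_col].isin(datasets)].reset_index(drop=True)


subset_df = subset_to_benchmark(agg_df, benchmark_datasets)

# Benchmark datasets with no rows at all in the results (worth checking
# before trusting any aggregate statistics).
missing = sorted(set(benchmark_datasets) - set(subset_df["dataset_name"]))
```

Checking `missing` first makes it obvious when a benchmark dataset has no results at all, which is easy to overlook once everything is averaged.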

I added a column called dataset_count to see how many datasets were used when calculating each algorithm's statistics across all results; the result I got is shown below. Some of the numbers differ from the paper and some do not. More importantly, catboost, saint and node have exactly the same time/1000 inst. and nearly the same logloss mean and logloss std as in the paper, yet the results for these three algorithms appear to be calculated using different numbers of datasets.

I'm wondering whether I'm using the code incorrectly. Could you provide some advice on how to fully reproduce the results of Table 5? Thank you!

[Screenshot 2023-12-23 103559]
[Screenshot 2023-12-23 103700]

================================================================================
I first added a column called dataset_count and modified the get_rank_table function to calculate the total dataset_count by adding a single line:
[Screenshot 2023-12-23 110030]
[Screenshot 2023-12-23 105136]
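Since the actual change only appears in the screenshots, here is a rough sketch of one way such a per-algorithm dataset count could be computed with pandas. It is not the line from get_rank_table: the helper `count_datasets_per_algorithm`, the column names `alg_name` and `dataset_name`, and the toy rows are assumptions for illustration.

```python
import pandas as pd

# Toy results table; column names and values are assumptions, not the real data.
results = pd.DataFrame(
    {
        "alg_name": ["catboost", "catboost", "catboost", "saint", "saint", "node"],
        "dataset_name": ["d1", "d2", "d3", "d1", "d2", "d1"],
    }
)


def count_datasets_per_algorithm(df, alg_col="alg_name", dataset_col="dataset_name"):
    """Return one row per algorithm with the number of distinct datasets it has results on."""
    counts = df.groupby(alg_col)[dataset_col].nunique().rename("dataset_count")
    return counts.reset_index()


counts = count_datasets_per_algorithm(results)
```

Using `nunique` rather than a plain row count avoids double counting when a dataset has several rows per algorithm (for example, one per fold or hyperparameter setting).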

@junweima

junweima commented Dec 23, 2023

May I ask how you got 1-aggregate-results running?

A key file named metadataset.csv is missing for me. I tried to generate the file using tabzilla/TabZilla/tabzilla_results_aggregator.py, but then ran into a permission issue from Google Cloud:

google.api_core.exceptions.Forbidden: 403 GET https://storage.googleapis.com/storage/v1/b/tabzilla-results/o?projection=noAcl&prefix=results&prettyPrint=false: [email protected] does not have storage.objects.list access to the Google Cloud Storage bucket. Permission 'storage.objects.list' denied on resource (or it may not exist).

@abvesa
Author

abvesa commented Dec 23, 2023

> May I ask how you got 1-aggregate-results running?
>
> A key file named metadataset.csv is missing for me. I tried to generate the file using tabzilla/TabZilla/tabzilla_results_aggregator.py, but then ran into a permission issue from Google Cloud:
>
> google.api_core.exceptions.Forbidden: 403 GET https://storage.googleapis.com/storage/v1/b/tabzilla-results/o?projection=noAcl&prefix=results&prettyPrint=false: [email protected] does not have storage.objects.list access to the Google Cloud Storage bucket. Permission 'storage.objects.list' denied on resource (or it may not exist).

The Google Drive link to the results is provided in this Jupyter notebook.

Also, you may need to rename the column training_time to time__train to match the code.
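The rename can be done with a small guard so it is safe to run more than once. This is only a sketch: the helper name `normalize_time_column` and the toy frame are made up, and only the two column names come from the comment above.

```python
import pandas as pd

# Toy frame standing in for the downloaded results; values are made up.
raw_df = pd.DataFrame({"alg_name": ["catboost"], "training_time": [1.5]})


def normalize_time_column(df):
    """Rename training_time to time__train if the new name is not already present."""
    if "time__train" not in df.columns and "training_time" in df.columns:
        return df.rename(columns={"training_time": "time__train"})
    return df


fixed_df = normalize_time_column(raw_df)
```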

@u10000129

This looks like a serious issue.
How can the benchmark results be scientifically sound if each model was evaluated on a different number of datasets? :/

@duncanmcelfresh
Collaborator

Hello, thank you for pointing this out, we very much appreciate the feedback! We are updating the paper to fix this. Note that all other tables and figures in the main text that directly compare algorithms against each other (e.g. Tables 1, 2, Figures 2, 3) use the same number of datasets. For completeness, we had also included some tables in the appendix where algorithms didn't have the same number of datasets, and in that case we gave a caveat about it (Section D.2.1 on page 24).
Thank you for this discussion!
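For readers who want to compare algorithms on equal footing, one option is to keep only the datasets on which every algorithm has a result before ranking. The sketch below illustrates that idea; it is not necessarily how the paper's tables were produced, and the helper `restrict_to_common_datasets`, the column names, and the toy values are assumptions.

```python
import pandas as pd

# Toy results; algorithm A has one extra dataset (d3) that B and C lack.
results = pd.DataFrame(
    {
        "alg_name": ["A", "A", "A", "B", "B", "C", "C"],
        "dataset_name": ["d1", "d2", "d3", "d1", "d2", "d1", "d2"],
        "logloss": [0.30, 0.50, 0.20, 0.35, 0.45, 0.40, 0.55],
    }
)


def restrict_to_common_datasets(df, alg_col="alg_name", dataset_col="dataset_name"):
    """Keep only datasets on which every algorithm in df has at least one result."""
    n_algs = df[alg_col].nunique()
    algs_per_dataset = df.groupby(dataset_col)[alg_col].nunique()
    common = algs_per_dataset[algs_per_dataset == n_algs].index
    return df[df[dataset_col].isin(common)]


common_df = restrict_to_common_datasets(results)

# Rank algorithms within each dataset (lower log loss gets the better rank),
# then average the ranks per algorithm over the shared datasets only.
common_df = common_df.assign(
    rank=common_df.groupby("dataset_name")["logloss"].rank(method="min")
)
mean_rank = common_df.groupby("alg_name")["rank"].mean()
```

The trade-off is that a single algorithm with sparse coverage can shrink the shared set considerably, so it may be worth reporting how many datasets survive the filter alongside the ranks.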

@amueller

amueller commented Mar 28, 2024

@duncanmcelfresh can you say which datasets are used for tables 1 and 2? I would like to reproduce these tables. It looks like the data you made available only contains results for TabPFN for 63 datasets, so I'm not sure whether you have updated results, or if you used a different number of datasets for it?
