reproducing the Table 5 result of the paper #96
May I ask how you got a key file named
The Google Drive link to the results is provided in this Jupyter notebook. Also, you may need to rename the column
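For example, a column rename in pandas (the names here are placeholders, since the exact columns depend on your download):

```python
import pandas as pd

# Placeholder names: rename a downloaded results column to
# whatever name the notebook expects.
df = pd.DataFrame({"old_column": [0.1, 0.2]})
df = df.rename(columns={"old_column": "expected_column"})
```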
This looks like a serious issue.
Hello, thank you for pointing this out; we very much appreciate the feedback! We are updating the paper to fix this. Note that all other tables and figures in the main text that directly compare algorithms against each other (e.g., Tables 1 and 2, Figures 2 and 3) use the same number of datasets. For completeness, we also included some tables in the appendix where algorithms didn't have the same number of datasets, and in that case we gave a caveat about it (Section D.2.1 on page 24).
@duncanmcelfresh, can you say which datasets are used for Tables 1 and 2? I would like to reproduce these tables. It looks like the data you made available only contains TabPFN results for 63 datasets, so I'm not sure whether you have updated results or whether you used a different number of datasets for it.
Hi, I'm the author of the issue: Is the average ranking meaningful, given that each algorithm is tested on a different number of datasets?
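For context, by "average ranking" I mean ranking the algorithms within each dataset and then averaging per algorithm; a minimal pandas sketch (the column names here are my assumptions, not necessarily the repo's actual schema):

```python
import pandas as pd

# Toy results: one row per (algorithm, dataset) pair; "node" is
# missing from dataset d2, mimicking unequal dataset coverage.
df = pd.DataFrame({
    "alg_name": ["catboost", "saint", "node", "catboost", "saint"],
    "dataset_name": ["d1", "d1", "d1", "d2", "d2"],
    "logloss": [0.30, 0.35, 0.40, 0.25, 0.28],
})

# Rank within each dataset (lower logloss -> rank 1), then average
# per algorithm. An algorithm evaluated on fewer datasets gets an
# average over a different set of datasets, which is exactly the
# comparability concern raised here.
df["rank"] = df.groupby("dataset_name")["logloss"].rank()
print(df.groupby("alg_name")["rank"].mean())
```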
First, thanks for the reply, and sorry for not mentioning that the question is about the paper.
I'm now trying to reproduce the Table 5 result in the paper, with the results `metadataset_clean` and `metafeature_clean` downloaded from Google Drive and the provided scripts `1-aggregate-results` and `2-performance-rankings`. Since Table 5 focuses on only the 36 Tabular Benchmark Suite datasets, I then subset `agg_df_with_default` and `agg_df` using the datasets mentioned in `/scripts/HARD_DATASETS_BENCHMARK.sh`, before calculating ranks and saving the result. I added a column called `dataset_count` to see how many datasets were used for each algorithm when calculating its statistics across all results; below is the result I got. We can see that some of the numbers differ from the paper and some do not. More importantly, `catboost`, `saint`, and `node` have exactly the same `time/1000 inst.` and nearly the same `logloss mean` and `logloss std` as the paper, yet the results of these three algorithms appear to be calculated over different numbers of datasets. I'm wondering whether I am using the code wrong; can you provide some advice on how to fully reproduce the results of Table 5? Thank you!
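For reference, a minimal sketch of the subsetting and counting steps above, assuming `agg_df` is a pandas DataFrame with `alg_name` and `dataset_name` columns (the real schema may differ) and that `hard_datasets` holds the 36 names from `/scripts/HARD_DATASETS_BENCHMARK.sh`:

```python
import pandas as pd

# Stub standing in for the 36 dataset names listed in
# /scripts/HARD_DATASETS_BENCHMARK.sh.
hard_datasets = {"dataset_a", "dataset_b"}

# Toy aggregate-results table; the real agg_df has more columns.
agg_df = pd.DataFrame({
    "alg_name": ["catboost", "catboost", "saint"],
    "dataset_name": ["dataset_a", "dataset_c", "dataset_a"],
    "logloss": [0.30, 0.40, 0.35],
})

# Keep only rows for the benchmark suite datasets.
agg_df = agg_df[agg_df["dataset_name"].isin(hard_datasets)]

# dataset_count: how many datasets back each algorithm's statistics.
dataset_count = agg_df.groupby("alg_name")["dataset_name"].nunique()
print(dataset_count)
```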
================================================================================
I first added a column called `dataset_count` and modified the `get_rank_table` function to calculate the total `dataset_count` by adding a simple line:
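Roughly the kind of change I mean, sketched with illustrative names (the actual internals of `get_rank_table` may differ), assuming the ranks come from a per-(algorithm, dataset) DataFrame:

```python
import pandas as pd

def get_rank_table_with_counts(rank_df: pd.DataFrame) -> pd.DataFrame:
    # Aggregate per-dataset ranks per algorithm, plus the extra
    # "simple line": count the distinct datasets per algorithm.
    return rank_df.groupby("alg_name").agg(
        rank_mean=("rank", "mean"),
        rank_std=("rank", "std"),
        dataset_count=("dataset_name", "nunique"),
    )
```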