Which datasets are used for main paper (98 datasets) and "small data" (57 datasets) #100

Open
amueller opened this issue Mar 27, 2024 · 4 comments

Comments

@amueller

Hi.
I'm trying to compare to some of the results in your work, but it's not clear to me which datasets were used for Table 1 and Table 2.
The Datasets A file contains 108 datasets, and the Datasets B file contains 69 datasets, so I'm not sure which the 98 ones are.
Really, I care more about the 57 small datasets, but cutting off at those with 1250 or fewer instances doesn't yield 57 for A, B, or the combination of the two.

@amueller
Author

The "easy_import" list seems to contain 175 classification tasks, 69 of which have fewer than 1250 instances.

@crwhite14
Member

Hi Andreas, below are the 98 datasets from Table 1 and the 57 datasets from Table 2. Please let us know if you have more questions.

datasets_table_1 = ['openml__visualizing_environmental__3602', 'openml__labor__4', 'openml__monks-problems-2__146065', 'openml__tic-tac-toe__49', 'openml__dermatology__35', 'openml__cardiotocography__9979', 'openml__lung-cancer__146024', 'openml__sonar__39', 'openml__anneal__2867', 'openml__analcatdata_chlamydia__3739', 'openml__iris__59', 'openml__irish__3543', 'openml__heart-c__48', 'openml__ionosphere__145984', 'openml__hayes-roth__146063', 'openml__fri_c3_100_5__3779', 'openml__fri_c0_100_5__3620', 'openml__analcatdata_authorship__3549', 'openml__rabe_266__3647', 'openml__balance-scale__11', 'openml__acute-inflammations__10089', 'openml__MiceProtein__146800', 'openml__banknote-authentication__10093', 'openml__mushroom__24', 'openml__kr-vs-kp__3', 'openml__analcatdata_boxing1__3540', 'openml__musk__3950', 'openml__transplant__3748', 'openml__cjs__14967', 'openml__synthetic_control__3512', 'openml__car-evaluation__146192', 'openml__fertility__9984', 'openml__postoperative-patient-data__146210', 'openml__breast-w__15', 'openml__wdbc__9946', 'openml__car__146821', 'openml__visualizing_livestock__3731', 'openml__mfeat-factors__12', 'openml__Satellite__167211', 'openml__colic__25', 'openml__lymph__10', 'openml__wall-robot-navigation__9960', 'openml__wilt__146820', 'openml__scene__3485', 'openml__mfeat-karhunen__16', 'openml__sick__3021', 'openml__dna__167140', 'openml__socmob__3797', 'openml__page-blocks__30', 'openml__PhishingWebsites__14952', 'openml__spambase__43', 'openml__splice__45', 'openml__churn__167141', 'openml__colic__27', 'openml__ecoli__145977', 'openml__semeion__9964', 'openml__ozone-level-8hr__9978', 'openml__heart-h__50', 'openml__pc1__3918', 'openml__qsar-biodeg__9957', 'openml__autos__9', 'openml__pc4__3902', 'openml__hill-valley__145847', 'openml__satimage__2074', 'openml__pc3__3903', 'openml__mfeat-fourier__14', 'openml__Australian__146818', 'openml__credit-approval__29', 'openml__cylinder-bands__14954', 'openml__mfeat-zernike__22', 
'openml__kc2__3913', 'openml__bank-marketing__14965', 'openml__phoneme__9952', 'openml__elevators__3711', 'openml__breast-cancer__145799', 'openml__SpeedDating__146607', 'openml__kc1__3917', 'openml__adult-census__3953', 'openml__ilpd__9971', 'openml__vehicle__53', 'openml__ada_agnostic__3896', 'openml__tae__47', 'openml__blood-transfusion-service-center__10101', 'openml__jasmine__168911', 'openml__LED-display-domain-7digit__125921', 'openml__diabetes__37', 'openml__Click_prediction_small__190408', 'openml__profb__3561', 'openml__steel-plates-fault__146817', 'openml__jm1__3904', 'openml__glass__40', 'openml__dresses-sales__125920', 'openml__mfeat-morphological__18', 'openml__eucalyptus__2079', 'openml__libras__360948', 'openml__yeast__145793', 'openml__cmc__23', 'openml__analcatdata_dmft__3560']

datasets_table_2 = ["openml__Australian__146818", "openml__LED-display-domain-7digit__125921", "openml__MiceProtein__146800", "openml__acute-inflammations__10089", "openml__analcatdata_authorship__3549", "openml__analcatdata_boxing1__3540", "openml__analcatdata_chlamydia__3739", "openml__analcatdata_dmft__3560", "openml__anneal__2867", "openml__autos__9", "openml__balance-scale__11", "openml__blood-transfusion-service-center__10101", "openml__blood-transfusion-service-center__145836", "openml__breast-cancer__145799", "openml__breast-w__15", "openml__colic__25", "openml__colic__27", "openml__credit-approval__29", "openml__cylinder-bands__14954", "openml__dermatology__35", "openml__diabetes__37", "openml__dresses-sales__125920", "openml__ecoli__145977", "openml__eucalyptus__2079", "openml__fertility__9984", "openml__fri_c0_100_5__3620", "openml__fri_c3_100_5__3779", "openml__glass__40", "openml__hayes-roth__146063", "openml__heart-c__48", "openml__heart-h__50", "openml__hill-valley__145847", "openml__ilpd__9971", "openml__ionosphere__145984", "openml__iris__59", "openml__irish__3543", "openml__kc2__3913", "openml__labor__4", "openml__lung-cancer__146024", "openml__lymph__10", "openml__monks-problems-2__146065", "openml__pc1__3918", "openml__postoperative-patient-data__146210", "openml__profb__3561", "openml__qsar-biodeg__9957", "openml__rabe_266__3647", "openml__socmob__3797", "openml__sonar__39", "openml__synthetic_control__3512", "openml__tae__47", "openml__tic-tac-toe__49", "openml__transplant__3748", "openml__vehicle__53", "openml__visualizing_environmental__3602", "openml__visualizing_livestock__3731", "openml__wdbc__9946", "openml__yeast__145793"]

@LennartPurucker

I just saw this issue. Are you aware that the datasets for Table 2 contain a duplicate?
"openml__blood-transfusion-service-center__10101" and "openml__blood-transfusion-service-center__145836" appear to be the same dataset.

@duncanmcelfresh
Collaborator

@LennartPurucker thanks for pointing this out (cc @crwhite14). We could remove the duplicate dataset from the results that include it.

It looks like we accidentally pulled two different OpenML tasks (https://openml.org/search?type=task&id=145836 and https://openml.org/search?type=task&id=10101) which appear to be identical, because they are based on the same dataset (https://openml.org/search?type=data&id=1464).
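
For anyone who wants to screen the lists above for similar cases, here is a minimal sketch. It assumes every identifier follows the `openml__<name>__<task_id>` pattern seen in this thread and groups task IDs by dataset name. The helper names `parse_identifier` and `find_repeated_names` are hypothetical and not part of this repository. A repeated name is only a hint: confirming a true duplicate still requires checking that both tasks point to the same underlying OpenML dataset, as was done by hand for dataset 1464.

```python
from collections import defaultdict


def parse_identifier(identifier):
    """Split 'openml__<name>__<task_id>' into (name, task_id).

    Assumes the naming pattern used in the lists in this thread.
    """
    _source, rest = identifier.split("__", 1)
    name, task_id = rest.rsplit("__", 1)
    return name, int(task_id)


def find_repeated_names(identifiers):
    """Return {name: sorted task IDs} for names that occur more than once."""
    by_name = defaultdict(list)
    for identifier in identifiers:
        name, task_id = parse_identifier(identifier)
        by_name[name].append(task_id)
    return {name: sorted(ids) for name, ids in by_name.items() if len(ids) > 1}


# A small sample taken from datasets_table_2 above.
sample = [
    "openml__blood-transfusion-service-center__10101",
    "openml__blood-transfusion-service-center__145836",
    "openml__colic__25",
    "openml__colic__27",
    "openml__iris__59",
]

repeated = find_repeated_names(sample)
for name, task_ids in repeated.items():
    print(f"{name}: tasks {task_ids} share a name; check their dataset IDs")
```

Note that `colic` is flagged as well. This sketch cannot tell whether those two tasks share a dataset, so each flagged name should be checked on OpenML before anything is removed from the results.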
