Which datasets are used for main paper (98 datasets) and "small data" (57 datasets) #100

amueller · 2024-03-27T00:52:55Z

Hi.
I'm trying to compare to some of the results in your work, but it's not clear to me which datasets were used for Table 1 and Table 2.
The Datasets A file contains 108 datasets, and the Datasets B file contains 69 datasets, so I'm not sure which ones the 98 are.
Really I care more about the 57 small datasets, but cutting off at those with 1250 or fewer instances doesn't yield 57 for either A or B or the combination.

amueller · 2024-03-27T00:57:11Z

The "easy_import" list seems to contain 175 classification tasks, 69 of which have fewer than 1250 instances.

crwhite14 · 2024-03-29T21:09:52Z

Hi Andreas, below are the 98 datasets from Table 1 and the 57 datasets from Table 2. Please let us know if you have more questions.

datasets_table_1 = ['openml__visualizing_environmental__3602', 'openml__labor__4', 'openml__monks-problems-2__146065', 'openml__tic-tac-toe__49', 'openml__dermatology__35', 'openml__cardiotocography__9979', 'openml__lung-cancer__146024', 'openml__sonar__39', 'openml__anneal__2867', 'openml__analcatdata_chlamydia__3739', 'openml__iris__59', 'openml__irish__3543', 'openml__heart-c__48', 'openml__ionosphere__145984', 'openml__hayes-roth__146063', 'openml__fri_c3_100_5__3779', 'openml__fri_c0_100_5__3620', 'openml__analcatdata_authorship__3549', 'openml__rabe_266__3647', 'openml__balance-scale__11', 'openml__acute-inflammations__10089', 'openml__MiceProtein__146800', 'openml__banknote-authentication__10093', 'openml__mushroom__24', 'openml__kr-vs-kp__3', 'openml__analcatdata_boxing1__3540', 'openml__musk__3950', 'openml__transplant__3748', 'openml__cjs__14967', 'openml__synthetic_control__3512', 'openml__car-evaluation__146192', 'openml__fertility__9984', 'openml__postoperative-patient-data__146210', 'openml__breast-w__15', 'openml__wdbc__9946', 'openml__car__146821', 'openml__visualizing_livestock__3731', 'openml__mfeat-factors__12', 'openml__Satellite__167211', 'openml__colic__25', 'openml__lymph__10', 'openml__wall-robot-navigation__9960', 'openml__wilt__146820', 'openml__scene__3485', 'openml__mfeat-karhunen__16', 'openml__sick__3021', 'openml__dna__167140', 'openml__socmob__3797', 'openml__page-blocks__30', 'openml__PhishingWebsites__14952', 'openml__spambase__43', 'openml__splice__45', 'openml__churn__167141', 'openml__colic__27', 'openml__ecoli__145977', 'openml__semeion__9964', 'openml__ozone-level-8hr__9978', 'openml__heart-h__50', 'openml__pc1__3918', 'openml__qsar-biodeg__9957', 'openml__autos__9', 'openml__pc4__3902', 'openml__hill-valley__145847', 'openml__satimage__2074', 'openml__pc3__3903', 'openml__mfeat-fourier__14', 'openml__Australian__146818', 'openml__credit-approval__29', 'openml__cylinder-bands__14954', 'openml__mfeat-zernike__22', 
'openml__kc2__3913', 'openml__bank-marketing__14965', 'openml__phoneme__9952', 'openml__elevators__3711', 'openml__breast-cancer__145799', 'openml__SpeedDating__146607', 'openml__kc1__3917', 'openml__adult-census__3953', 'openml__ilpd__9971', 'openml__vehicle__53', 'openml__ada_agnostic__3896', 'openml__tae__47', 'openml__blood-transfusion-service-center__10101', 'openml__jasmine__168911', 'openml__LED-display-domain-7digit__125921', 'openml__diabetes__37', 'openml__Click_prediction_small__190408', 'openml__profb__3561', 'openml__steel-plates-fault__146817', 'openml__jm1__3904', 'openml__glass__40', 'openml__dresses-sales__125920', 'openml__mfeat-morphological__18', 'openml__eucalyptus__2079', 'openml__libras__360948', 'openml__yeast__145793', 'openml__cmc__23', 'openml__analcatdata_dmft__3560']

datasets_table_2 = ["openml__Australian__146818", "openml__LED-display-domain-7digit__125921", "openml__MiceProtein__146800", "openml__acute-inflammations__10089", "openml__analcatdata_authorship__3549", "openml__analcatdata_boxing1__3540", "openml__analcatdata_chlamydia__3739", "openml__analcatdata_dmft__3560", "openml__anneal__2867", "openml__autos__9", "openml__balance-scale__11", "openml__blood-transfusion-service-center__10101", "openml__blood-transfusion-service-center__145836", "openml__breast-cancer__145799", "openml__breast-w__15", "openml__colic__25", "openml__colic__27", "openml__credit-approval__29", "openml__cylinder-bands__14954", "openml__dermatology__35", "openml__diabetes__37", "openml__dresses-sales__125920", "openml__ecoli__145977", "openml__eucalyptus__2079", "openml__fertility__9984", "openml__fri_c0_100_5__3620", "openml__fri_c3_100_5__3779", "openml__glass__40", "openml__hayes-roth__146063", "openml__heart-c__48", "openml__heart-h__50", "openml__hill-valley__145847", "openml__ilpd__9971", "openml__ionosphere__145984", "openml__iris__59", "openml__irish__3543", "openml__kc2__3913", "openml__labor__4", "openml__lung-cancer__146024", "openml__lymph__10", "openml__monks-problems-2__146065", "openml__pc1__3918", "openml__postoperative-patient-data__146210", "openml__profb__3561", "openml__qsar-biodeg__9957", "openml__rabe_266__3647", "openml__socmob__3797", "openml__sonar__39", "openml__synthetic_control__3512", "openml__tae__47", "openml__tic-tac-toe__49", "openml__transplant__3748", "openml__vehicle__53", "openml__visualizing_environmental__3602", "openml__visualizing_livestock__3731", "openml__wdbc__9946", "openml__yeast__145793"]
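For anyone checking these lists before reuse, here is a minimal sketch that flags dataset names appearing under more than one task ID. It assumes every entry follows the `openml__<dataset name>__<task id>` pattern seen above; `parse_entry`, `repeated_names`, and `sample` are hypothetical names introduced for illustration, and `sample` is a small excerpt of `datasets_table_2`, not the full list.

```python
from collections import defaultdict


def parse_entry(entry):
    """Split 'openml__<dataset name>__<task id>' into (name, task_id).

    Assumption: the source prefix and the task ID are each separated from
    the dataset name by a double underscore, as in the lists above.
    """
    _source, rest = entry.split("__", 1)
    name, task_id = rest.rsplit("__", 1)
    return name, int(task_id)


def repeated_names(entries):
    """Return {dataset name: [task ids]} for names that occur more than once."""
    tasks_by_name = defaultdict(list)
    for entry in entries:
        name, task_id = parse_entry(entry)
        tasks_by_name[name].append(task_id)
    return {name: ids for name, ids in tasks_by_name.items() if len(ids) > 1}


# Small excerpt of datasets_table_2, used here only as an example input.
sample = [
    "openml__blood-transfusion-service-center__10101",
    "openml__blood-transfusion-service-center__145836",
    "openml__colic__25",
    "openml__colic__27",
    "openml__iris__59",
]

print(repeated_names(sample))
```

A repeated name is only a candidate for a closer look: two tasks with the same name may or may not share the same underlying data, so each hit should be checked on OpenML by hand.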

LennartPurucker · 2024-06-04T09:02:55Z

I just saw this issue. Are you aware that the datasets for Table 2 have a duplicate?
"openml__blood-transfusion-service-center__10101", "openml__blood-transfusion-service-center__145836"?

duncanmcelfresh · 2024-09-23T13:54:41Z

@LennartPurucker thanks for pointing this out - cc @crwhite14. We could remove the duplicate dataset from the results that include it.

It looks like we accidentally pulled two different OpenML tasks (https://openml.org/search?type=task&id=145836 and https://openml.org/search?type=task&id=10101) which appear to be identical, because they are based on the same dataset (https://openml.org/search?type=data&id=1464).
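One way to drop such duplicates is to keep only the first task seen for each underlying dataset ID. The sketch below is an illustration under stated assumptions, not the procedure used for the paper: `drop_duplicate_tasks` and `task_to_dataset` are hypothetical names, the mapping contains only the two task IDs and the dataset ID mentioned in this thread plus one made-up placeholder entry, and in practice the mapping would have to be looked up on OpenML.

```python
def drop_duplicate_tasks(entries, dataset_id_of):
    """Keep the first entry for each underlying dataset, preserving order.

    entries: strings of the form 'openml__<dataset name>__<task id>'.
    dataset_id_of: callable mapping a task ID to its dataset ID.
    """
    seen_datasets = set()
    kept = []
    for entry in entries:
        task_id = int(entry.rsplit("__", 1)[1])
        dataset_id = dataset_id_of(task_id)
        if dataset_id in seen_datasets:
            continue  # another task on the same dataset was already kept
        seen_datasets.add(dataset_id)
        kept.append(entry)
    return kept


# Tasks 10101 and 145836 both point at dataset 1464 (from this thread).
# The last entry is a made-up placeholder, not a real OpenML task.
task_to_dataset = {10101: 1464, 145836: 1464, 1: 999}

entries = [
    "openml__blood-transfusion-service-center__10101",
    "openml__blood-transfusion-service-center__145836",
    "openml__hypothetical-example__1",
]

deduplicated = drop_duplicate_tasks(entries, task_to_dataset.__getitem__)
print(deduplicated)
```

Passing the lookup as a callable keeps the function independent of where the task-to-dataset mapping comes from, so a cached table or a live OpenML query could be swapped in without changing the filtering logic.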
