OpenML Benchmarking Suites and The OpenML100
Abstract
We advocate the use of curated, comprehensive benchmark suites of machine learning
datasets, backed by standardized OpenML-based interfaces and complementary software
toolkits written in Python, Java and R. Major distinguishing features of OpenML benchmark
suites are (a) ease of use through standardized data formats, APIs, and existing
client libraries; (b) machine-readable meta-information regarding the contents of the suite;
and (c) online sharing of results, enabling large scale comparisons. As a first such suite, we
propose the OpenML100, a machine learning benchmark suite of 100 classification datasets
carefully curated from the thousands of datasets available on OpenML.org.
Keywords: machine learning, benchmarking
Bischl et al.
our OpenML. KEEL (Alcala et al., 2010) offers several benchmark data suites, including one
for imbalanced classification and one of datasets with missing values; it provides a Java
toolkit and an R library for convenient access. Likewise, PMLB (Olson et al., 2017) is another
collection of datasets, strongly overlapping with UCI, with tools for importing them into
Python scripts. However, none of the above tools allows users to add new datasets or to easily
share and compare benchmarking results online.1 Other related benchmark collections include
UCR (Chen et al., 2015) for time series data, OpenAI Gym (Brockman et al., 2016) for
reinforcement learning problems, and Mulan (Tsoumakas et al., 2011) for multilabel datasets,
some of which are already available on OpenML (Probst et al., 2017).
All of these existing repositories are well curated, and for many years machine
learning researchers have benchmarked their algorithms on subsets of their datasets. However,
most of them provide no APIs for downloading data in standardized formats into
popular machine learning libraries, nor for uploading and comparing the ensuing results. Hence,
large-scale benchmarks that also build upon previous results of others remain the exception.
OpenML100
We selected the classification datasets in this benchmarking suite to satisfy the following
requirements: (a) the number of observations is between 500 and 100,000, to focus on
medium-sized datasets; (b) the number of features does not exceed 5,000, to keep
the runtime of algorithms low; (c) the target attribute has at least two classes; and (d) the
ratio of the minority class to the majority class exceeds 0.05, to exclude highly imbalanced
datasets. We excluded datasets which (a) cannot be randomized via 10-fold
cross-validation due to grouped samples, (b) are a subset of a larger dataset available on
OpenML, (c) have no source or reference available, (d) were created by binarizing regression
or multiclass classification tasks, or (e) include sparse data (e.g., text mining
datasets). A detailed list of the dataset properties can be found on OpenML.2
2. https://www.openml.org/s/14
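The inclusion requirements (a)-(d) above can be expressed as a simple predicate over dataset meta-data. The sketch below is illustrative only: the quality names follow OpenML's naming conventions (NumberOfInstances, NumberOfFeatures, and so on), but the function is not part of any client library, and the sample records are made up for demonstration.

```python
def satisfies_criteria(q):
    """Return True if a dataset's qualities meet requirements (a)-(d)."""
    ratio = q["MinorityClassSize"] / q["MajorityClassSize"]
    return (500 <= q["NumberOfInstances"] <= 100_000  # (a) medium-sized
            and q["NumberOfFeatures"] <= 5000         # (b) bounded dimensionality
            and q["NumberOfClasses"] >= 2             # (c) classification target
            and ratio > 0.05)                         # (d) not highly imbalanced

# Hypothetical candidate records; only the first mirrors a real dataset's sizes.
candidates = [
    {"name": "kr-vs-kp", "NumberOfInstances": 3196, "NumberOfFeatures": 37,
     "NumberOfClasses": 2, "MinorityClassSize": 1527, "MajorityClassSize": 1669},
    {"name": "tiny-set", "NumberOfInstances": 150, "NumberOfFeatures": 5,
     "NumberOfClasses": 3, "MinorityClassSize": 50, "MajorityClassSize": 50},
]
selected = [c["name"] for c in candidates if satisfies_criteria(c)]
print(selected)  # kr-vs-kp passes; tiny-set fails criterion (a)
```

The exclusion rules (grouped samples, duplicated subsets, missing provenance, binarized tasks, sparse data) require manual inspection and are not captured by such a predicate.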
import openml
import sklearn.metrics
import sklearn.pipeline
import sklearn.preprocessing
import sklearn.tree

openml.config.apikey = 'FILL_IN_OPENML_API_KEY'  # set the OpenML API key (required for uploading)
benchmark_suite = openml.study.get_study('OpenML100', 'tasks')  # obtain the benchmark suite
clf = sklearn.pipeline.Pipeline(steps=[
    ('imputer', sklearn.preprocessing.Imputer()),
    ('estimator', sklearn.tree.DecisionTreeClassifier())])  # build a sklearn classifier
for task_id in benchmark_suite.tasks:  # iterate over all tasks
    task = openml.tasks.get_task(task_id)  # download the OpenML task
    X, y = task.get_X_and_y()  # get the data (not used in this example)
    run = openml.runs.run_model_on_task(task, clf)  # run classifier on splits
    score = run.get_metric_score(sklearn.metrics.accuracy_score)  # compute the accuracy score
    print('Data set: %s; Accuracy: %0.2f' % (task.get_dataset().name, score.mean()))
    run.publish()  # publish the experiment on OpenML (optional)
    print('URL for run: %s/run/%d' % (openml.config.server, run.run_id))
library(OpenML)
library(mlr)
setOMLConfig(apikey = 'FILL_IN_OPENML_API_KEY')  # set the OpenML API key (required for uploading)
lrn = makeLearner('classif.rpart')  # construct a simple CART classifier
task.ids = getOMLStudy('OpenML100')$tasks$task.id  # obtain the list of suggested tasks
for (task.id in task.ids) {  # iterate over all tasks
  task = getOMLTask(task.id)  # download a single OpenML task
  data = as.data.frame(task)  # obtain the raw data set
  run = runTaskMlr(task, learner = lrn)  # run the constructed learner
  upload = uploadOMLRun(run)  # upload and tag the run
}
Figure 1: Running classifiers on a task and (optionally) uploading the results. Uploading
requires the user to fill in an API key.
HU 1900/3-1) and Collaborative Research Center SFB 876/A3 from the German Research
Foundation (DFG).
References
D. W. Aha. Generalizing from case studies: A case study. Proceedings of the International Conference
on Machine Learning (ICML), pages 1–10, 1992.
J. Alcala, A. Fernandez, J. Luengo, J. Derrac, S. Garcia, L. Sanchez, and F. Herrera. KEEL data-mining
software tool: Data set repository, integration of algorithms and experimental analysis framework.
Journal of Multiple-Valued Logic and Soft Computing, 17(2-3):255–287, 2010.
C. C. Chang and C. J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on
Intelligent Systems and Technology (TIST), 2(3):27, 2011.
Y. Chen, E. Keogh, B. Hu, N. Begum, A. Bagnall, A. Mueen, and G. Batista. The UCR time series
classification archive, July 2015. www.cs.ucr.edu/~eamonn/time_series_data/.
M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data
mining software: An update. ACM SIGKDD Explorations Newsletter, 11(1):10–18, 2009.
H. Hirsh. Data mining research: Current status and future opportunities. Statistical Analysis and
Data Mining, 1(2):104–107, 2008.
E. Keogh and S. Kasetty. On the need for time series data mining benchmarks: A survey and
empirical demonstration. Data Mining and Knowledge Discovery, 7(4):349–371, 2003.
G. Tsoumakas, E. Spyromitros-Xioufis, J. Vilcek, and I. Vlahavas. Mulan: A java library for multi-
label learning. Journal of Machine Learning Research, 12:2411–2414, 2011.
J. N. van Rijn and J. Vanschoren. Sharing RapidMiner workflows and experiments with OpenML.
In MetaSel@ PKDD/ECML, pages 93–103, 2015.
J. Vanschoren, J. N. van Rijn, B. Bischl, and L. Torgo. OpenML: Networked science in machine
learning. SIGKDD Explorations, 15(2):49–60, 2013.
Table 1: Datasets included in the OpenML100 benchmark suite. For each dataset, we show:
the OpenML task id and name, the number of classes (nClass), features (nFeat) and obser-
vations (nObs), as well as the ratio of the minority and majority class sizes (ratioMinMaj).
Task id Name nClass nFeat nObs ratioMinMaj Task id Name nClass nFeat nObs ratioMinMaj
3 kr-vs-kp 2 37 3196 0.91 3913 kc2 2 22 522 0.26
6 letter 26 17 20000 0.90 3917 kc1 2 22 2109 0.18
11 balance-scale 3 5 625 0.17 3918 pc1 2 22 1109 0.07
12 mfeat-factors 10 217 2000 1.00 3946 KDDCup09_churn 2 231 50000 0.08
14 mfeat-fourier 10 77 2000 1.00 3948 KDDCup09_upselling 2 231 50000 0.08
15 breast-w 2 10 699 0.53 3950 musk 2 170 6598 0.18
16 mfeat-karhunen 10 65 2000 1.00 3954 MagicTelescope 2 12 19020 0.54
18 mfeat-morphological 10 7 2000 1.00 7592 adult 2 15 48842 0.31
20 mfeat-pixel 10 241 2000 1.00 9914 wilt 2 6 4839 0.06
21 car 4 7 1728 0.05 9946 wdbc 2 31 569 0.59
22 mfeat-zernike 10 48 2000 1.00 9950 micro-mass 20 1301 571 0.18
23 cmc 3 10 1473 0.53 9952 phoneme 2 6 5404 0.42
24 mushroom 2 23 8124 0.93 9954 one-hundred-plants-margin 100 65 1600 1.00
28 optdigits 10 65 5620 0.97 9955 one-hundred-plants-shape 100 65 1600 1.00
29 credit-a 2 16 690 0.80 9956 one-hundred-plants-texture 100 65 1599 0.94
31 credit-g 2 21 1000 0.43 9957 qsar-biodeg 2 42 1055 0.51
32 pendigits 10 17 10992 0.92 9960 wall-robot-navigation 4 25 5456 0.15
36 segment 7 20 2310 1.00 9964 semeion 10 257 1593 0.96
37 diabetes 2 9 768 0.54 9967 steel-plates-fault 2 34 1941 0.53
41 soybean 19 36 683 0.09 9968 tamilnadu-electricity 20 4 45781 0.48
43 spambase 2 58 4601 0.65 9970 hill-valley 2 101 1212 1.00
45 splice 3 62 3190 0.46 9971 ilpd 2 11 583 0.40
49 tic-tac-toe 2 10 958 0.53 9976 madelon 2 501 2600 1.00
53 vehicle 4 19 846 0.91 9977 nomao 2 119 34465 0.40
58 waveform-5000 3 41 5000 0.98 9978 ozone-level-8hr 2 73 2534 0.07
219 electricity 2 9 45312 0.74 9979 cardiotocography 10 36 2126 0.09
2074 satimage 6 37 6430 0.41 9980 climate-model-simulation-crashes 2 21 540 0.09
2079 eucalyptus 5 20 736 0.49 9981 cnae-9 9 857 1080 1.00
3021 sick 2 30 3772 0.07 9983 eeg-eye-state 2 15 14980 0.81
3022 vowel 11 13 990 1.00 9985 first-order-theorem-proving 6 52 6118 0.19
3481 isolet 26 618 7797 0.99 9986 gas-drift 6 129 13910 0.55
3485 scene 2 300 2407 0.22 10093 banknote-authentication 2 5 1372 0.80
3492 monks-problems-1 2 7 556 1.00 10101 blood-transfusion-service-center 2 5 748 0.31
3493 monks-problems-2 2 7 601 0.52 14964 artificial-characters 10 8 10218 0.42
3494 monks-problems-3 2 7 554 0.92 14965 bank-marketing 2 17 45211 0.13
3510 JapaneseVowels 9 15 9961 0.48 14966 Bioresponse 2 1777 3751 0.84
3512 synthetic_control 6 62 600 1.00 14967 cjs 6 35 2796 0.40
3543 irish 2 6 500 0.80 14968 cylinder-bands 2 40 540 0.73
3549 analcatdata_authorship 4 71 841 0.17 14969 GesturePhaseSegmentationProcessed 5 33 9873 0.34
3560 analcatdata_dmft 6 5 797 0.79 14970 har 6 562 10299 0.72
3561 profb 2 10 672 0.50 34536 Internet-Advertisements 2 1559 3279 0.16
3567 collins 15 24 500 0.07 34537 PhishingWebsites 2 31 11055 0.80
3573 mnist_784 10 785 70000 0.80 34538 MiceProtein 8 82 1080 0.70
3889 sylva_agnostic 2 217 14395 0.07 34539 Amazon_employee_access 2 10 32769 0.06
3891 gina_agnostic 2 971 3468 0.97 125920 dresses-sales 2 13 500 0.72
3896 ada_agnostic 2 49 4562 0.33 125921 LED-display-domain-7digit 10 8 500 0.65
3899 mozilla4 2 6 15545 0.49 125922 texture 11 41 5500 1.00
3902 pc4 2 38 1458 0.14 125923 Australian 2 15 690 0.80
3903 pc3 2 38 1563 0.11 146606 higgs 2 29 98050 0.89
3904 jm1 2 22 10885 0.24 146607 SpeedDating 2 123 8378 0.20