
OpenML Benchmarking Suites and the OpenML100

Bernd Bischl [email protected]


Giuseppe Casalicchio [email protected]
Matthias Feurer [email protected]

Frank Hutter [email protected]


Michel Lang [email protected]
Rafael G. Mantovani [email protected]
Jan N. van Rijn [email protected]
Joaquin Vanschoren [email protected]

Editor: To be determined

Abstract
We advocate the use of curated, comprehensive benchmark suites of machine learning
datasets, backed by standardized OpenML-based interfaces and complementary software
toolkits written in Python, Java and R. Major distinguishing features of OpenML benchmark suites are (a) ease of use through standardized data formats, APIs, and existing
client libraries; (b) machine-readable meta-information regarding the contents of the suite;
and (c) online sharing of results, enabling large scale comparisons. As a first such suite, we
propose the OpenML100, a machine learning benchmark suite of 100 classification datasets
carefully curated from the thousands of datasets available on OpenML.org.
Keywords: machine learning, benchmarking

1. A Brief History of Benchmarking Suites


Proper algorithm benchmarking is a hallmark of machine learning research. It allows us, as a community, to track progress over time, to identify problems that remain challenging, and to learn which algorithms are most appropriate for specific applications. However, we currently lack standardized, easily accessible benchmark suites of datasets that are curated to reflect important problem domains, are practical to use, and support a rigorous analysis of performance results. This often leads to suboptimal shortcuts in study designs, producing rather small-scale, one-off experiments that should be interpreted with caution (Aha, 1992), are hard to reproduce (Pedersen, 2008; Hirsh, 2008), and may even lead to contradictory results (Keogh and Kasetty, 2003).
The machine learning field has long recognized the importance of dataset repositories. The UCI repository (Lichman, 2013) offers a wide range of datasets, but it does not attempt to make them available through a uniform format or API. The same holds for other repositories, such as LIBSVM (Chang and Lin, 2011). mldata.org is a very popular repository that does provide an API to easily download datasets, and is readily integrated in scikit-learn. However, it is no longer being maintained, and will very likely be merged with our OpenML.

KEEL (Alcala et al., 2010) offers some benchmark data suites, including one for imbalanced classification and one with data sets with missing values. It has a Java toolkit and an R library for convenient access. Likewise, PMLB (Olson et al., 2017) is another collection of datasets, with strong overlap with UCI, and provides tools to import them into Python scripts. However, none of the above tools allows users to add new datasets or to easily share and compare benchmarking results online.1 Other related benchmark collections include UCR (Chen et al., 2015) for time series data, OpenAI gym (Brockman et al., 2016) for reinforcement learning problems, and Mulan (Tsoumakas et al., 2011) for multilabel datasets, with some of the multilabel datasets already available on OpenML (Probst et al., 2017).

All of these existing repositories are rather well-curated, and for many years machine learning researchers have benchmarked their algorithms on a subset of their data sets. However, most of them do not provide APIs for downloading data in standardized formats into popular machine learning libraries, or for uploading and comparing the ensuing results. Hence, large-scale benchmarks that also build upon previous results of others are still the exception.

2. OpenML Benchmarking Suites


We advocate expanding on previous efforts with comprehensive benchmark suites backed by the open machine learning platform OpenML (Vanschoren et al., 2013). Our goal is to substantially facilitate in-depth benchmarking by providing a standard set of datasets covering a wide spectrum of domains and statistical properties, together with rich meta-data and standardized evaluation procedures (i.e., we also provide unified data splits for resampling methods). This eliminates guesswork, makes individual results more comparable, and allows for a more standardized analysis of all results. In addition, we provide software libraries in several programming languages to easily download these datasets, optionally download prior benchmarking results for reuse and comparison, and share results online.
OpenML is an online platform for reproducible, collaborative machine learning experiments and can be used to store and share all aspects of machine learning experiments, including data, code, experiment parameters and results. All our datasets in OpenML are provided in a uniform format, highlight issues such as unique-valued or constant features, include extensive meta-data for deeper analysis of evaluation results, and provide task-specific meta-data, such as target features and predefined train-test splits.
Researchers can conveniently explore the datasets included in OpenML through comprehensive APIs to find suitable learning tasks for their planned experiments, depending on required data set characteristics. These APIs allow them, for instance, to find all high-dimensional data sets with few observations and no missing values.
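
For illustration, the following Python sketch runs such a query against the dataset listing of the openml client library; the listing's return format and the quality field names (e.g., NumberOfFeatures, NumberOfMissingValues) vary somewhat between library versions, so treat this as a sketch rather than version-specific reference code.

import openml

# List basic meta-data ("data qualities") for all active datasets on OpenML.
# In older library versions this returns a dict mapping dataset id -> dict of qualities.
datasets = openml.datasets.list_datasets()

# Keep data sets that are high-dimensional relative to their size and have no missing values.
selected = []
for did, meta in datasets.items():
    n_obs = int(meta.get('NumberOfInstances', 0))
    n_feat = int(meta.get('NumberOfFeatures', 0))
    n_missing = int(meta.get('NumberOfMissingValues', 0))
    if n_obs > 0 and n_feat > n_obs and n_missing == 0:
        selected.append((did, meta.get('name', ''), n_obs, n_feat))

for did, name, n_obs, n_feat in selected:
    print('%d: %s (%d observations, %d features)' % (did, name, n_obs, n_feat))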

3. The OpenML100 Benchmarking Suite


On top of OpenML’s customizable functionality, we provide a new standard benchmark
suite of 100 high-quality datasets carefully curated from the many thousands available on
OpenML: the OpenML100.

1. The latter used to be possible with DELVE (http://www.cs.toronto.edu/~delve/) and mlcomp.org, but both services are no longer maintained.


We selected classification datasets for this benchmarking suite to satisfy the following requirements: (a) the number of observations is between 500 and 100 000 to focus on medium-sized datasets, (b) the number of features does not exceed 5000 to keep the runtime of algorithms low, (c) the target attribute has at least two classes, and (d) the ratio of the minority class to the majority class is above 0.05 (to eliminate highly imbalanced datasets). We excluded datasets which (a) cannot be randomized via 10-fold cross-validation due to grouped samples, (b) are a subset of a larger dataset available on OpenML, (c) have no source or reference available, (d) were created by binarization of regression or multiclass classification tasks, or (e) include sparse data (e.g., text mining data sets). A detailed list of the data properties can be found on OpenML.2
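
Since criteria (a)-(d) are expressed in terms of meta-data that OpenML computes for every dataset, they can be checked programmatically. The Python sketch below illustrates this using the openml client library; the quality field names (e.g., MinorityClassSize) and the listing's return format are assumptions that may vary between library versions, and the exclusion criteria, which required manual curation, are not captured here.

import openml

def satisfies_openml100_inclusion_criteria(meta):
    # Check inclusion criteria (a)-(d) against a dataset's meta-data dictionary.
    n_obs = int(meta.get('NumberOfInstances', 0))
    n_feat = int(meta.get('NumberOfFeatures', 0))
    n_classes = int(meta.get('NumberOfClasses', 0))
    minority = float(meta.get('MinorityClassSize', 0))
    majority = float(meta.get('MajorityClassSize', 0))
    return (500 <= n_obs <= 100000           # (a) medium-sized datasets
            and n_feat <= 5000               # (b) keep algorithm runtimes low
            and n_classes >= 2               # (c) at least two classes
            and majority > 0                 # guard against division by zero
            and minority / majority > 0.05)  # (d) not highly imbalanced

listing = openml.datasets.list_datasets()
candidates = {did: meta for did, meta in listing.items()
              if satisfies_openml100_inclusion_criteria(meta)}
print('%d datasets satisfy the inclusion criteria' % len(candidates))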

4. How to use the OpenML100


In this section we demonstrate how our dataset collection can be conveniently imported for benchmarking using our client libraries in Python, Java and R. Figure 1 provides exemplary code chunks for downloading the datasets and running a basic classifier in all three languages. In these examples, we use the Python library with scikit-learn (Pedregosa et al., 2011), the R package (Casalicchio et al., 2017) with mlr (Bischl et al., 2016), and the Java library with Weka (Hall et al., 2009). OpenML has also been integrated in MOA and RapidMiner (van Rijn and Vanschoren, 2015).
OpenML works with the concept of tasks to facilitate comparable and reproducible
results. A task extends a dataset with task-specific information, such as target attributes
and evaluation procedures. Datasets and tasks are automatically downloaded at first use
and are afterwards cached locally. Studies combine a specific set of tasks and can also hold
all benchmarking results obtained on them. In the code examples, the OpenML100 tasks
are downloaded through the study with the same name. They also show how to access the
raw data set (although this is not needed to train a model), fit a simple classifier on the
defined data splits, and finally publish runs on the OpenML server. Note that the Java
implementation automatically uploads results to the server.
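
To make the notion of a task concrete, the following Python sketch (complementary to Figure 1, using only the openml client library) downloads the OpenML100 study, fetches one of its tasks, and inspects the predefined splits; attribute and method names such as target_name and get_train_test_split_indices may differ slightly between library versions.

import openml

# Download the OpenML100 study and fetch its first task (cached locally after first use).
benchmark_suite = openml.study.get_study('OpenML100', 'tasks')
task = openml.tasks.get_task(benchmark_suite.tasks[0])

# A task bundles a dataset with a target attribute and an evaluation procedure.
dataset = task.get_dataset()
print('Dataset: %s, target attribute: %s' % (dataset.name, task.target_name))

# Inspect the predefined train/test split for the first fold of the first repeat.
train_idx, test_idx = task.get_train_test_split_indices(repeat=0, fold=0)
print('Fold 0: %d training and %d test instances' % (len(train_idx), len(test_idx)))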

5. Creating new Benchmarking Suites


The set of datasets on OpenML.org can easily be extended, and additional OpenML benchmark suites, e.g., for regression and time-series data, can be created by defining sets of datasets according to specific needs. Instructions for creating new benchmarking suites can be found on https://www.openml.org. We currently envision two routes of extension: (a) facilitating the creation and versioning of these benchmark suites on OpenML.org; and (b) adding automatic statistical analysis, visualization and reporting on the online platform.
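
As a rough sketch of route (a): recent versions of the openml Python library expose a create_benchmark_suite helper (added well after this paper was written, so its availability, signature, and the alias parameter are assumptions to be checked against the installed version and the instructions on OpenML.org). With it, defining and publishing a new suite amounts to naming a list of task ids:

import openml

openml.config.apikey = 'FILL_IN_OPENML_API_KEY'  # publishing requires an API key

# A benchmark suite is essentially a named, published list of OpenML task ids;
# here we reuse a hand-picked subset of the OpenML100 tasks from Table 1.
task_ids = [3, 6, 11, 12]

suite = openml.study.create_benchmark_suite(
    name='my-benchmark-suite',
    description='Example suite defined over a hand-picked set of tasks.',
    task_ids=task_ids,
    alias='my_benchmark_suite',  # short handle for retrieving the suite later
)
suite.publish()  # upload the suite definition to OpenML.org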
Acknowledgements This work has been supported in part by the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme under grant no. 716721, by grant #2015/03986-0 from the São Paulo Research Foundation (FAPESP), as well as by the Priority Programme Autonomous Learning (SPP 1527, grant HU 1900/3-1) and the Collaborative Research Center SFB 876/A3 of the German Research Foundation (DFG).

2. https://www.openml.org/s/14


import openml
import sklearn.metrics
import sklearn.pipeline
import sklearn.preprocessing
import sklearn.tree

benchmark_suite = openml.study.get_study('OpenML100', 'tasks')  # obtain the benchmark suite
clf = sklearn.pipeline.Pipeline(steps=[
    ('imputer', sklearn.preprocessing.Imputer()),
    ('estimator', sklearn.tree.DecisionTreeClassifier())])  # build a sklearn classifier
for task_id in benchmark_suite.tasks:  # iterate over all tasks
    task = openml.tasks.get_task(task_id)  # download the OpenML task
    X, y = task.get_X_and_y()  # get the data (not used in this example)
    openml.config.apikey = 'FILL_IN_OPENML_API_KEY'  # set the OpenML API key
    run = openml.runs.run_model_on_task(task, clf)  # run classifier on predefined splits (requires API key)
    score = run.get_metric_score(sklearn.metrics.accuracy_score)  # per-fold accuracy scores
    print('Data set: %s; Accuracy: %0.2f' % (task.get_dataset().name, score.mean()))
    run.publish()  # publish the experiment on OpenML (optional)
    print('URL for run: %s/run/%d' % (openml.config.server, run.run_id))

(a) Python, available on https://github.com/openml/openml-python/

public static void runTasksAndUpload() throws Exception {
    OpenmlConnector openml = new OpenmlConnector();
    Study benchmarksuite = openml.studyGet("OpenML100", "tasks"); // obtain the benchmark suite
    Classifier tree = new REPTree(); // build a Weka classifier
    for (Integer taskId : benchmarksuite.getTasks()) { // iterate over all tasks
        Task t = openml.taskGet(taskId); // download the OpenML task
        Instances d = InstancesHelper.getDatasetFromTask(openml, t); // obtain the dataset
        openml.setApiKey("FILL_IN_OPENML_API_KEY");
        int runId = RunOpenmlJob.executeTask(openml, new WekaConfig(), taskId, tree);
        Run run = openml.runGet(runId); // retrieve the uploaded run
    }
}

(b) Java, available on Maven Central with artifact id ‘org.openml.openmlweka’

library(OpenML)
library(mlr)  # provides makeLearner()
lrn = makeLearner('classif.rpart')  # construct a simple CART classifier
task.ids = getOMLStudy('OpenML100')$tasks$task.id  # obtain the list of suggested tasks
for (task.id in task.ids) {  # iterate over all tasks
  task = getOMLTask(task.id)  # download single OML task
  data = as.data.frame(task)  # obtain raw data set
  run = runTaskMlr(task, learner = lrn)  # run constructed learner
  setOMLConfig(apikey = 'FILL_IN_OPENML_API_KEY')
  upload = uploadOMLRun(run)  # upload and tag the run
}

(c) R, available on CRAN via package OpenML

Figure 1: Running classifiers on a task and (optionally) uploading the results. Uploading
requires the user to fill in an API key.



References
D. W. Aha. Generalizing from case studies: A case study. Proceedings of the International Conference
on Machine Learning (ICML), pages 1–10, 1992.

J. Alcala, A. Fernandez, J. Luengo, J. Derrac, S. Garcia, L. Sanchez, and F. Herrera. KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic and Soft Computing, 17(2-3):255–287, 2010.

B. Bischl, M. Lang, L. Kotthoff, J. Schiffner, J. Richter, E. Studerus, G. Casalicchio, and Z. M. Jones. mlr: Machine learning in R. Journal of Machine Learning Research, 17(170):1–5, 2016.

G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. OpenAI Gym. arXiv:1606.01540, 2016.

G. Casalicchio, J. Bossek, M. Lang, D. Kirchhoff, P. Kerschke, B. Hofner, H. Seibold, J. Vanschoren, and B. Bischl. OpenML: An R package to connect to the machine learning platform OpenML. Computational Statistics, Jun 2017.

C. C. Chang and C. J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on
Intelligent Systems and Technology (TIST), 2(3):27, 2011.

Y. Chen, E. Keogh, B. Hu, N. Begum, A. Bagnall, A. Mueen, and G. Batista. The UCR time series
classification archive, July 2015. www.cs.ucr.edu/~eamonn/time_series_data/.
M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data
mining software: An update. ACM SIGKDD Explorations Newsletter, 11(1):10–18, 2009.

H. Hirsh. Data mining research: Current status and future opportunities. Statistical Analysis and
Data Mining, 1(2):104–107, 2008.
E. Keogh and S. Kasetty. On the need for time series data mining benchmarks: A survey and
empirical demonstration. Data Mining and Knowledge Discovery, 7(4):349–371, 2003.

M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

R. S. Olson, W. La Cava, P. Orzechowski, R. J. Urbanowicz, and J. H. Moore. PMLB: A large benchmark suite for machine learning evaluation and comparison. arXiv:1703.00512, 2017.

T. Pedersen. Empiricism is not a matter of faith. Computational Linguistics, 34:465–470, 2008.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
P. Probst, Q. Au, G. Casalicchio, C. Stachl, and B. Bischl. Multilabel classification with R package
mlr. The R Journal, 9(1):352–369, 2017.

G. Tsoumakas, E. Spyromitros-Xioufis, J. Vilcek, and I. Vlahavas. Mulan: A Java library for multi-label learning. Journal of Machine Learning Research, pages 2411–2414, Jul 2011.
J. N. van Rijn and J. Vanschoren. Sharing RapidMiner workflows and experiments with OpenML. In MetaSel@PKDD/ECML, pages 93–103, 2015.

J. Vanschoren, J. N. van Rijn, B. Bischl, and L. Torgo. OpenML: Networked science in machine
learning. SIGKDD Explorations, 15(2):49–60, 2013.


Appendix A. Datasets Included in the OpenML100 Benchmark Suite

Table 1: Datasets included in the OpenML100 benchmark suite. For each dataset, we show:
the OpenML task id and name, the number of classes (nClass), features (nFeat) and obser-
vations (nObs), as well as the ratio of the minority and majority class sizes (ratioMinMaj).
Task id Name nClass nFeat nObs ratioMinMaj Task id Name nClass nFeat nObs ratioMinMaj
3 kr-vs-kp 2 37 3196 0.91 3913 kc2 2 22 522 0.26
6 letter 26 17 20000 0.90 3917 kc1 2 22 2109 0.18
11 balance-scale 3 5 625 0.17 3918 pc1 2 22 1109 0.07
12 mfeat-factors 10 217 2000 1.00 3946 KDDCup09_churn 2 231 50000 0.08
14 mfeat-fourier 10 77 2000 1.00 3948 KDDCup09_upselling 2 231 50000 0.08
15 breast-w 2 10 699 0.53 3950 musk 2 170 6598 0.18
16 mfeat-karhunen 10 65 2000 1.00 3954 MagicTelescope 2 12 19020 0.54
18 mfeat-morphological 10 7 2000 1.00 7592 adult 2 15 48842 0.31
20 mfeat-pixel 10 241 2000 1.00 9914 wilt 2 6 4839 0.06
21 car 4 7 1728 0.05 9946 wdbc 2 31 569 0.59
22 mfeat-zernike 10 48 2000 1.00 9950 micro-mass 20 1301 571 0.18
23 cmc 3 10 1473 0.53 9952 phoneme 2 6 5404 0.42
24 mushroom 2 23 8124 0.93 9954 one-hundred-plants-margin 100 65 1600 1.00
28 optdigits 10 65 5620 0.97 9955 one-hundred-plants-shape 100 65 1600 1.00
29 credit-a 2 16 690 0.80 9956 one-hundred-plants-texture 100 65 1599 0.94
31 credit-g 2 21 1000 0.43 9957 qsar-biodeg 2 42 1055 0.51
32 pendigits 10 17 10992 0.92 9960 wall-robot-navigation 4 25 5456 0.15
36 segment 7 20 2310 1.00 9964 semeion 10 257 1593 0.96
37 diabetes 2 9 768 0.54 9967 steel-plates-fault 2 34 1941 0.53
41 soybean 19 36 683 0.09 9968 tamilnadu-electricity 20 4 45781 0.48
43 spambase 2 58 4601 0.65 9970 hill-valley 2 101 1212 1.00
45 splice 3 62 3190 0.46 9971 ilpd 2 11 583 0.40
49 tic-tac-toe 2 10 958 0.53 9976 madelon 2 501 2600 1.00
53 vehicle 4 19 846 0.91 9977 nomao 2 119 34465 0.40
58 waveform-5000 3 41 5000 0.98 9978 ozone-level-8hr 2 73 2534 0.07
219 electricity 2 9 45312 0.74 9979 cardiotocography 10 36 2126 0.09
2074 satimage 6 37 6430 0.41 9980 climate-model-simulation-crashes 2 21 540 0.09
2079 eucalyptus 5 20 736 0.49 9981 cnae-9 9 857 1080 1.00
3021 sick 2 30 3772 0.07 9983 eeg-eye-state 2 15 14980 0.81
3022 vowel 11 13 990 1.00 9985 first-order-theorem-proving 6 52 6118 0.19
3481 isolet 26 618 7797 0.99 9986 gas-drift 6 129 13910 0.55
3485 scene 2 300 2407 0.22 10093 banknote-authentication 2 5 1372 0.80
3492 monks-problems-1 2 7 556 1.00 10101 blood-transfusion-service-center 2 5 748 0.31
3493 monks-problems-2 2 7 601 0.52 14964 artificial-characters 10 8 10218 0.42
3494 monks-problems-3 2 7 554 0.92 14965 bank-marketing 2 17 45211 0.13
3510 JapaneseVowels 9 15 9961 0.48 14966 Bioresponse 2 1777 3751 0.84
3512 synthetic_control 6 62 600 1.00 14967 cjs 6 35 2796 0.40
3543 irish 2 6 500 0.80 14968 cylinder-bands 2 40 540 0.73
3549 analcatdata_authorship 4 71 841 0.17 14969 GesturePhaseSegmentationProcessed 5 33 9873 0.34
3560 analcatdata_dmft 6 5 797 0.79 14970 har 6 562 10299 0.72
3561 profb 2 10 672 0.50 34536 Internet-Advertisements 2 1559 3279 0.16
3567 collins 15 24 500 0.07 34537 PhishingWebsites 2 31 11055 0.80
3573 mnist_784 10 785 70000 0.80 34538 MiceProtein 8 82 1080 0.70
3889 sylva_agnostic 2 217 14395 0.07 34539 Amazon_employee_access 2 10 32769 0.06
3891 gina_agnostic 2 971 3468 0.97 125920 dresses-sales 2 13 500 0.72
3896 ada_agnostic 2 49 4562 0.33 125921 LED-display-domain-7digit 10 8 500 0.65
3899 mozilla4 2 6 15545 0.49 125922 texture 11 41 5500 1.00
3902 pc4 2 38 1458 0.14 125923 Australian 2 15 690 0.80
3903 pc3 2 38 1563 0.11 146606 higgs 2 29 98050 0.89
3904 jm1 2 22 10885 0.24 146607 SpeedDating 2 123 8378 0.20
