Repository for reproducible benchmarking of database-like operations in a single-node environment.
Benchmark report is available at h2oai.github.io/db-benchmark.
We focused mainly on portability and reproducibility. The benchmark is routinely re-run to present up-to-date timings. Most of the solutions used are automatically upgraded to their stable or development versions.
This benchmark is meant to compare scalability both in data volume and data complexity.
Contribution and feedback are very welcome!
- groupby
- join
- sort
- read
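For orientation, the four task types can be illustrated on toy data with pandas. This is only a sketch: the `id1`/`v1` column names follow the benchmark's naming convention, but the actual benchmark questions and data sizes differ.

```python
# Toy illustration of the four benchmark task types using pandas.
import pandas as pd

lhs = pd.DataFrame({"id1": ["id1", "id2", "id1", "id3"], "v1": [1, 2, 3, 4]})
rhs = pd.DataFrame({"id1": ["id1", "id2"], "v2": [10, 20]})

grouped = lhs.groupby("id1", as_index=False)["v1"].sum()  # groupby task
joined = lhs.merge(rhs, on="id1", how="inner")            # join task
ordered = lhs.sort_values("v1", ascending=False)          # sort task
# read task: e.g. pd.read_csv("data/G1_1e7_1e2_0_0.csv")
```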
- edit `path.env` and set `julia` and `java` paths
- if a solution uses python, create a new `virtualenv` as `$solution/py-$solution`; for example, for `pandas` use `virtualenv pandas/py-pandas --python=/usr/bin/python3.6`
- install every solution (if needed, activate each `virtualenv`)
- edit `run.conf` to define solutions and tasks to benchmark
- generate data; for `groupby` use `Rscript groupby-datagen.R 1e7 1e2 0 0` to create `G1_1e7_1e2_0_0.csv`, re-save to binary data where needed, then create a `data` directory and keep all data files there
- edit `data.csv` to define data sizes to benchmark using the `active` flag
- start the benchmark with `./run.sh`
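As a rough illustration of the data-generation step, here is a small Python stand-in for the R generator. It is a simplification under stated assumptions: the real `groupby-datagen.R` produces more columns, and the last two arguments (here fixed as `0 0`) are only reflected in the file name.

```python
# Small-scale stand-in for `Rscript groupby-datagen.R 1e7 1e2 0 0`:
# writes a CSV named by the G1_<rows>_<groups>_0_0 convention into `data/`.
# The two-column schema is a simplification of the real generator's output.
import csv
import os
import random

def generate_groupby_data(n_rows, n_groups, out_dir="data"):
    os.makedirs(out_dir, exist_ok=True)
    # Format counts in scientific notation, e.g. 1000 -> "1e3".
    name = f"G1_{n_rows:.0e}_{n_groups:.0e}_0_0".replace("e+0", "e")
    path = os.path.join(out_dir, f"{name}.csv")
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["id1", "v1"])
        for _ in range(n_rows):
            w.writerow([f"id{random.randrange(n_groups):03d}",
                        round(random.uniform(0, 100), 6)])
    return path

path = generate_groupby_data(1000, 100)  # writes data/G1_1e3_1e2_0_0.csv
```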
- generate data (see the related point above)
- set the data name env var, for example in `groupby` use something like `export SRC_GRP_LOCAL=G1_1e7_1e2_0_0`
- if a solution uses python, activate that solution's `virtualenv`
- enter an interactive console and run the lines of the script interactively
- `cuDF`
  - use `conda` instead of `virtualenv`
- `ClickHouse`
  - generate data having an extra primary key column according to `clickhouse/setup-clickhouse.sh`
  - follow the "reproduce interactive environment" section from `clickhouse/setup-clickhouse.sh`
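In an interactive session, the env var set above is used to resolve the data file. A minimal Python sketch, assuming the `data/` directory layout from the reproduce steps:

```python
# Resolve the benchmark data path from the env var, as an interactive
# session for a python solution would; details vary per solution script.
import os

os.environ["SRC_GRP_LOCAL"] = "G1_1e7_1e2_0_0"  # as exported in the shell

data_name = os.environ["SRC_GRP_LOCAL"]
src_file = os.path.join("data", f"{data_name}.csv")
# then e.g.: x = pd.read_csv(src_file), followed by the script's lines
```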
- setting up the r3-8xlarge instance (244 GB RAM, 32 cores): Amazon EC2 for beginners
- (slightly outdated) full reproduce script on clean Ubuntu 16.04: repro.sh
- Timings for some solutions might be missing for particular data sizes or questions. Some functions are not yet implemented in all solutions, so we were unable to answer all questions in all solutions. Some solutions might also run out of memory while running the benchmark script, which results in the process being killed by the OS. Lastly, we set a timeout for each single benchmark script; once the timeout value is reached, the script is terminated.
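The timeout behaviour can be sketched with Python's `subprocess`; the 2-second limit and the dummy command below are illustrative, not the benchmark's actual settings.

```python
# Run a command as a subprocess and terminate it once a time limit is hit,
# mirroring the per-script timeout described above.
import subprocess
import sys

def run_with_timeout(cmd, timeout_s):
    """Return (completed, returncode); completed is False on timeout."""
    try:
        proc = subprocess.run(cmd, timeout=timeout_s)
        return True, proc.returncode
    except subprocess.TimeoutExpired:
        return False, None

# A "script" that sleeps past the limit gets killed after ~2 seconds.
done, rc = run_with_timeout(
    [sys.executable, "-c", "import time; time.sleep(60)"], 2)
```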