Forked from h2oai/db-benchmark
Repository for reproducible benchmarking of database-like operations in a single-node environment.
Benchmark report is available at duckdblabs.github.io/db-benchmark.
We focused mainly on portability and reproducibility. The benchmark is routinely re-run to present up-to-date timings, and most of the solutions are automatically upgraded to their stable or development versions.
This benchmark is meant to compare scalability in both data volume and data complexity.
Contribution and feedback are very welcome!
The benchmark covers the following tasks:

- groupby
- join
- groupby2014
If you would like your solution to be included, feel free to file a PR with the necessary `setup-$solution` / `ver-$solution` / `groupby-$solution` / `join-$solution` scripts. Once the team at duckdblabs approves the PR, it will be merged.
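As an illustration, a PR for a hypothetical solution named `mytool` would be expected to add scripts along these lines (the name `mytool`, the file extensions, and the comments are assumptions for illustration, not an actual layout from this repository):

```
mytool/setup-mytool.sh    # install mytool and its dependencies
mytool/ver-mytool.sh      # record the installed version of mytool
mytool/groupby-mytool.R   # benchmark script for the groupby task
mytool/join-mytool.R      # benchmark script for the join task
```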
- if a solution uses python, create a new virtualenv as `$solution/py-$solution`; for example, for pandas use `virtualenv pandas/py-pandas --python=/usr/bin/python3.10`
- install every solution, following the `$solution/setup-$solution.sh` scripts by hand; they are not automated scripts
- edit `run.conf` to define the solutions and tasks to benchmark
- generate data; for groupby use `Rscript _data/groupby-datagen.R 1e7 1e2 0 0` to create `G1_1e7_1e2_0_0.csv`, re-save to binary format where needed (see below), then create a `data` directory and keep all data files there
- edit `_control/data.csv` to define the data sizes to benchmark using the `active` flag
- ensure SWAP is disabled and the ClickHouse server is not yet running
- start the benchmark with `./run.sh` (a condensed sketch of these steps follows below)
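Put together, a minimal run might look like the following sketch. It assumes pandas as the example python solution and skips the manual parts (stepping through each `setup-$solution.sh`, and editing `run.conf` and `_control/data.csv`), which have to be done by hand as described above:

```bash
# create an isolated virtualenv for a python solution (pandas as example)
virtualenv pandas/py-pandas --python=/usr/bin/python3.10

# generate groupby data: 1e7 rows, 1e2 groups, 0% NAs, unsorted
Rscript _data/groupby-datagen.R 1e7 1e2 0 0
mkdir -p data
mv G1_1e7_1e2_0_0.csv data/

# after editing run.conf and _control/data.csv, launch the full benchmark
./run.sh
```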
- install solution software
  - for python we recommend using `virtualenv` for better isolation
  - for R, ensure the library is installed in a solution subdirectory, so that `library("dplyr", lib.loc="./dplyr/r-dplyr")` or `library("data.table", lib.loc="./datatable/r-datatable")` works
  - note that some solutions may require another solution to be installed to speed up csv data load; for example, `dplyr` requires `data.table`, and similarly `pandas` requires (py)`datatable`
- generate data using the `_data/*-datagen.R` scripts; for example, `Rscript _data/groupby-datagen.R 1e7 1e2 0 0` creates `G1_1e7_1e2_0_0.csv`; put data files in the `data` directory
- run the benchmark for a single solution using `./_launcher/solution.R --solution=data.table --task=groupby --nrow=1e7`
- run other data cases by passing extra parameters `--k=1e2 --na=0 --sort=0`
- use `--quiet=true` to suppress the script's output and print timings only; use `--print=question,run,time_sec` to specify the columns printed to the console (to print all, use `--print=*`)
- use `--out=time.csv` to write timings to a file rather than to the console (see the example below)
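For instance, combining the launcher flags listed above (the particular combination shown here is only illustrative):

```bash
# default data case: groupby task for data.table on 1e7 rows
./_launcher/solution.R --solution=data.table --task=groupby --nrow=1e7

# explicit data case, quiet console output, selected columns,
# timings written to a file instead of the console
./_launcher/solution.R --solution=data.table --task=groupby --nrow=1e7 \
  --k=1e2 --na=0 --sort=0 \
  --quiet=true --print=question,run,time_sec --out=time.csv
```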
- install the software in the expected location, as detailed above
- ensure the data name used in the env var below is present in the `./data` dir
- source the python virtual environment if needed
- call `SRC_DATANAME=G1_1e7_1e2_0_0 R`; if desired, replace `R` with `python` or `julia`
- proceed by pasting code from the benchmark script (see the example below)
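For example, an interactive pandas session might be started as follows (the virtualenv path follows the convention above; `pandas/groupby-pandas.py` is assumed to be the relevant benchmark script per the `groupby-$solution` naming):

```bash
# activate the solution's virtualenv (python solutions only)
source pandas/py-pandas/bin/activate

# point the session at a dataset present in ./data, then start the interpreter
SRC_DATANAME=G1_1e7_1e2_0_0 python

# ...then paste code step by step from the benchmark script,
# e.g. pandas/groupby-pandas.py
```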
- setting up the m4.10xlarge instance (160 GB RAM, 32 cores): Amazon link
- full reproduce script on clean Ubuntu 22.04: `_utils/repro.sh`
Timings for solutions from before the fork have been deleted; you can still view them in the original h2oai/db-benchmark repository. Including these timings in report generation resulted in errors, and since all libraries have been updated and benchmarked on new hardware, the decision was made to start a new results file.

Timings for some solutions might be missing for particular data sizes or questions. Some functions are not yet implemented in all solutions, so we were unable to answer all questions in all solutions. Some solutions might also run out of memory while running a benchmark script, which results in the process being killed by the OS. There is also a timeout for a single benchmark script; once the timeout value is reached, the script is terminated. Please check the exceptions label in the original h2oai repository for a list of issues/defects in solutions that prevent us from providing all timings. There is also a no documentation label that lists issues blocked by missing documentation in the solutions we are benchmarking.