A lean playground for experimenting with CUDA GPU kernels.
Directory highlights
- benchmarks/ – micro-benchmarks + visualisations
- new_kernels/ – hand-tuned CUDA kernels (LayerNorm, Softmax, …)
- evals/ – PyTest correctness & regression suite
- agents/ – LLM pipeline that auto-writes kernels for KernelBench
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt # Dependencies including Triton (for benchmarking), PyTorch, plotting libs

GPU prerequisites
- CUDA 11.8+ drivers
- Compute Capability ≥ 7.0 (RTX 30-series, A100/H100, …)
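
To confirm that a machine meets the compute-capability floor before running anything, a small check along these lines may help. It is a sketch, not part of the repository: `meets_min_capability` and `describe_gpu` are hypothetical helper names, and only the 7.0 threshold comes from the list above.

```python
MIN_CAPABILITY = (7, 0)  # threshold taken from the prerequisites list


def meets_min_capability(capability, minimum=MIN_CAPABILITY):
    """Return True if a (major, minor) pair is at least `minimum`.

    Tuples compare element by element, so (8, 6) >= (7, 0) is True.
    """
    return tuple(capability) >= tuple(minimum)


def describe_gpu(device_index=0):
    """Return a one-line summary for the given CUDA device (hypothetical helper)."""
    import torch  # imported lazily so the pure check above works without PyTorch

    if not torch.cuda.is_available():
        return "No CUDA device visible to PyTorch."
    capability = torch.cuda.get_device_capability(device_index)
    name = torch.cuda.get_device_name(device_index)
    status = "OK" if meets_min_capability(capability) else "below the 7.0 floor"
    return f"{name}: compute capability {capability[0]}.{capability[1]} ({status})"
```

Calling `print(describe_gpu())` from a Python shell inside the activated venv reports the first visible device.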
pytest -q evals # FP32 + BF16 when supported

Tests compare the CUDA kernels against the PyTorch reference with strict assert_verbose_allclose tolerances.
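
For readers unfamiliar with this style of check, the idea behind a "verbose" closeness assertion can be sketched in plain Python. This is an illustration only and not the repository's assert_verbose_allclose: the names `find_mismatches` and `assert_close_verbose` are hypothetical, the inputs are flat sequences of floats rather than tensors, and the default tolerances simply mirror those of `torch.allclose`.

```python
def find_mismatches(actual, expected, rtol=1e-5, atol=1e-8):
    """Return (index, actual, expected) for every element outside tolerance.

    An element passes when |actual - expected| <= atol + rtol * |expected|.
    """
    if len(actual) != len(expected):
        raise ValueError(f"length mismatch: {len(actual)} vs {len(expected)}")
    mismatches = []
    for i, (a, e) in enumerate(zip(actual, expected)):
        if abs(a - e) > atol + rtol * abs(e):
            mismatches.append((i, a, e))
    return mismatches


def assert_close_verbose(actual, expected, rtol=1e-5, atol=1e-8, max_report=5):
    """Raise AssertionError listing the first few offending elements."""
    bad = find_mismatches(actual, expected, rtol=rtol, atol=atol)
    if bad:
        details = "\n".join(
            f"  index {i}: got {a!r}, expected {e!r}" for i, a, e in bad[:max_report]
        )
        raise AssertionError(
            f"{len(bad)} of {len(expected)} elements out of tolerance\n{details}"
        )
```

The point of the verbose variant is that a failing test names the offending indices and values instead of only reporting that the comparison failed.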
For the common cases you do not need to pass any arguments – just execute one of the convenience shell scripts and grab a coffee:
# Forward-only roofline runs
bash benchmarks/run_layer_norm_sol.sh
bash benchmarks/run_softmax_sol.sh
bash benchmarks/run_diagonal_matmul_sol.sh
bash benchmarks/run_fused_linear_rowsum_sol.sh

Each script will
- call the corresponding benchmarks/scripts/benchmark_*.py file,
- append results to benchmarks/data/all_benchmark_data.csv,
- auto-generate PNGs in benchmarks/visualizations/ (one per extra-config).
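
The append step above can be pictured with the standard `csv` module. This is a minimal sketch under stated assumptions: `append_rows` is a hypothetical helper, and the column names in the usage example are invented for illustration; the real schema of all_benchmark_data.csv may differ.

```python
import csv
import os


def append_rows(csv_path, rows, fieldnames):
    """Append dict rows to csv_path, writing the header only for a new or empty file."""
    needs_header = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if needs_header:
            writer.writeheader()
        writer.writerows(rows)
```

A call might look like `append_rows(path, [{"kernel_name": "layer_norm", "metric_name": "speed", "value": 1.0}], ["kernel_name", "metric_name", "value"])`. Writing the header conditionally is what lets several scripts share one results file.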
Prefer explicit flags? You can run the python scripts directly:
python benchmarks/scripts/benchmark_layer_norm.py --overwrite
python benchmarks/benchmark_visualizer.py \
--kernel-name layer_norm --metric-name speed \
--kernel-operation-mode forward --display

The visualizer produces one plot per extra_benchmark_config unless --extra-config-filter is supplied. Expect several PNGs.
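
Conceptually, getting one plot per extra config amounts to a group-by over the result rows. The sketch below illustrates that idea only; it is not the visualizer's code. `group_by_extra_config` is a hypothetical name, and treating the filter as a substring match is an assumption about how --extra-config-filter behaves.

```python
from collections import defaultdict


def group_by_extra_config(rows, config_filter=None):
    """Group result rows by their extra_benchmark_config value.

    ASSUMPTION: config_filter keeps configs that contain it as a substring;
    the real flag may match differently.
    """
    groups = defaultdict(list)
    for row in rows:
        config = row["extra_benchmark_config"]
        if config_filter is not None and config_filter not in config:
            continue
        groups[config].append(row)
    return dict(groups)
```

Each key of the returned dict would correspond to one PNG, which is why an unfiltered run yields several images.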
Full-scale regeneration (~25 min on an H100):
source venv/bin/activate # ensure deps are present
for s in benchmarks/scripts/benchmark_*.py; do python "$s" --overwrite; done
for k in fused_linear_rowsum layer_norm softmax diag_matmul; do
python benchmarks/benchmark_visualizer.py --kernel-name "$k" --metric-name speed
python benchmarks/benchmark_visualizer.py --kernel-name "$k" --metric-name memory || true
done

Hardware: Ampere or newer GPU with ≥40 GB VRAM to cover the largest cases.
The regenerated all_benchmark_data.csv should match the committed copy apart from timestamps.
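
One way to check a regenerated file against the committed copy while ignoring timestamps is sketched below. The helper names are hypothetical and the column name `timestamp` is an assumption; adjust `ignore` to whatever the CSV actually calls that column.

```python
import csv


def load_rows(csv_path, ignore=("timestamp",)):
    """Read a CSV into a list of dicts, dropping the columns named in `ignore`."""
    with open(csv_path, newline="") as f:
        return [
            {key: value for key, value in row.items() if key not in ignore}
            for row in csv.DictReader(f)
        ]


def same_results(generated_path, committed_path, ignore=("timestamp",)):
    """Return True if both files hold identical rows once ignored columns are removed."""
    return load_rows(generated_path, ignore) == load_rows(committed_path, ignore)
```

If measured values jitter between runs on a given machine, those columns can be added to `ignore` so that only the set of benchmarked configurations is compared.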