26 changes: 26 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,26 @@
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.5.0
hooks:
- id: trailing-whitespace
exclude: ^(KernelBench/|results/.*\.json$)
- id: end-of-file-fixer
exclude: ^(KernelBench/|results/.*\.json$)
- id: check-yaml
- id: check-toml
- id: check-added-large-files

- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.2.2
hooks:
- id: ruff
args: [--fix, --exit-non-zero-on-fix]
exclude: ^(KernelBench/|src/kernelbench/prompts/model_ex_2\.py|src/kernelbench/prompts/model_new_ex_2\.py|results/.*\.json$)
- id: ruff-format
exclude: ^(KernelBench/|src/kernelbench/prompts/model_ex_2\.py|src/kernelbench/prompts/model_new_ex_2\.py|results/.*\.json$)

- repo: https://github.com/psf/black
rev: 24.2.0
hooks:
- id: black
exclude: ^(KernelBench/|src/kernelbench/prompts/model_ex_2\.py|src/kernelbench/prompts/model_new_ex_2\.py|results/.*\.json$)
1 change: 1 addition & 0 deletions .python-version
@@ -0,0 +1 @@
3.10
2 changes: 1 addition & 1 deletion LICENSE
@@ -19,4 +19,4 @@ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
SOFTWARE.
43 changes: 28 additions & 15 deletions README.md
@@ -17,8 +17,8 @@ We construct Kernel Bench to have 4 Levels of categories:
- **Level 2 🔗**: Simple fusion patterns (100 Problems)
A fused kernel would be faster than separated kernels (Conv + Bias + ReLU, Matmul + Scale + Sigmoid)
- **Level 3 ⚛️**: Full model architectures (50 Problems)
Optimize entire model architectures end-to-end (MobileNet, VGG, MiniGPT, Mamba)
- **Level 4 🤗**: Level Hugging Face
Optimize entire model architectures end-to-end (MobileNet, VGG, MiniGPT, Mamba)
- **Level 4 🤗**: Level Hugging Face
Optimize whole model architectures from HuggingFace

## ⚖️ Evaluation
@@ -27,9 +27,9 @@ To evaluate model-generated kernels, we need to check if they:
- **is correct ✅**: check against reference torch operators `n_correctness` times on randomized inputs.
- **is performant ⏱️**: compare against reference torch operators `n_trial` times to measure speedup between runtimes.

Check out `src/eval.py` for details on how we implement correctness check and timing.
Check out `src/eval.py` for details on how we implement correctness check and timing.
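
For intuition, here is a minimal sketch of that check. It is not the actual implementation in `src/eval.py`: the `get_inputs` helper, the tolerances, and the timing loop below are illustrative assumptions.

```python
import torch

def check_and_time(ref_model, new_model, get_inputs, n_correctness=5, n_trial=100):
    # Correctness: outputs must match the reference on several randomized inputs.
    for _ in range(n_correctness):
        inputs = [x.cuda() for x in get_inputs()]
        with torch.no_grad():
            ref_out, new_out = ref_model(*inputs), new_model(*inputs)
        assert torch.allclose(ref_out, new_out, atol=1e-2, rtol=1e-2), "output mismatch"

    # Performance: average wall-clock time over n_trial runs, measured with CUDA events.
    def avg_time_ms(model):
        inputs = [x.cuda() for x in get_inputs()]
        times = []
        for _ in range(n_trial):
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            start.record()
            model(*inputs)
            end.record()
            torch.cuda.synchronize()
            times.append(start.elapsed_time(end))
        return sum(times) / len(times)

    return avg_time_ms(ref_model) / avg_time_ms(new_model)  # speedup

```

The real evaluator also handles building the custom kernel, warm-up, and failure reporting, so treat this only as a conceptual outline.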

We provide a convenient script `scripts/run_and_check.py` to evaluate a single sample source file against a reference source file, checking correctness and computing speedup. You can use this to evaluate a model-generated kernel.
We provide a convenient script `scripts/run_and_check.py` to evaluate a single sample source file against a reference source file, checking correctness and computing speedup. You can use this to evaluate a model-generated kernel.

#### Overall Benchmark Metric

@@ -44,7 +44,7 @@ You can increase speedup threshold `p` to make the task more challenging.

#### Compute Overall Benchmark Performance

We provide a script `scripts/greedy_analysis.py` to compute the overall benchmark performance.
We provide a script `scripts/greedy_analysis.py` to compute the overall benchmark performance.
Since we need to capture **both** correctness and performance, we use a metric `fast_p`: fraction of tasks that are both correct and have a speedup greater than threshold `p`; speedup is computed as the ratio of PyTorch reference wall-clock time to generated kernel time.
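
As a concrete illustration, `fast_p` can be computed from per-task results like this (the field names here are hypothetical, not the output format of our scripts):

```python
# Each result carries a correctness flag and a speedup
# (reference runtime / generated-kernel runtime).
def fast_p(results, p=1.0):
    return sum(r["correct"] and r["speedup"] > p for r in results) / len(results)

results = [
    {"correct": True, "speedup": 1.8},
    {"correct": True, "speedup": 0.7},
    {"correct": False, "speedup": 2.3},
    {"correct": True, "speedup": 1.2},
]
print(fast_p(results, p=1.0))  # 0.5: two of the four kernels are correct and beat the baseline
```

With `p = 0` this reduces to the fraction of correct kernels; raising `p` requires correctness plus a larger speedup.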

<!-- TODO: update to provide fast_p measurement script -->
@@ -56,11 +56,11 @@ KernelBench/
├── assets/
├── KernelBench/ # Benchmark dataset files
├── src/ # KernelBench logic code
│ ├── unit_tests/
│ ├── unit_tests/
│ ├── prompts/
│ ├── ....
├── scripts/ # helpful scripts to run the benchmark
├── results/ # baseline times across hardware
├── results/ # baseline times across hardware
├── runs/ # where your runs will be stored
```

@@ -69,16 +69,29 @@ KernelBench/
conda create --name kernel-bench python=3.10
conda activate kernel-bench
pip install -r requirements.txt
pip install -e .
pip install -e .
```

### Alternative setup using `uv`
You can also use `uv` as a faster alternative to conda and pip:

```
# Install a Python environment using uv
uv python install 3.10

# Create a virtual environment and install dependencies
uv venv
source .venv/bin/activate
uv pip install -e .
```

To call LLM API providers, set the corresponding `{INFERENCE_SERVER_PROVIDER}_API_KEY` environment variable.

Running and profiling kernels require a GPU.
Running and profiling kernels require a GPU.
If you don't have a GPU available locally, you can set up [Modal](https://modal.com/). After creating an account, set up your Modal token by running `modal token new`. Then use the `generate_and_eval_single_sample_modal.py` script.

## 🚀 Usage
### Run on a single problem
### Run on a single problem
It is easier to get started with a single problem. This will fetch the problem, generate a sample, and evaluate the sample.

```
@@ -90,7 +103,7 @@ python3 scripts/generate_and_eval_single_sample.py dataset_src="huggingface" lev
# add .verbose_logging for more visibility
```

### Run on all problems
### Run on all problems

```
# 1. Generate responses and store kernels locally to runs/{run_name} directory
@@ -103,7 +116,7 @@ python3 scripts/eval_from_generations.py run_name=test_hf_level_1 dataset_src=lo
# add build_cache=True and num_cpu_workers=<num_cpu_workers> to the command
```
### Analyze the eval results to compute Benchmark Performance
We provide `scripts/benchmark_eval_analysis.py` to analyze the eval results and compute the success rate, timing metrics, and overall benchmark performance `fast_p`.
We provide `scripts/benchmark_eval_analysis.py` to analyze the eval results and compute the success rate, timing metrics, and overall benchmark performance `fast_p`.

```
python3 scripts/benchmark_eval_analysis.py run_name=test_hf_level_1 level=1 hardware=L40S_matx3 baseline=baseline_time_torch
@@ -114,7 +127,7 @@ We provide some reference baseline times a variety of NVIDIA GPUs across generat
## 🛣️ Upcoming Roadmap
- [ ] Triton Variant (Ongoing)
- [ ] Easy to use CoLab Notebook Example
- [ ] Push button flow on Modal / Cloud Provider
- [ ] Push button flow on Modal / Cloud Provider
- [ ] Integrate with more frameworks, such as [ThunderKittens](https://github.com/HazyResearch/ThunderKittens)
- [ ] Add backward pass
- [ ] Integrate with toolchains such as NCU
@@ -134,12 +147,12 @@ MIT. Check `LICENSE.md` for more details.
## Citation
```bibtex
@misc{ouyang2025kernelbenchllmswriteefficient,
title={KernelBench: Can LLMs Write Efficient GPU Kernels?},
title={KernelBench: Can LLMs Write Efficient GPU Kernels?},
author={Anne Ouyang and Simon Guo and Simran Arora and Alex L. Zhang and William Hu and Christopher Ré and Azalia Mirhoseini},
year={2025},
eprint={2502.10517},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2502.10517},
url={https://arxiv.org/abs/2502.10517},
}
```
14 changes: 14 additions & 0 deletions curl.sh
@@ -0,0 +1,14 @@
curl -X POST "https://tcapelle--kernel-benchmark-server-benchmarkservice-fastapi-app.modal.run/benchmark" \
-F "ref_file=@src/kernelbench/prompts/model_ex_1.py" \
-F "kernel_file=@src/kernelbench/prompts/model_new_ex_1.py" \
-F "num_correct_trials=5" \
-F "num_perf_trials=100" \
-F "verbose=false" | python -m json.tool


curl -X POST "https://tcapelle--kernel-benchmark-server-benchmarkservice-f-d98c17-dev.modal.run/benchmark" \
-F "ref_file=@src/kernelbench/prompts/model_ex_1.py" \
-F "kernel_file=@src/kernelbench/prompts/model_new_ex_1.py" \
-F "num_correct_trials=5" \
-F "num_perf_trials=100" \
-F "verbose=false" | python -m json.tool
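
A Python equivalent of the first request above might look like the following sketch. It assumes the third-party `requests` library; the endpoint URL and form fields are copied from `curl.sh` itself.

```python
import requests

URL = "https://tcapelle--kernel-benchmark-server-benchmarkservice-fastapi-app.modal.run/benchmark"

# Upload the reference model and the candidate kernel as multipart form files.
with open("src/kernelbench/prompts/model_ex_1.py", "rb") as ref, \
     open("src/kernelbench/prompts/model_new_ex_1.py", "rb") as kernel:
    resp = requests.post(
        URL,
        files={"ref_file": ref, "kernel_file": kernel},
        data={"num_correct_trials": 5, "num_perf_trials": 100, "verbose": "false"},
    )
print(resp.json())
```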
79 changes: 79 additions & 0 deletions pyproject.toml
@@ -0,0 +1,79 @@
[project]
name = "kernelbench"
version = "0.1.0"
description = "Benchmarking CUDA/Triton kernels made easy"
readme = "README.md"
requires-python = "==3.10.*"
dependencies = [
"anthropic>=0.34.2",
"archon-ai>=0.1.4",
"datasets>=3.5.0",
"einops>=0.8.1",
"google-generativeai>=0.8.4",
"modal>=0.73.136",
"ninja>=1.11.1.4",
"numpy>=2.2.4",
"openai>=1.69.0",
"packaging>=24.2",
"pydra-config>=0.0.14",
"pytest>=8.3.5",
"together>=1.5.4",
"torch==2.5.0",
"tqdm>=4.67.1",
"transformers>=4.50.3",
]


[tool.setuptools]
package-dir = {"" = "src"}
packages = ["kernelbench"]

[project.optional-dependencies]
dev = [
"weave>=0.51.39",
"black>=24.2.0",
"ruff>=0.2.2",
"pre-commit>=3.5.0",
"pytest>=8.3.5",
]

[tool.black]
line-length = 88
target-version = ["py310"]
include = '\.pyi?$'
exclude = '''
/(
\.git
| \.hg
| \.mypy_cache
| \.tox
| \.venv
| _build
| buck-out
| build
| dist
| KernelBench
| results
)/
'''

[tool.ruff]
line-length = 88
target-version = "py310"
select = ["E", "F", "I", "W", "B", "C4", "N"]
ignore = []
exclude = [
".git",
".venv",
"dist",
"build",
"KernelBench",
"results",
]

[tool.ruff.isort]
known-first-party = ["kernelbench"]

[tool.pytest.ini_options]
testpaths = ["tests"]
python_files = "test_*.py"
5 changes: 5 additions & 0 deletions pytest.ini
@@ -0,0 +1,5 @@
[pytest]
testpaths = tests
python_files = test_*.py
python_functions = test_*
addopts = -v
2 changes: 1 addition & 1 deletion results/timing/README.md
@@ -56,4 +56,4 @@ Learn more about Torch Compile [backends](https://pytorch.org/docs/stable/torch.

Thank you to [@PaliC](https://github.com/PaliC) from the PyTorch team for the expertise on various Torch configurations.

Thanks to [Modal](https://modal.com/) for sponsoring compute credits for us to collect runtime baselines on a variety of NVIDIA GPUs.
Thanks to [Modal](https://modal.com/) for sponsoring compute credits for us to collect runtime baselines on a variety of NVIDIA GPUs.