Flash-Kernels

A curated collection of highly optimized CUDA kernels produced by a multi-agent code-generation system.

Repository layout

Flash-Kernels/
├── kernels/              # Raw CUDA source for each standalone kernel
│   ├── common.cuh        # Shared device utilities (e.g. warp reductions)
│   ├── vectorized_layernorm.cu
│   ├── fused_linear_rowsum.cu
│   ├── softmax.cu
│   └── broadcast_multiply.cu
├── docs/
│   └── APPENDIX.md       # Full technical appendix from the original blog post
├── CMakeLists.txt        # Simple CUDA build script (optional)
└── .gitignore            # Standard ignores for build artefacts

Getting started

These kernels are provided as reference implementations. To read the code, open the files in kernels/; no build step is needed.

To compile the kernels into a static library (libflash_kernels.a):

# From the repository root
cmake -B build .
cmake --build build -j

You will need:

CUDA 11.4+ (tested up to 12.3)
CMake >= 3.18
A C++17-capable compiler (GCC 10+, Clang 12+, or MSVC 2019)
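
The version requirements above can be checked before configuring. The following is a minimal sketch, not part of the repository: the helper names version_ge and check_tool are hypothetical, and it assumes a Bash shell with a `sort` that supports `-V` (GNU coreutils and recent BSD/macOS). It only reports what it finds and does not check the host C++ compiler.

```shell
#!/usr/bin/env bash
# Sketch: report whether the CUDA toolkit and CMake meet the minimum versions
# listed in this README. Helper names are illustrative, not part of the repo.

# version_ge A B: succeeds when version A >= version B (relies on `sort -V`).
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# check_tool NAME FOUND REQUIRED: print a one-line status for a tool.
check_tool() {
    local name="$1" found="$2" required="$3"
    if [ -z "$found" ]; then
        echo "$name: not found (need >= $required)"
    elif version_ge "$found" "$required"; then
        echo "$name: $found (ok, need >= $required)"
    else
        echo "$name: $found (too old, need >= $required)"
    fi
}

# nvcc prints a line such as "Cuda compilation tools, release 12.3, V12.3.107".
cuda_version="$(nvcc --version 2>/dev/null | sed -n 's/.*release \([0-9][0-9.]*\).*/\1/p')"
# cmake prints a first line such as "cmake version 3.27.4".
cmake_version="$(cmake --version 2>/dev/null | sed -n '1s/.*version \([0-9][0-9.]*\).*/\1/p')"

check_tool "CUDA toolkit" "$cuda_version" "11.4"
check_tool "CMake" "$cmake_version" "3.18"
```

If either tool is reported as missing or too old, install or upgrade it before running the cmake commands above.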

Feel free to open issues or PRs if you port these kernels to other GPU architectures or frameworks.

Lossfunk/Flash-Kernels
