A curated collection of highly optimized CUDA kernels produced by a multi-agent code-generation system.
Flash-Kernels/
├── kernels/ # Raw CUDA source for each standalone kernel
│ ├── common.cuh # Shared device utilities (e.g. warp reductions)
│ ├── vectorized_layernorm.cu
│ ├── fused_linear_rowsum.cu
│ ├── softmax.cu
│ └── broadcast_multiply.cu
├── docs/
│ └── APPENDIX.md # Full technical appendix from the original blog post
├── CMakeLists.txt # Simple CUDA build script (optional)
└── .gitignore # Standard ignores for build artefacts
These kernels are provided as reference implementations. To read the code, open the files in kernels/.
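For orientation, the tree above describes common.cuh as holding shared device utilities such as warp reductions. The sketch below shows what a warp-level sum of that kind commonly looks like. It is not copied from common.cuh: the names warp_reduce_sum, warp_sum_demo, and run_warp_sum_demo are hypothetical, and the actual helpers in this repository may differ in name, signature, and data type.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper (not taken from common.cuh): sums `val` across the
// 32 lanes of a warp using register shuffles instead of shared memory.
__inline__ __device__ float warp_reduce_sum(float val) {
    // Halve the shuffle distance each step: 16, 8, 4, 2, 1.
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xffffffffu, val, offset);
    }
    return val;  // Only lane 0 is guaranteed to hold the full sum.
}

// Demo kernel: one warp reduces 32 input values into a single output.
__global__ void warp_sum_demo(const float* in, float* out) {
    float total = warp_reduce_sum(in[threadIdx.x]);
    if (threadIdx.x == 0) {
        *out = total;
    }
}

// Host-side driver: sums the values 1..32 on the GPU.
// Returns -1.0f if an allocation or copy fails.
float run_warp_sum_demo() {
    float h_in[32];
    for (int i = 0; i < 32; ++i) {
        h_in[i] = static_cast<float>(i + 1);
    }

    float* d_in = nullptr;
    float* d_out = nullptr;
    float h_out = -1.0f;

    if (cudaMalloc(&d_in, sizeof(h_in)) != cudaSuccess) {
        return -1.0f;
    }
    if (cudaMalloc(&d_out, sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return -1.0f;
    }

    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    warp_sum_demo<<<1, 32>>>(d_in, d_out);  // a single block of one warp

    // The device-to-host copy waits for the kernel to finish.
    if (cudaMemcpy(&h_out, d_out, sizeof(float),
                   cudaMemcpyDeviceToHost) != cudaSuccess) {
        h_out = -1.0f;
    }

    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Calling run_warp_sum_demo() from a main() compiled with nvcc should yield 528, the sum of 1 through 32. Shuffle-based reductions of this shape keep the partial sums in registers, which is why they often appear in row-wise kernels such as layer norm and softmax.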
To compile the kernels into a static library (libflash_kernels.a):
# From the repository root
cmake -B build .
cmake --build build -j
You will need:
- CUDA 11.4+ (tested up to 12.3)
- CMake >= 3.18
- A recent C++17 compiler (gcc 10+, clang 12+, or MSVC 2019)
Feel free to open issues or PRs if you port these kernels to other GPU architectures or frameworks.