Lossfunk/Flash-Kernels

Flash-Kernels

A curated collection of highly optimized CUDA kernels produced by a multi-agent code-generation system.

Repository layout

Flash-Kernels/
├── kernels/              # Raw CUDA source for each standalone kernel
│   ├── common.cuh        # Shared device utilities (e.g. warp reductions)
│   ├── vectorized_layernorm.cu
│   ├── fused_linear_rowsum.cu
│   ├── softmax.cu
│   └── broadcast_multiply.cu
├── docs/
│   └── APPENDIX.md       # Full technical appendix from the original blog post
├── CMakeLists.txt        # Simple CUDA build script (optional)
└── .gitignore            # Standard ignores for build artefacts
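
The layout above mentions warp reductions as an example of what lives in kernels/common.cuh. The contents of that file are not reproduced here, so the following is only a minimal sketch of how a warp-level sum reduction is commonly written with shuffle intrinsics. The names `warp_reduce_sum` and `sum_one_warp` are hypothetical and are not taken from the repository.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (NOT the repository's code): sums `val` across the
// 32 lanes of a warp. After the loop, lane 0 holds the full sum.
__device__ __forceinline__ float warp_reduce_sum(float val) {
    // 0xffffffff is the full-warp mask: all 32 lanes must participate.
    for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
        val += __shfl_down_sync(0xffffffffu, val, offset);
    }
    return val;
}

// Illustrative kernel: a single warp reduces 32 input values to one output.
__global__ void sum_one_warp(const float* in, float* out) {
    float v = in[threadIdx.x];
    v = warp_reduce_sum(v);
    if (threadIdx.x == 0) {
        *out = v;  // only lane 0 has the complete sum
    }
}

int main() {
    const int n = 32;  // exactly one warp
    float h_in[n];
    for (int i = 0; i < n; ++i) {
        h_in[i] = 1.0f;
    }

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    sum_one_warp<<<1, n>>>(d_in, d_out);

    float h_out = 0.0f;
    // The blocking copy also waits for the kernel to finish.
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("warp sum = %f\n", h_out);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

A shuffle-based reduction keeps the partial sums in registers, so it needs no shared memory and no `__syncthreads()`. This is one common reason helpers of this kind are shared across kernels such as layer norm and softmax.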

Getting started

These kernels are provided as reference implementations. If you simply want to read the code, just open the files in kernels/.

To compile the kernels into a static library (libflash_kernels.a):

# From the repository root
cmake -B build .
cmake --build build -j

You will need:

  • CUDA 11.4+ (tested up to 12.3)
  • CMake >= 3.18
  • A recent C++17 compiler (gcc 10+, clang 12+, or MSVC 2019)
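
To check the CUDA requirement before building, a small standalone program can query the runtime version and the first GPU. This is a sketch that is not part of the repository. It assumes the program is compiled with the same toolkit you intend to build with (for example `nvcc check_cuda.cu -o check_cuda`, where the file name is hypothetical).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // The runtime version is encoded as 1000 * major + 10 * minor,
    // so CUDA 11.4 is reported as 11040.
    int runtime_version = 0;
    if (cudaRuntimeGetVersion(&runtime_version) != cudaSuccess) {
        fprintf(stderr, "Could not query the CUDA runtime version.\n");
        return 1;
    }
    const int major = runtime_version / 1000;
    const int minor = (runtime_version % 1000) / 10;
    printf("CUDA runtime: %d.%d\n", major, minor);

    if (runtime_version < 11040) {
        fprintf(stderr, "The README asks for CUDA 11.4 or newer.\n");
        return 1;
    }

    int device_count = 0;
    if (cudaGetDeviceCount(&device_count) != cudaSuccess || device_count == 0) {
        fprintf(stderr, "No CUDA-capable device was found.\n");
        return 1;
    }

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "Could not read the properties of device 0.\n");
        return 1;
    }
    printf("GPU 0: %s (compute capability %d.%d)\n",
           prop.name, prop.major, prop.minor);
    return 0;
}
```

The program reports the runtime it was linked against, not the driver version. `nvcc --version` and `cmake --version` cover the remaining toolchain checks.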

Feel free to open issues or PRs if you port these kernels to other GPU architectures or frameworks.

About

because life's too short for slow kernels.
