Stars
Mirage Persistent Kernel: Compiling LLMs into a MegaKernel
Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS
OneFlow is a deep learning framework designed to be user-friendly, scalable, and efficient.
A lightweight design for computation-communication overlap.
Flash Attention implemented using CuTe.
Ring attention implementation with flash attention
Several optimization methods for half-precision general matrix multiplication (HGEMM) using tensor cores via the WMMA API and MMA PTX instructions.
High-performance GEMM implementation optimized for NVIDIA H100 GPUs, leveraging Hopper architecture's TMA, WGMMA, and Thread Block Clusters for near-peak theoretical performance.
Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
Applied AI experiments and examples for PyTorch