  • Microsoft


Mirage Persistent Kernel: Compiling LLMs into a MegaKernel

C++ 1,891 138 Updated Oct 20, 2025

CUTLASS and CuTe Examples

Cuda 91 13 Updated Oct 17, 2025

Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS

Cuda 431 46 Updated May 14, 2025

Coding CUDA every day!

Cuda 64 2 Updated Apr 19, 2025

Efficient Top-K implementation on the GPU

Cuda 188 22 Updated Apr 9, 2019

OneFlow is a deep learning framework designed to be user-friendly, scalable and efficient.

C++ 9,365 1,010 Updated Aug 20, 2025

A Quirky Assortment of CuTe Kernels

Python 633 50 Updated Oct 11, 2025

Nano vLLM

Python 7,121 911 Updated Aug 31, 2025

A lightweight design for computation-communication overlap.

Cuda 181 8 Updated Oct 10, 2025

Learning CUDA

C++ 4 Updated Oct 15, 2025

Effective transpose on Hopper GPU

Cuda 25 3 Updated Sep 6, 2025

Implement Flash Attention using Cute.

Cuda 96 8 Updated Dec 17, 2024
C++ 17 4 Updated Oct 17, 2025

Ring attention implementation with flash attention

Python 901 87 Updated Sep 10, 2025
Python 1,465 215 Updated Jun 26, 2025

Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruction.

Cuda 484 86 Updated Sep 8, 2024

Implementing llama3 from scratch (Chinese version)

Jupyter Notebook 980 97 Updated Jun 12, 2024

Perplexity GPU Kernels

C++ 495 65 Updated Sep 19, 2025
Cuda 121 16 Updated Mar 17, 2025
C++ 31 5 Updated Apr 1, 2025

oneAPI Math Library (oneMath)

C++ 720 172 Updated Oct 9, 2025

High-performance GEMM implementation optimized for NVIDIA H100 GPUs, leveraging Hopper architecture's TMA, WGMMA, and Thread Block Clusters for near-peak theoretical performance.

Cuda 8 Updated Dec 4, 2024
C++ 56 11 Updated Jul 17, 2025

Domain-specific language designed to streamline the development of high-performance GPU/CPU/Accelerators kernels

C++ 3,647 269 Updated Oct 20, 2025

Fastest kernels written from scratch

Cuda 375 53 Updated Sep 18, 2025
Python 140 10 Updated Dec 27, 2024

📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉

Cuda 8,067 798 Updated Oct 17, 2025

Applied AI experiments and examples for PyTorch

Python 299 29 Updated Aug 22, 2025