High-performance GEMM implementation optimized for NVIDIA H100 GPUs, leveraging Hopper architecture's TMA, WGMMA, and Thread Block Clusters for near-peak theoretical performance.

Cuda 8 Updated Dec 4, 2024

HuyNguyen-hust / hopper-gemm-101

Cuda 10 1 Updated Dec 22, 2024

pzhao-eng / FlashMLA

C++ 56 11 Updated Jul 17, 2025

tile-ai / tilelang

Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels

C++ 3,647 269 Updated Oct 20, 2025

pranjalssh / fast.cu

Fastest kernels written from scratch

Cuda 375 53 Updated Sep 18, 2025

yifuwang / symm-mem-recipes

Python 140 10 Updated Dec 27, 2024

xlite-dev / LeetCUDA

📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉

Cuda 8,067 798 Updated Oct 17, 2025

meta-pytorch / applied-ai

Applied AI experiments and examples for PyTorch

Python 299 29 Updated Aug 22, 2025

Hongwei Chen hwchen2017

Lists (1)

Computational Physics

Stars

zhuzilin / flash-attention-with-sink

mirage-project / mirage

leimao / CUTLASS-Examples

66RING / tiny-flash-attention

aryagxr / cuda

anilshanbhag / gpu-topk

Oneflow-Inc / oneflow

Dao-AILab / quack

GeeeekExplorer / nano-vllm

infinigence / FlashOverlap

ngocson2vn / learncuda

simveit / effective_transpose

luliyucoordinate / cute-flash-attention

ALPSim / ALPS

zhuzilin / ring-flash-attention

databricks / megablocks

Bruce-Lee-LY / cuda_hgemm

wdndev / llama3-from-scratch-zh

perplexityai / pplx-kernels

bertmaher / simplegemm

rchardx / cuda-gemm

uxlfoundation / oneMath

Faraz9877 / H100_GEMM