Guangzhou, China (UTC +08:00) · https://github.com/xlite-dev
- cache-dit (Public, forked from vipshop/cache-dit)
  🤗 CacheDiT: A Training-free and Easy-to-use Cache Acceleration Toolbox for Diffusion Transformers
  Python · Other · Updated Jun 18, 2025
- SpargeAttn (Public, forked from thu-ml/SpargeAttn)
  SpargeAttention: A training-free sparse attention that can accelerate any model inference.
  Cuda · Apache License 2.0 · Updated May 11, 2025
- CUDA-Learn-Notes (Public, forked from xlite-dev/LeetCUDA)
  📚 200+ Tensor/CUDA Core kernels, ⚡️ flash-attn-mma, ⚡️ HGEMM with WMMA, MMA, and CuTe (98%~100% of cuBLAS/FA2 TFLOPS 🎉🎉).
- sglang (Public, forked from sgl-project/sglang)
  SGLang is a structured generation language designed for large language models (LLMs). It makes your interaction with models faster and more controllable.
- MInference (Public, forked from microsoft/MInference)
  [NeurIPS'24 Spotlight, ICLR'25] Speeds up long-context LLM inference by computing attention with approximate, dynamic sparsity, reducing pre-filling latency by up to 10x on an …
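The idea MInference describes, computing attention only over a dynamically chosen sparse subset of keys, can be sketched in plain Python. This is a toy single-query illustration with names of my own choosing, not the repository's actual API:

```python
import math

def topk_sparse_attention(q, keys, values, k=2):
    """Toy one-query sparse attention: score every key, keep only the
    top-k scores (the dynamic sparse pattern), and softmax-average the
    corresponding values. Pure Python, no dependencies."""
    scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(len(q))
              for key in keys]
    # Indices of the k largest scores: the keys we actually attend to.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the kept scores only (max-subtracted for stability).
    m = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - m) for i in top}
    z = sum(exps.values())
    out = [0.0] * len(values[0])
    for i in top:
        w = exps[i] / z
        for d in range(len(out)):
            out[d] += w * values[i][d]
    return out
```

With `k` equal to the number of keys this reduces to ordinary dense attention; the speedup in the real system comes from choosing `k` much smaller than the context length and skipping the unselected blocks entirely.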
- vllm (Public, forked from vllm-project/vllm)
  A high-throughput and memory-efficient inference and serving engine for LLMs
- lite.ai.toolkit (Public, forked from xlite-dev/lite.ai.toolkit)
  🛠 A lite C++ toolkit containing 100+ awesome AI models, with support for MNN, NCNN, TNN, ONNXRuntime, and TensorRT. 🎉🎉
- ffpa-attn-mma (Public, forked from xlite-dev/ffpa-attn)
  📚 FFPA (Split-D): Yet another faster flash prefill attention with O(1) GPU SRAM complexity for headdim > 256, ~2x↑🎉 vs SDPA EA.
- Awesome-LLM-Inference (Public, forked from xlite-dev/Awesome-LLM-Inference)
  📖 A curated list of awesome LLM/VLM inference papers with code: WINT8/4, FlashAttention, PagedAttention, MLA, parallelism, etc. 🎉🎉
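One technique from that list, PagedAttention, rests on a simple bookkeeping idea: map each sequence's logical token slots to fixed-size physical cache blocks through a block table, allocating blocks on demand from a shared free pool. A minimal sketch follows; all class and method names here are my own invention, not any library's API:

```python
class PagedKVCache:
    """Toy block-table KV cache in the spirit of PagedAttention."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.store = {}    # (block_id, offset) -> cached kv entry
        self.tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}  # seq_id -> number of tokens cached so far

    def append(self, seq_id, kv):
        """Cache one token's KV entry, grabbing a new block if needed."""
        n = self.lengths.get(seq_id, 0)
        table = self.tables.setdefault(seq_id, [])
        if n % self.block_size == 0:  # current block is full (or first token)
            if not self.free:
                raise MemoryError("out of KV blocks")
            table.append(self.free.pop())
        block = table[n // self.block_size]
        self.store[(block, n % self.block_size)] = kv
        self.lengths[seq_id] = n + 1

    def get(self, seq_id, pos):
        """Translate a logical position through the block table."""
        block = self.tables[seq_id][pos // self.block_size]
        return self.store[(block, pos % self.block_size)]

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because blocks are fixed-size and returned to the pool on release, memory fragmentation stays bounded even with many sequences of very different lengths, which is the property the real systems exploit.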
- hgemm-mma (Public, forked from xlite-dev/HGEMM)
  ⚡️ Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs, achieving peak ⚡️ performance.
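The blocking that these HGEMM kernels perform on Tensor Cores, accumulating small output fragments from tiles of A and B, can be illustrated in pure Python. This is a toy sketch of the tiling structure only, not real kernel code:

```python
def tiled_matmul(A, B, tile=2):
    """Blocked GEMM: C = A @ B computed tile by tile, mirroring how
    WMMA/MMA kernels accumulate small Tensor Core fragments of C
    from matching tiles of A and B."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):          # tile row of C
        for j0 in range(0, m, tile):      # tile column of C
            for p0 in range(0, k, tile):  # reduction tile
                # Accumulate one C tile from one (A tile, B tile) pair.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = 0.0
                        for p in range(p0, min(p0 + tile, k)):
                            acc += A[i][p] * B[p][j]
                        C[i][j] += acc
    return C
```

On a GPU the payoff of this structure is data reuse: each tile of A and B is loaded into shared memory or registers once and reused across a whole tile of C, which is what pushes these kernels toward cuBLAS-level TFLOPS.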
- TensorRT-LLM (Public, forked from NVIDIA/TensorRT-LLM)
  TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficie…
- chain-of-draft (Public, forked from sileix/chain-of-draft)
  Code and data for the Chain-of-Draft (CoD) paper
  Python · Updated Mar 11, 2025
- FlashMLA (Public, forked from deepseek-ai/FlashMLA)
  FlashMLA: Efficient MLA Decoding Kernel for Hopper GPUs
- llm-compressor (Public, forked from vllm-project/llm-compressor)
  Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
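The simplest of the compression schemes such a library applies, symmetric per-tensor INT8 weight quantization, scales weights so the largest magnitude maps to ±127 and rounds everything else to that grid. A minimal sketch under my own function names, not llm-compressor's API:

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: one scale per tensor,
    chosen so the largest-magnitude weight maps to +/-127."""
    amax = max(abs(w) for w in weights)
    scale = amax / 127.0 if amax > 0 else 1.0
    # Round to the integer grid and clamp to the INT8 symmetric range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the INT8 codes."""
    return [qi * scale for qi in q]
```

The round trip loses at most half a quantization step (scale / 2) per weight, which is why per-tensor INT8 is usually safe for weights while activations often need finer-grained (per-channel or per-group) scales.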
- MHA2MLA (Public, forked from JT-Ushio/MHA2MLA)
  Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs
- xDiT (Public, forked from xdit-project/xDiT)
  xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism
  Python · Apache License 2.0 · Updated Feb 14, 2025
- unlock-deepseek (Public, forked from datawhalechina/unlock-deepseek)
  Interpretation, extension, and reproduction of the DeepSeek series of works.
  Python · Updated Feb 7, 2025
- ParaAttention (Public, forked from chengzeyi/ParaAttention)
  Context-parallel attention that accelerates DiT model inference with dynamic caching
  Python · Other · Updated Jan 3, 2025
- InternVL (Public, forked from OpenGVLab/InternVL)
  [CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o performance.
- cutlass (Public, forked from NVIDIA/cutlass)
  CUDA Templates for Linear Algebra Subroutines
- flash-attention (Public, forked from Dao-AILab/flash-attention)
  Fast and memory-efficient exact attention
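FlashAttention's memory efficiency comes from the online softmax: a running maximum, normalizer, and rescaled accumulator let softmax(scores) · values be computed in one streaming pass, so the full score row is never materialized. A toy scalar-valued sketch of that recurrence (not the library's API):

```python
import math

def online_softmax_weighted_sum(scores, values):
    """Streaming softmax(scores) . values: keep a running max m,
    normalizer l, and accumulator acc, rescaling the old state by
    exp(m_old - m_new) whenever a larger score arrives."""
    m = float("-inf")  # running maximum of scores seen so far
    l = 0.0            # running sum of exp(score - m)
    acc = 0.0          # running weighted sum, in the same scaled units
    for s, v in zip(scores, values):
        m_new = max(m, s)
        c = math.exp(m - m_new)  # rescale factor; 0.0 on the first step
        e = math.exp(s - m_new)
        l = c * l + e
        acc = c * acc + e * v
        m = m_new
    return acc / l
```

In the real kernel the same recurrence runs per query over tiles of keys and values held in SRAM, which is what makes exact attention possible without ever writing the N×N score matrix to HBM.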
- lmdeploy (Public, forked from InternLM/lmdeploy)
  LMDeploy is a toolkit for compressing, deploying, and serving LLMs
  Python · Apache License 2.0 · Updated Nov 15, 2024
- CogVideo (Public, forked from THUDM/CogVideo)
  Text- and image-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
  Python · Apache License 2.0 · Updated Nov 7, 2024
- cuda_hgemm (Public, forked from Bruce-Lee-LY/cuda_hgemm)
  Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.
  Cuda · MIT License · Updated Sep 8, 2024
- triton (Public, forked from triton-lang/triton)
  Development repository for the Triton language and compiler
- llm-action (Public, forked from liguodongiot/llm-action)
  This project shares the technical principles behind large models along with hands-on practical experience.
- TensorRT (Public, forked from NVIDIA/TensorRT)
  NVIDIA® TensorRT™, an SDK for high-performance deep learning inference, includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for inference applicat…
  C++ · Apache License 2.0 · Updated Jul 15, 2024
- TensorRT-Model-Optimizer (Public, forked from NVIDIA/TensorRT-Model-Optimizer)
  TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization and sparsity. It compresses deep learning models for downstream deployment frame…