Stars
Official implementation of "Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding"
🚀 Efficient implementations of state-of-the-art linear attention models
[ICLR 2025] Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule
verl: Volcano Engine Reinforcement Learning for LLMs
FlashInfer: Kernel Library for LLM Serving
Qwen3 is the large language model series developed by the Qwen team at Alibaba Cloud.
neuralmagic / nm-vllm
Forked from vllm-project/vllm. A high-throughput and memory-efficient inference and serving engine for LLMs.
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
[ICML'24] Data and code for our paper "Training-Free Long-Context Scaling of Large Language Models"
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Flash Attention tutorial written in Python, Triton, CUDA, and CUTLASS
📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
A tool for bandwidth measurements on NVIDIA GPUs.
Scalable and robust tree-based speculative decoding algorithm
ROCm / flash-attention
Forked from Dao-AILab/flash-attention. Fast and memory-efficient exact attention.
Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main)
The official PyTorch implementation of Google's Gemma models
SGLang is a fast serving framework for large language models and vision language models.
Development repository for the Triton language and compiler
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.
Repository of the paper "Accelerating Transformer Inference for Translation via Parallel Decoding"
[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
[ICLR 2024] Efficient Streaming Language Models with Attention Sinks