Stars
Official implementation of "Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding"
🚀 Efficient implementations of state-of-the-art linear attention models
[ICLR 2025] Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule
verl: Volcano Engine Reinforcement Learning for LLMs
FlashInfer: Kernel Library for LLM Serving
Qwen3 is the large language model series developed by the Qwen team at Alibaba Cloud.
neuralmagic / nm-vllm
Forked from vllm-project/vllm. A high-throughput and memory-efficient inference and serving engine for LLMs.
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
[ICML'24] Data and code for our paper "Training-Free Long-Context Scaling of Large Language Models"
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Flash Attention tutorial written in Python, Triton, CUDA, and CUTLASS
📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
A tool for bandwidth measurements on NVIDIA GPUs.
Scalable and robust tree-based speculative decoding algorithm
ROCm / flash-attention
Forked from Dao-AILab/flash-attention. Fast and memory-efficient exact attention.
Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main)
The official PyTorch implementation of Google's Gemma models
SGLang is a fast serving framework for large language models and vision language models.
Development repository for the Triton language and compiler
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.
Repository of the paper "Accelerating Transformer Inference for Translation via Parallel Decoding"
[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
[ICLR 2024] Efficient Streaming Language Models with Attention Sinks