Showing results

Official implementation of "Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding"

Python · 708 stars · 57 forks · Updated Oct 23, 2025
Python · 19 stars · Updated May 14, 2025

🚀 Efficient implementations of state-of-the-art linear attention models

Python · 3,931 stars · 313 forks · Updated Nov 27, 2025
C++ · 112 stars · 16 forks · Updated May 16, 2025

[ICLR 2025] Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule

Python · 379 stars · 23 forks · Updated Sep 15, 2025

Large Context Attention

Python · 753 stars · 52 forks · Updated Oct 13, 2025

verl: Volcano Engine Reinforcement Learning for LLMs

Python · 16,774 stars · 2,668 forks · Updated Nov 27, 2025

FlashInfer: Kernel Library for LLM Serving

Cuda · 4,141 stars · 583 forks · Updated Nov 28, 2025

Qwen3 is the large language model series developed by the Qwen team at Alibaba Cloud.

Python · 25,543 stars · 1,787 forks · Updated Oct 13, 2025

A high-throughput and memory-efficient inference and serving engine for LLMs

Python · 267 stars · 10 forks · Updated Oct 11, 2024

A fast communication-overlapping library for tensor/expert parallelism on GPUs.

C++ · 1,181 stars · 85 forks · Updated Aug 28, 2025

[ICML'24] Data and code for our paper "Training-Free Long-Context Scaling of Large Language Models"

Python · 441 stars · 22 forks · Updated Oct 16, 2024

The Internet's first guide to China's civil-service exam for programmers, jointly presented by three former big-tech engineers who have already entered the public sector.

27,474 stars · 3,725 forks · Updated Feb 11, 2022

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

4,969 stars · 532 forks · Updated Sep 25, 2024

Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS

Cuda · 453 stars · 50 forks · Updated May 14, 2025

📚 A curated list of awesome LLM/VLM inference papers with code: Flash-Attention, Paged-Attention, WINT8/4, parallelism, etc. 🎉

Python · 4,759 stars · 324 forks · Updated Nov 28, 2025

A tool for bandwidth measurements on NVIDIA GPUs.

C++ · 573 stars · 66 forks · Updated Apr 15, 2025

A scalable and robust tree-based speculative decoding algorithm

Python · 363 stars · 37 forks · Updated Jan 28, 2025
C++ · 150 stars · 42 forks · Updated Nov 9, 2025

Fast and memory-efficient exact attention

Python · 201 stars · 68 forks · Updated Oct 20, 2025

Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main)

Python · 112 stars · 9 forks · Updated Mar 20, 2025

The official PyTorch implementation of Google's Gemma models

Python · 5,576 stars · 564 forks · Updated May 30, 2025
Jupyter Notebook · 581 stars · 25 forks · Updated Aug 23, 2024

SGLang is a fast serving framework for large language models and vision language models.

Python · 20,462 stars · 3,545 forks · Updated Nov 28, 2025

Development repository for the Triton language and compiler

MLIR · 17,689 stars · 2,411 forks · Updated Nov 27, 2025

LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.

Python · 3,757 stars · 284 forks · Updated Nov 27, 2025

Repository of the paper "Accelerating Transformer Inference for Translation via Parallel Decoding"

Python · 121 stars · 8 forks · Updated Mar 15, 2024

[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding

Python · 1,306 stars · 78 forks · Updated Mar 6, 2025

C++ extensions in PyTorch

Python · 1,164 stars · 248 forks · Updated Jul 8, 2025

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks

Python · 7,132 stars · 395 forks · Updated Jul 11, 2024