Guangzhou, China (UTC +08:00) · https://github.com/xlite-dev
- cache-dit (Public, forked from vipshop/cache-dit)
  🤗 CacheDiT: A Training-free and Easy-to-use Cache Acceleration Toolbox for Diffusion Transformers
  Python · Other · Updated Jun 18, 2025
- SpargeAttn (Public, forked from thu-ml/SpargeAttn)
  SpargeAttention: A training-free sparse attention that can accelerate any model inference.
  Cuda · Apache License 2.0 · Updated May 11, 2025
- CUDA-Learn-Notes (Public, forked from xlite-dev/LeetCUDA)
  📚 200+ Tensor/CUDA Core kernels, ⚡️ flash-attn-mma, ⚡️ HGEMM with WMMA, MMA, and CuTe (98%~100% of cuBLAS/FA2 TFLOPS 🎉🎉).
- sglang (Public, forked from sgl-project/sglang)
  SGLang is a structured generation language designed for large language models (LLMs). It makes your interaction with models faster and more controllable.
- MInference (Public, forked from microsoft/MInference)
  [NeurIPS'24 Spotlight, ICLR'25] Speeds up long-context LLM inference by computing attention with approximate, dynamic sparsity, reducing pre-filling latency by up to 10x on an …
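The idea MInference describes, computing attention only over a dynamically chosen sparse subset of keys, can be sketched in plain Python. This is a toy single-query illustration with names of my own choosing, not the repository's actual API:

```python
import math

def topk_sparse_attention(q, keys, values, k=2):
    """Toy one-query sparse attention: score every key, keep only the
    top-k scores (the dynamic sparse pattern), and softmax-average the
    corresponding values. Pure Python, no dependencies."""
    scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(len(q))
              for key in keys]
    # Indices of the k largest scores: the keys we actually attend to.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the kept scores only (max-subtracted for stability).
    m = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - m) for i in top}
    z = sum(exps.values())
    out = [0.0] * len(values[0])
    for i in top:
        w = exps[i] / z
        for d in range(len(out)):
            out[d] += w * values[i][d]
    return out
```

With `k` equal to the number of keys this reduces to ordinary dense attention; the speedup in the real system comes from choosing `k` much smaller than the context length and skipping the unselected blocks entirely.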
- vllm (Public, forked from vllm-project/vllm)
  A high-throughput and memory-efficient inference and serving engine for LLMs
- lite.ai.toolkit (Public, forked from xlite-dev/lite.ai.toolkit)
  🛠 A lite C++ toolkit containing 100+ awesome AI models, with support for MNN, NCNN, TNN, ONNXRuntime, and TensorRT. 🎉🎉
- ffpa-attn-mma (Public, forked from xlite-dev/ffpa-attn)
  📚 FFPA (Split-D): Yet another faster flash prefill attention with O(1) GPU SRAM complexity for headdim > 256, ~2x↑🎉 vs SDPA EA.
- Awesome-LLM-Inference (Public, forked from xlite-dev/Awesome-LLM-Inference)
  📖 A curated list of awesome LLM/VLM inference papers with code: WINT8/4, FlashAttention, PagedAttention, MLA, parallelism, etc. 🎉🎉
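One technique from that list, PagedAttention, rests on a simple bookkeeping idea: map each sequence's logical token slots to fixed-size physical cache blocks through a block table, allocating blocks on demand from a shared free pool. A minimal sketch follows; all class and method names here are my own invention, not any library's API:

```python
class PagedKVCache:
    """Toy block-table KV cache in the spirit of PagedAttention."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # pool of physical block ids
        self.store = {}    # (block_id, offset) -> cached kv entry
        self.tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}  # seq_id -> number of tokens cached so far

    def append(self, seq_id, kv):
        """Cache one token's KV entry, grabbing a new block if needed."""
        n = self.lengths.get(seq_id, 0)
        table = self.tables.setdefault(seq_id, [])
        if n % self.block_size == 0:  # current block is full (or first token)
            if not self.free:
                raise MemoryError("out of KV blocks")
            table.append(self.free.pop())
        block = table[n // self.block_size]
        self.store[(block, n % self.block_size)] = kv
        self.lengths[seq_id] = n + 1

    def get(self, seq_id, pos):
        """Translate a logical position through the block table."""
        block = self.tables[seq_id][pos // self.block_size]
        return self.store[(block, pos % self.block_size)]

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because blocks are fixed-size and returned to the pool on release, memory fragmentation stays bounded even with many sequences of very different lengths, which is the property the real systems exploit.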
- hgemm-mma (Public, forked from xlite-dev/HGEMM)
  ⚡️ Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs, achieving peak ⚡️ performance.
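The blocking that these HGEMM kernels perform on Tensor Cores, accumulating small output fragments from tiles of A and B, can be illustrated in pure Python. This is a toy sketch of the tiling structure only, not real kernel code:

```python
def tiled_matmul(A, B, tile=2):
    """Blocked GEMM: C = A @ B computed tile by tile, mirroring how
    WMMA/MMA kernels accumulate small Tensor Core fragments of C
    from matching tiles of A and B."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):          # tile row of C
        for j0 in range(0, m, tile):      # tile column of C
            for p0 in range(0, k, tile):  # reduction tile
                # Accumulate one C tile from one (A tile, B tile) pair.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = 0.0
                        for p in range(p0, min(p0 + tile, k)):
                            acc += A[i][p] * B[p][j]
                        C[i][j] += acc
    return C
```

On a GPU the payoff of this structure is data reuse: each tile of A and B is loaded into shared memory or registers once and reused across a whole tile of C, which is what pushes these kernels toward cuBLAS-level TFLOPS.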
- TensorRT-LLM (Public, forked from NVIDIA/TensorRT-LLM)
  TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficie…
- chain-of-draft (Public, forked from sileix/chain-of-draft)
  Code and data for the Chain-of-Draft (CoD) paper
  Python · Updated Mar 11, 2025
- FlashMLA (Public, forked from deepseek-ai/FlashMLA)
  FlashMLA: Efficient MLA Decoding Kernel for Hopper GPUs
- llm-compressor (Public, forked from vllm-project/llm-compressor)
  Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
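The simplest of the compression schemes such a library applies, symmetric per-tensor INT8 weight quantization, scales weights so the largest magnitude maps to ±127 and rounds everything else to that grid. A minimal sketch under my own function names, not llm-compressor's API:

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: one scale per tensor,
    chosen so the largest-magnitude weight maps to +/-127."""
    amax = max(abs(w) for w in weights)
    scale = amax / 127.0 if amax > 0 else 1.0
    # Round to the integer grid and clamp to the INT8 symmetric range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights from the INT8 codes."""
    return [qi * scale for qi in q]
```

The round trip loses at most half a quantization step (scale / 2) per weight, which is why per-tensor INT8 is usually safe for weights while activations often need finer-grained (per-channel or per-group) scales.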
- MHA2MLA (Public, forked from JT-Ushio/MHA2MLA)
  Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs
- xDiT (Public, forked from xdit-project/xDiT)
  xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism
  Python · Apache License 2.0 · Updated Feb 14, 2025
- unlock-deepseek (Public, forked from datawhalechina/unlock-deepseek)
  Interpretation, extension, and reproduction of the DeepSeek series of works.
  Python · Updated Feb 7, 2025
- ParaAttention (Public, forked from chengzeyi/ParaAttention)
  Context-parallel attention that accelerates DiT model inference with dynamic caching
  Python · Other · Updated Jan 3, 2025
- InternVL (Public, forked from OpenGVLab/InternVL)
  [CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o performance.
- cutlass (Public, forked from NVIDIA/cutlass)
  CUDA Templates for Linear Algebra Subroutines
- flash-attention (Public, forked from Dao-AILab/flash-attention)
  Fast and memory-efficient exact attention
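FlashAttention's memory efficiency comes from the online softmax: a running maximum, normalizer, and rescaled accumulator let softmax(scores) · values be computed in one streaming pass, so the full score row is never materialized. A toy scalar-valued sketch of that recurrence (not the library's API):

```python
import math

def online_softmax_weighted_sum(scores, values):
    """Streaming softmax(scores) . values: keep a running max m,
    normalizer l, and accumulator acc, rescaling the old state by
    exp(m_old - m_new) whenever a larger score arrives."""
    m = float("-inf")  # running maximum of scores seen so far
    l = 0.0            # running sum of exp(score - m)
    acc = 0.0          # running weighted sum, in the same scaled units
    for s, v in zip(scores, values):
        m_new = max(m, s)
        c = math.exp(m - m_new)  # rescale factor; 0.0 on the first step
        e = math.exp(s - m_new)
        l = c * l + e
        acc = c * acc + e * v
        m = m_new
    return acc / l
```

In the real kernel the same recurrence runs per query over tiles of keys and values held in SRAM, which is what makes exact attention possible without ever writing the N×N score matrix to HBM.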
- lmdeploy (Public, forked from InternLM/lmdeploy)
  LMDeploy is a toolkit for compressing, deploying, and serving LLMs
  Python · Apache License 2.0 · Updated Nov 15, 2024
- CogVideo (Public, forked from THUDM/CogVideo)
  Text- and image-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
  Python · Apache License 2.0 · Updated Nov 7, 2024
- cuda_hgemm (Public, forked from Bruce-Lee-LY/cuda_hgemm)
  Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.
  Cuda · MIT License · Updated Sep 8, 2024
- triton (Public, forked from triton-lang/triton)
  Development repository for the Triton language and compiler
- llm-action (Public, forked from liguodongiot/llm-action)
  This project shares the technical principles behind large models along with hands-on practical experience.
- TensorRT (Public, forked from NVIDIA/TensorRT)
  NVIDIA® TensorRT™, an SDK for high-performance deep learning inference, includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for inference applicat…
  C++ · Apache License 2.0 · Updated Jul 15, 2024
- TensorRT-Model-Optimizer (Public, forked from NVIDIA/TensorRT-Model-Optimizer)
  TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization and sparsity. It compresses deep learning models for downstream deployment frame…