Stars
Exocompilation for productive programming of hardware accelerators
High-efficiency floating-point neural network inference operators for mobile, server, and Web
Hackable and optimized Transformers building blocks, supporting composable construction.
A Python package that extends the official PyTorch to easily unlock additional performance on Intel platforms
Development repository for the Triton language and compiler (see the kernel sketch after this list)
SGLang is a fast serving framework for large language models and vision language models.
MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
Official inference framework for 1-bit LLMs
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
A high-throughput and memory-efficient inference and serving engine for LLMs (see the offline-inference sketch after this list)
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
Example models using DeepSpeed
High-speed Large Language Model Serving for Local Deployment
Running large language models on a single GPU for throughput-oriented scenarios.
Disaggregated serving system for Large Language Models (LLMs).
Samples for CUDA developers that demonstrate features in the CUDA Toolkit
amd / blis
Forked from flame/blis
BLAS-like Library Instantiation Software Framework
Basic linear algebra subroutines for embedded optimization
[ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection
A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations
FractalTensor is a programming framework that introduces a novel approach to organizing data in deep neural networks (DNNs) as a list of lists of statically-shaped tensors, referred to as a FractalTensor.
TBLIS is a library and framework for performing tensor operations, especially tensor contraction, using efficient native algorithms.
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.
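For the Triton entry above, a minimal sketch of what writing a kernel looks like, modeled on Triton's vector-add tutorial; the block size and tensor sizes here are arbitrary choices for illustration:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        # Each program instance handles one contiguous block of elements.
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements  # guard the tail block
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    x = torch.rand(4096, device="cuda")
    y = torch.rand(4096, device="cuda")
    out = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 1024),)
    add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)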
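For the vLLM entry, a sketch of offline batched inference using vLLM's LLM and SamplingParams API; the model id and sampling settings are examples, not recommendations:

    from vllm import LLM, SamplingParams

    llm = LLM(model="facebook/opt-125m")  # any Hugging Face model id
    params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
    outputs = llm.generate(["Hello, my name is"], params)
    for output in outputs:
        print(output.outputs[0].text)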
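And for 🤗 Transformers, a one-liner that makes the tagline concrete, using the library's pipeline API; the default sentiment model is downloaded on first use:

    from transformers import pipeline

    # pipeline() picks a default pretrained model for the task.
    classifier = pipeline("sentiment-analysis")
    print(classifier("Triton kernels compile surprisingly fast."))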