Stars
The simplest, fastest repository for training/finetuning small-sized VLMs.
Example code for a LangGraph solution that uses a custom-made toolkit.
Code Transformer neural network components piece by piece
JetStream is a throughput- and memory-optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in the future; PRs welcome).
A highly optimized LLM inference acceleration engine for Llama and its variants.
Lightweight, standalone C++ inference engine for Google's Gemma models.
Disaggregated serving system for Large Language Models (LLMs).
SGLang is a fast serving framework for large language models and vision language models.
Model Compression Toolbox for Large Language Models and Diffusion Models
Implement a ChatGPT-like LLM in PyTorch from scratch, step by step
Video+code lecture on building nanoGPT from scratch
DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including CUDA, x86 and ARMv9.
Deep learning inference nodes for ROS / ROS2 with support for NVIDIA Jetson and TensorRT
Agent framework and applications built upon Qwen>=3.0, featuring Function Calling, MCP, Code Interpreter, RAG, Chrome extension, etc.
haileyschoelkopf / vllm
Forked from vllm-project/vllm. A high-throughput and memory-efficient inference and serving engine for LLMs
Adlik / smoothquantplus
Forked from mit-han-lab/smoothquant. [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
TinyChatEngine: On-Device LLM Inference Library
This project aims to share the technical principles behind large language models along with hands-on experience (LLM engineering and deploying LLM applications in production).
Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with …
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉