- A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well
- Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
- Thus Spake Long-Context Large Language Model | 24 Feb 2025 | Shanghai AI Lab & Huawei & Fudan University
- A Survey on Inference Optimization Techniques for Mixture of Experts Models | Dec 2024 | CUHK & Shanghai Jiao Tong University
- A Survey on Mixture of Experts | 8 Aug 2024 | HKUST (Guangzhou)
- A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms | 30 Sep 2024 | Beihang University & ETH Zurich & SenseTime & CUHK
- A Survey on Efficient Inference for Large Language Models | 22 Apr 2024 | Infinigence-AI
- Efficient Large Language Models: A Survey | 23 May 2024 | AWS
- Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models | 1 Jan 2024 |
- Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems | 23 Dec 2023 | CMU
- Challenges and Applications of Large Language Models | 19 Jul 2023 | University College London
- Towards Efficient Mixture of Experts: A Holistic Study of Compression Techniques
Papers & Technical Reports
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning | 22 Jan 2025 | DeepSeek
- DeepSeek-V3 Technical Report | 26 Dec 2024 | DeepSeek
- LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators | 31 Oct 2024 | Argonne National Laboratory
- Transformer: Attention Is All You Need | 2 Aug 2023 | Google
- RWKV: Reinventing RNNs for the Transformer Era | 11 Dec 2023 | Generative AI Commons
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces | 31 May 2024 | CMU
Chip Surveys
- 2024 Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective
- 2023 Lincoln AI Computing Survey (LAICS) Update
- 2022 AI and ML Accelerator Survey and Trends
- 2021 AI Accelerator Survey and Trends
- 2020 Survey of Machine Learning Accelerators
- 2019 Survey and Benchmarking of Machine Learning Accelerators
- NVIDIA
| GPU | FP16 dense FLOPS | Memory | Memory bandwidth | L2 cache | NVLink | PCIe | Architecture |
|---|---|---|---|---|---|---|---|
| GB200 | 5P | 384GB | 8.0TB/s | | 1.8TB/s | 128GB/s | Blackwell |
| GH200 | 985T | 141GB | 4.8TB/s | 60MB | 900GB/s | 128GB/s | Hopper |
| H100 | 985T | 80GB | 3.35TB/s | 50MB | 900GB/s | 128GB/s | Hopper |
| H800 | 985T | 80GB | 3.35TB/s | 50MB | 400GB/s | 64GB/s | Hopper |
| A100 | 312T | 80GB | 2.0TB/s | 40MB | 600GB/s | 64GB/s | Ampere |
| A800 | 312T | 80GB | 2.0TB/s | 80MB | 400GB/s | 128GB/s | Ampere |
| H20 | 148T | 141GB | 4.0TB/s | 60MB | 900GB/s | 128GB/s | Hopper |
| L40S | 362T | 48GB | 846GB/s | 96MB | / | 64GB/s | Ada Lovelace |
| RTX 4090 | 330T | 24GB | 1.0TB/s | 72MB | / | 64GB/s | Ada Lovelace |
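A quick way to read this table: peak FLOPS divided by memory bandwidth gives the roofline ridge point, and batch-1 decode sits orders of magnitude below it on every card, which is why memory bandwidth and capacity, not FLOPS, usually bound LLM inference. A minimal sketch of that arithmetic (spec numbers copied from the table above; the decode-intensity figure is a rough assumption):

```python
# Roofline-style back-of-the-envelope using the table above.
# Assumption: batch-1 fp16 decode does ~2 FLOPs per 2-byte weight read,
# i.e. about 1 FLOP/byte of arithmetic intensity (KV-cache traffic ignored).
SPECS = {  # name: (peak FP16 dense FLOPS, memory bandwidth in bytes/s)
    "H100": (985e12, 3.35e12),
    "A100": (312e12, 2.00e12),
    "H20": (148e12, 4.00e12),
    "RTX 4090": (330e12, 1.00e12),
}
DECODE_INTENSITY = 1.0  # rough FLOPs/byte for batch-1 fp16 decode

for name, (flops, bw) in SPECS.items():
    ridge = flops / bw  # FLOPs/byte where compute time equals memory time
    bound = "memory-bound" if DECODE_INTENSITY < ridge else "compute-bound"
    print(f"{name:9s} ridge ≈ {ridge:5.0f} FLOPs/byte -> batch-1 decode is {bound}")
```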
- Intel Gaudi 3
- AWS Trainium
- d-Matrix
- Etched Sohu
- Tenstorrent
- SambaNova RDU
- Groq LPU
- Cerebras
- Graphcore IPU
- Google TPU v4: TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings
- Google TPU v3: Scale MLPerf-0.6 models on Google TPU-v3 Pods
- Google TPU v2: Image Classification at Supercomputer Scale
- Google TPU v1: In-Datacenter Performance Analysis of a Tensor Processing Unit
- PIM Is All You Need: A CXL-Enabled GPU-Free System for Large Language Model Inference
- Make LLM Inference Affordable to Everyone: Augmenting GPU Memory with NDP-DIMM
- LoRA: Low-Rank Adaptation of Large Language Models | 16 Oct 2021 | Microsoft & CMU
- WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training | 23 Mar 2025 | Meta&University of California, San Diego
- Mist: Efficient Distributed Training of Large Language Models via Memory-Parallelism Co-Optimization
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | 13 Mar 2020 | NVIDIA
- CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts | 23 Sep 2024 | University of Virginia
- TRAINING ULTRA LONG CONTEXT LANGUAGE MODEL WITH FULLY PIPELINED DISTRIBUTED TRANSFORMER | 30 Aug 2024 | Microsoft
- Ring Attention
- DeepSpeed Ulysses
- DeepEP | DeepSeek
- MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism | 7 Apr 2025 | ByteDance Seed
- Efficiently Scaling Transformer Inference | Nov 2022 | Google
- Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
- COMET: Towards Practical W4A4KV4 LLMs Serving
- LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
- SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs
- Dovetail: A CPU/GPU Heterogeneous Speculative Decoding for LLM inference
- DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting
Communication & Compute Overlap: Tensor Parallelism
- cuBLASMp | NVIDIA
- FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation | 28 Apr 2025 | Peking University & Infini-AI
- Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler | 4 May 2025 | Peking University & ByteDance Seed
- TileLink: Generating Efficient Compute-Communication Overlapping Kernels
- NanoFlow: Towards Optimal Large Language Model Serving Throughput | 22 Aug 2024 | UW
- FLUX: FAST SOFTWARE-BASED COMMUNICATION OVERLAP ON GPUS THROUGH KERNEL FUSION | 23 Oct 2024 | ByteDance
- ISO: Overlap of Computation and Communication within Sequence For LLM Inference | 4 Sep 2024 | Baichuan Inc.
- Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping | 23 Sep 2024 | Microsoft
- T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives | 30 Jan 2024 | AMD
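The papers above chase the same basic pattern: split a tensor-parallel GEMM into chunks so each chunk's collective can be hidden behind the next chunk's compute. A hypothetical toy sketch of that pattern (not taken from any of the listed systems; assumes torch.distributed has already been initialized with an NCCL backend, e.g. via torchrun, one GPU per rank):

```python
import torch
import torch.distributed as dist

def row_parallel_linear_overlapped(x_shard, w_shard, n_chunks=4):
    """Row-parallel linear layer: each rank holds a shard of the weight's input
    dimension, so local partial outputs must be summed across ranks."""
    handles, outputs = [], []
    for x_c in x_shard.chunk(n_chunks, dim=0):      # split along the token dimension
        y_c = x_c @ w_shard                         # local partial matmul for this chunk
        # Launch the sum-reduce asynchronously; NCCL runs it on its own stream,
        # so it can overlap with the next chunk's matmul on the compute stream.
        handles.append(dist.all_reduce(y_c, op=dist.ReduceOp.SUM, async_op=True))
        outputs.append(y_c)
    for h in handles:
        h.wait()                                    # make sure every reduction finished
    return torch.cat(outputs, dim=0)
```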
MoE: All-to-All / Compute Overlap & Inference Systems
- LANCET: ACCELERATING MIXTURE-OF-EXPERTS TRAINING VIA WHOLE GRAPH COMPUTATION-COMMUNICATION OVERLAPPING | 30 Apr 2024 | AWS
- TUTEL: ADAPTIVE MIXTURE-OF-EXPERTS AT SCALE | 5 Jun 2023 | Microsoft
- MegaScale-Infer | ByteDance Seed
- Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts
- Klotski: Efficient Mixture-of-Expert Inference via Expert-Aware Multi-Batch Pipeline
- MiLo: Efficient Quantized MoE Inference with Mixture of Low-Rank Compensators
- Delta Decompression for MoE-based LLMs Compression
- Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models
- DAOP: Data-Aware Offloading and Predictive Pre-Calculation for Efficient MoE Inference
- PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization
- MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs | 18 Nov 2024 | UC Berkeley
- ZeRO-Offload++ | November 2023 | Microsoft
- ZeRO-Offload: Democratizing Billion-Scale Model Training | 18 Jan 2021 | Microsoft
- Efficient and Economic Large Language Model Inference with Attention Offloading | 3 May 2024 | Tsinghua University
- InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference | 8 Sep 2024 |
- Neo: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference | 2 Nov 2024 | Peking University
- Fast Inference of Mixture-of-Experts Language Models with Offloading | 28 Dec 2023 | Moscow Institute of Physics and Technology
- HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference | 6 Nov 2024 | Shanghai Jiao Tong University & CUHK
- FlexInfer: Breaking Memory Constraint via Flexible and Efficient Offloading for On-Device LLM Inference
- MOE-INFINITY: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache | 12 Mar 2025 | The University of Edinburgh
- Smart-Infinity: Fast Large Language Model Training using Near-Storage Processing on a Real System | 11 Mar 2024 |
- Glinthawk: A Two-Tiered Architecture for Offline LLM Inference | 11 Feb 2025 | microsoft
- Fast State Restoration in LLM Serving with HCache
Self-attention in a vanilla Transformer scales quadratically with sequence length, and a large body of work tries to make it cheaper, up to and including linear-complexity alternatives. The main directions are:
1. Efficient attention algorithms: faster, memory-efficient exact attention kernels (a minimal sketch of this idea follows the list below)
2. Attention head pruning: removing redundant attention heads
3. Approximate attention: sparse or low-rank approximations of the score matrix
4. Next-generation architectures: sub-quadratic sequence models such as RWKV and Mamba
5. Code optimizations: operator-level kernel rewriting
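A minimal sketch of direction 1 (illustrative only, not any particular library's kernel): tiled attention with an online softmax, the core idea behind FlashAttention-style memory-efficient exact attention. Shapes and block size are arbitrary demo choices; the result is checked against a naive reference.

```python
import numpy as np

def naive_attention(q, k, v):
    """Reference implementation that materializes the full (L, L) score matrix."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v

def tiled_attention(q, k, v, block=64):
    """Processes K/V in blocks with an online softmax; peak memory is O(L * block)."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    out = np.zeros_like(q)
    run_max = np.full(q.shape[0], -np.inf)   # running row-wise max of the scores
    denom = np.zeros(q.shape[0])             # running softmax denominator
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T * scale                                  # (L, block) scores only
        new_max = np.maximum(run_max, s.max(axis=-1))
        rescale = np.exp(run_max - new_max)                   # fix up earlier partial sums
        p = np.exp(s - new_max[:, None])
        out = out * rescale[:, None] + p @ vb
        denom = denom * rescale + p.sum(axis=-1)
        run_max = new_max
    return out / denom[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((256, 64)) for _ in range(3))
assert np.allclose(naive_attention(q, k, v), tiled_attention(q, k, v))
print("tiled online-softmax attention matches the naive result")
```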
- FLEX ATTENTION: A PROGRAMMING MODEL FOR GENERATING OPTIMIZED ATTENTION KERNELS | 7 Dec 2024 | Meta
- STAR ATTENTION: EFFICIENT LLM INFERENCE OVER LONG SEQUENCES | 26 Nov 2024 | Nvidia
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (MLA) |19 Jun 2024 | Deepseek
- FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision | 12 Jul 2024 |
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning | 17 Jul 2023 | Princeton University&Stanford University
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | 23 Jun 2022 | stanford
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints | 23 Dec 2023 | Google
- Fast Transformer Decoding: One Write-Head is All You Need (MQA) |6 Nov 2019 | Google
- Multi-Head Attention: Collaborate Instead of Concatenate | 20 May 2021 |
- Efficient Memory Management for Large Language Model Serving with PagedAttention | 12 Sep 2023 | UC Berkeley
- Slim attention | 7 Mar 2025 | openmachine
- Do We Really Need the KVCache for All Large Language Models | blog
- SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs | 17 Feb 2025 | HKU & UW & Microsoft & NVIDIA
- Towards Economical Inference: Enabling DeepSeek’s Multi-Head Latent Attention in Any Transformer-based LLMs
- TransMLA: Multi-Head Latent Attention Is All You Need
- X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression
- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving | 9 Jul 2024 | Moonshot AI
- MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool | 26 Jun 2024 | huawei cloud
- Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads | 20 Jan 2024 | huawei cloud
- Splitwise: Efficient Generative LLM Inference Using Phase Splitting | 20 May 2024 | Microsoft
- DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving | 6 Jun 2024 | Peking University & StepFun
- semi-PD: TOWARDS EFFICIENT LLM SERVING VIA PHASE-WISE DISAGGREGATED COMPUTATION AND UNIFIED STORAGE
- SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills | 31 Aug 2023 | Microsoft
- CacheGen: Fast Context Loading for Language Model Applications (prefix KV cache) | August 2024 | Microsoft
- Fast and Expressive LLM Inference with RadixAttention and SGLang | 6 Jun 2024 | Stanford University
- DeepSpeed | Microsoft
- TensorRT-LLM | Nvidia
- Accelerate | Hugging Face
- Ray-LLM | Ray
- LLaVA
- Megatron-LM | Nvidia
- NeMo | Nvidia
- torchtitan | PyTorch
- vLLM | UCB
- SGLang | UCB
- llama.cpp
- prima.cpp
- The Deep Learning Compiler: A Comprehensive Survey | 28 Aug 2020 | Beihang University&Tsinghua University
- MLIR: A Compiler Infrastructure for the End of Moore’s Law | 1 Mar 2020 | Google
- XLA | 2017 | Google
- TVM: An Automated End-to-End Optimizing Compiler for Deep Learning | 5 Oct 2018 | University of Washington
- MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs | 23 Feb 2024 | ByteDance
- OPT: Open Pre-trained Transformer Language Models | 21 Jun 2022 | Meta
- From bare metal to a 70B model: infrastructure set-up and scripts | Imbue
- Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures | 14 May 2025 | DeepSeek