- A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well
- Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
- Thus Spake Long-Context Large Language Model | 24 Feb 2025 | Shanghai AI Lab & Huawei & Fudan University
- A Survey on Inference Optimization Techniques for Mixture of Experts Models | Dec 2024 | CUHK & Shanghai Jiao Tong University
- A Survey on Mixture of Experts | 8 Aug 2024 | HKUST (Guangzhou)
- A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms | 30 Sep 2024 | Beihang University & ETH Zurich & SenseTime & CUHK
- A Survey on Efficient Inference for Large Language Models | 22 Apr 2024 | Infinigence-AI
- Efficient Large Language Models: A Survey | 23 May 2024 | AWS
- Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models | 1 Jan 2024 |
- Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems | 23 Dec 2023 | CMU
- Challenges and Applications of Large Language Models | 19 Jul 2023 | University College London
- Towards Efficient Mixture of Experts: A Holistic Study of Compression Techniques
Papers & Technical Reports
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning | 22 Jan 2025 | DeepSeek
- DeepSeek-V3 Technical Report | 26 Dec 2024 | DeepSeek
- LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators | 31 Oct 2024 | Argonne National Laboratory
- Transformer: Attention Is All You Need | 2 Aug 2023 | Google
- RWKV: Reinventing RNNs for the Transformer Era | 11 Dec 2023 | Generative AI Commons
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces | 31 May 2024 | CMU
Chip Surveys
- 2024 Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective
- 2023 Lincoln AI Computing Survey (LAICS) Update
- 2022 AI and ML Accelerator Survey and Trends
- 2021 AI Accelerator Survey and Trends
- 2020 Survey of Machine Learning Accelerators
- 2019 Survey and Benchmarking of Machine Learning Accelerators
- NVIDIA
| GPU | FP16 dense FLOPS | Memory | Memory bandwidth | L2 cache | NVLink | PCIe | Architecture |
|---|---|---|---|---|---|---|---|
| GB200 | 5P | 384GB | 8.0TB/s | | 1.8TB/s | 128GB/s | Blackwell |
| GH200 | 985T | 141GB | 4.8TB/s | 60MB | 900GB/s | 128GB/s | Hopper |
| H100 | 985T | 80GB | 3.35TB/s | 50MB | 900GB/s | 128GB/s | Hopper |
| H800 | 985T | 80GB | 3.35TB/s | 50MB | 400GB/s | 64GB/s | Hopper |
| A100 | 312T | 80GB | 2.0TB/s | 40MB | 600GB/s | 64GB/s | Ampere |
| A800 | 312T | 80GB | 2.0TB/s | 80MB | 400GB/s | 128GB/s | Ampere |
| H20 | 148T | 141GB | 4.0TB/s | 60MB | 900GB/s | 128GB/s | Hopper |
| L40S | 362T | 48GB | 846GB/s | 96MB | / | 64GB/s | Ada Lovelace |
| RTX 4090 | 330T | 24GB | 1.0TB/s | 72MB | / | 64GB/s | Ada Lovelace |
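A quick way to read this table: peak FLOPS divided by memory bandwidth gives the roofline ridge point, and batch-1 decode sits orders of magnitude below it on every card, which is why memory bandwidth and capacity, not FLOPS, usually bound LLM inference. A minimal sketch of that arithmetic (spec numbers copied from the table above; the decode-intensity figure is a rough assumption):

```python
# Roofline-style back-of-the-envelope using the table above.
# Assumption: batch-1 fp16 decode does ~2 FLOPs per 2-byte weight read,
# i.e. about 1 FLOP/byte of arithmetic intensity (KV-cache traffic ignored).
SPECS = {  # name: (peak FP16 dense FLOPS, memory bandwidth in bytes/s)
    "H100": (985e12, 3.35e12),
    "A100": (312e12, 2.00e12),
    "H20": (148e12, 4.00e12),
    "RTX 4090": (330e12, 1.00e12),
}
DECODE_INTENSITY = 1.0  # rough FLOPs/byte for batch-1 fp16 decode

for name, (flops, bw) in SPECS.items():
    ridge = flops / bw  # FLOPs/byte where compute time equals memory time
    bound = "memory-bound" if DECODE_INTENSITY < ridge else "compute-bound"
    print(f"{name:9s} ridge ≈ {ridge:5.0f} FLOPs/byte -> batch-1 decode is {bound}")
```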
- Intel Gaudi 3
- AWS Trainium
- d-Matrix
- Etched Sohu
- Tenstorrent
- SambaNova RDU
- Groq LPU
- Cerebras
- Graphcore IPU
- Google TPU v4: TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings
- Google TPU v3: Scale MLPerf-0.6 models on Google TPU-v3 Pods
- Google TPU v2: Image Classification at Supercomputer Scale
- Google TPU v1: In-Datacenter Performance Analysis of a Tensor Processing Unit
- PIM Is All You Need: A CXL-Enabled GPU-Free System for Large Language Model Inference
- Make LLM Inference Affordable to Everyone: Augmenting GPU Memory with NDP-DIMM
- LoRA: Low-Rank Adaptation of Large Language Models | 16 Oct 2021 | Microsoft & CMU
- WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training | 23 Mar 2025 | Meta&University of California, San Diego
- Mist: Efficient Distributed Training of Large Language Models via Memory-Parallelism Co-Optimization
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | 13 Mar 2020 | NVIDIA
- CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts | 23 Sep 2024 | University of Virginia
- TRAINING ULTRA LONG CONTEXT LANGUAGE MODEL WITH FULLY PIPELINED DISTRIBUTED TRANSFORMER | 30 Aug 2024 | Microsoft
- Ring Attention
- DeepSpeed Ulysses
- DeepEP | DeepSeek
- MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism | 7 Apr 2025 | ByteDance Seed
- Efficiently Scaling Transformer Inference | Nov 2022 | Google
- Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
- COMET: Towards Practical W4A4KV4 LLMs Serving
- LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
- SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs
- Dovetail: A CPU/GPU Heterogeneous Speculative Decoding for LLM inference
- DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting
Communication & Compute Overlap: Tensor Parallelism
- cuBLASMp | NVIDIA
- FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation | 28 Apr 2025 | Peking University & Infini-AI
- Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler | 4 May 2025 | Peking University & ByteDance Seed
- TileLink: Generating Efficient Compute-Communication Overlapping Kernels
- NanoFlow: Towards Optimal Large Language Model Serving Throughput | 22 Aug 2024 | UW
- FLUX: FAST SOFTWARE-BASED COMMUNICATION OVERLAP ON GPUS THROUGH KERNEL FUSION | 23 Oct 2024 | ByteDance
- ISO: Overlap of Computation and Communication within Sequence For LLM Inference | 4 Sep 2024 | Baichuan Inc.
- Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping | 23 Sep 2024 | Microsoft
- T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives | 30 Jan 2024 | AMD
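The papers above chase the same basic pattern: split a tensor-parallel GEMM into chunks so each chunk's collective can be hidden behind the next chunk's compute. A hypothetical toy sketch of that pattern (not taken from any of the listed systems; assumes torch.distributed has already been initialized with an NCCL backend, e.g. via torchrun, one GPU per rank):

```python
import torch
import torch.distributed as dist

def row_parallel_linear_overlapped(x_shard, w_shard, n_chunks=4):
    """Row-parallel linear layer: each rank holds a shard of the weight's input
    dimension, so local partial outputs must be summed across ranks."""
    handles, outputs = [], []
    for x_c in x_shard.chunk(n_chunks, dim=0):      # split along the token dimension
        y_c = x_c @ w_shard                         # local partial matmul for this chunk
        # Launch the sum-reduce asynchronously; NCCL runs it on its own stream,
        # so it can overlap with the next chunk's matmul on the compute stream.
        handles.append(dist.all_reduce(y_c, op=dist.ReduceOp.SUM, async_op=True))
        outputs.append(y_c)
    for h in handles:
        h.wait()                                    # make sure every reduction finished
    return torch.cat(outputs, dim=0)
```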
MoE: All-to-All / Compute Overlap & Inference Systems
- LANCET: ACCELERATING MIXTURE-OF-EXPERTS TRAINING VIA WHOLE GRAPH COMPUTATION-COMMUNICATION OVERLAPPING | 30 Apr 2024 | AWS
- TUTEL: ADAPTIVE MIXTURE-OF-EXPERTS AT SCALE | 5 Jun 2023 | Microsoft
- MegaScale-Infer | ByteDance Seed
- Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts
- Klotski: Efficient Mixture-of-Expert Inference via Expert-Aware Multi-Batch Pipeline
- MiLo: Efficient Quantized MoE Inference with Mixture of Low-Rank Compensators
- Delta Decompression for MoE-based LLMs Compression
- Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models
- DAOP: Data-Aware Offloading and Predictive Pre-Calculation for Efficient MoE Inference
- PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization
- MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs | 18 Nov 2024 | UC Berkeley
- ZeRO-Offload++ | November 2023 | Microsoft
- ZeRO-Offload: Democratizing Billion-Scale Model Training | 18 Jan 2021 | Microsoft
- Efficient and Economic Large Language Model Inference with Attention Offloading | 3 May 2024 | Tsinghua University
- InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference | 8 Sep 2024 |
- Neo: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference | 2 Nov 2024 | Peking University
- Fast Inference of Mixture-of-Experts Language Models with Offloading | 28 Dec 2023 | Moscow Institute of Physics and Technology
- HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference | 6 Nov 2024 | Shanghai Jiao Tong University & CUHK
- FlexInfer: Breaking Memory Constraint via Flexible and Efficient Offloading for On-Device LLM Inference
- MOE-INFINITY: Efficient MoE Inference on Personal Machines with Sparsity-Aware Expert Cache | 12 Mar 2025 | The University of Edinburgh
- Smart-Infinity: Fast Large Language Model Training using Near-Storage Processing on a Real System | 11 Mar 2024 |
- Glinthawk: A Two-Tiered Architecture for Offline LLM Inference | 11 Feb 2025 | microsoft
- Fast State Restoration in LLM Serving with HCache
Self-attention in a vanilla Transformer scales quadratically with sequence length, and a large body of work tries to make it cheaper, up to and including linear-complexity alternatives. The main directions are:
1. Efficient attention algorithms: faster, memory-efficient exact attention kernels (a minimal sketch of this idea follows the list below)
2. Attention head pruning: removing redundant attention heads
3. Approximate attention: sparse or low-rank approximations of the score matrix
4. Next-generation architectures: sub-quadratic sequence models such as RWKV and Mamba
5. Code optimizations: operator-level kernel rewriting
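A minimal sketch of direction 1 (illustrative only, not any particular library's kernel): tiled attention with an online softmax, the core idea behind FlashAttention-style memory-efficient exact attention. Shapes and block size are arbitrary demo choices; the result is checked against a naive reference.

```python
import numpy as np

def naive_attention(q, k, v):
    """Reference implementation that materializes the full (L, L) score matrix."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v

def tiled_attention(q, k, v, block=64):
    """Processes K/V in blocks with an online softmax; peak memory is O(L * block)."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    out = np.zeros_like(q)
    run_max = np.full(q.shape[0], -np.inf)   # running row-wise max of the scores
    denom = np.zeros(q.shape[0])             # running softmax denominator
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T * scale                                  # (L, block) scores only
        new_max = np.maximum(run_max, s.max(axis=-1))
        rescale = np.exp(run_max - new_max)                   # fix up earlier partial sums
        p = np.exp(s - new_max[:, None])
        out = out * rescale[:, None] + p @ vb
        denom = denom * rescale + p.sum(axis=-1)
        run_max = new_max
    return out / denom[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((256, 64)) for _ in range(3))
assert np.allclose(naive_attention(q, k, v), tiled_attention(q, k, v))
print("tiled online-softmax attention matches the naive result")
```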
- FLEX ATTENTION: A PROGRAMMING MODEL FOR GENERATING OPTIMIZED ATTENTION KERNELS | 7 Dec 2024 | Meta
- STAR ATTENTION: EFFICIENT LLM INFERENCE OVER LONG SEQUENCES | 26 Nov 2024 | Nvidia
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (MLA) |19 Jun 2024 | Deepseek
- FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision | 12 Jul 2024 |
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning | 17 Jul 2023 | Princeton University&Stanford University
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | 23 Jun 2022 | stanford
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints | 23 Dec 2023 | Google
- Fast Transformer Decoding: One Write-Head is All You Need (MQA) |6 Nov 2019 | Google
- Multi-Head Attention: Collaborate Instead of Concatenate | 20 May 2021 |
- Efficient Memory Management for Large Language Model Serving with PagedAttention | 12 Sep 2023 | UC Berkeley
- Slim attention | 7 Mar 2025 | openmachine
- Do We Really Need the KVCache for All Large Language Models | blog
- SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs | 17 Feb 2025 | HKU & UW & Microsoft & NVIDIA
- Towards Economical Inference: Enabling DeepSeek’s Multi-Head Latent Attention in Any Transformer-based LLMs
- TransMLA: Multi-Head Latent Attention Is All You Need
- X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression
- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving | 9 Jul 2024 | Moonshot AI
- MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool | 26 Jun 2024 | huawei cloud
- Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads | 20 Jan 2024 | huawei cloud
- Splitwise: Efficient Generative LLM Inference Using Phase Splitting | 20 May 2024 | Microsoft
- DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving | 6 Jun 2024 | Peking University & StepFun
- semi-PD: TOWARDS EFFICIENT LLM SERVING VIA PHASE-WISE DISAGGREGATED COMPUTATION AND UNIFIED STORAGE
- SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills | 31 Aug 2023 | Microsoft
- CacheGen: Fast Context Loading for Language Model Applications (prefix KV cache) | August 2024 | Microsoft
- Fast and Expressive LLM Inference with RadixAttention and SGLang | 6 Jun 2024 | Stanford University
- DeepSpeed | Microsoft
- TensorRT-LLM | Nvidia
- Accelerate | Hugging Face
- Ray-LLM | Ray
- LLaVA
- Megatron-LM | Nvidia
- NeMo | Nvidia
- torchtitan | PyTorch
- vLLM | UCB
- SGLang | UCB
- llama.cpp
- prima.cpp
- The Deep Learning Compiler: A Comprehensive Survey | 28 Aug 2020 | Beihang University&Tsinghua University
- MLIR: A Compiler Infrastructure for the End of Moore’s Law | 1 Mar 2020 | Google
- XLA | 2017 | Google
- TVM: An Automated End-to-End Optimizing Compiler for Deep Learning | 5 Oct 2018 | University of Washington
- MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs | 23 Feb 2024 | ByteDance
- OPT: Open Pre-trained Transformer Language Models | 21 Jun 2022 | Meta
- From bare metal to a 70B model: infrastructure set-up and scripts | Imbue
- Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures | 14 May 2025 | DeepSeek