Awesome-LLM-System-Papers

This is a (non-comprehensive) list of LLM system papers maintained by ALCHEM Lab. Feel free to open a pull request or an issue if we have missed any interesting papers!

Algorithm-System Co-Design

  • Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (JMLR'21) link to paper (see the routing sketch after this list)
  • Scalable and Efficient MoE Training for Multitask Multilingual Models (arXiv'21) link to paper
  • DeepSpeed-MOE: Advancing Mixture of Experts Inference and Training to Power Next-Generation AI Scale (ICML'22) link to paper
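
A recurring idea in the papers above is sparse expert routing. Below is a minimal sketch of top-1 ("switch") routing in the spirit of Switch Transformers: a learned gate scores the experts for each token, the token is sent to its single highest-scoring expert, and that expert's output is scaled by the gate probability. The shapes, weights, and variable names are illustrative assumptions, not code from any of the listed systems.

```python
# Minimal sketch of top-1 ("switch") expert routing. Sizes, weights, and
# names are illustrative assumptions, not code from Switch Transformers,
# DeepSpeed-MoE, or any other listed system.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_experts = 8, 16, 4

tokens = rng.standard_normal((n_tokens, d_model))
w_gate = rng.standard_normal((d_model, n_experts))              # router weights
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

# Router: softmax over expert logits, keep only the top-1 expert per token.
logits = tokens @ w_gate
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
chosen = probs.argmax(axis=-1)                                  # (n_tokens,)

# Dispatch: each token visits only its chosen expert; the expert output is
# scaled by the router probability so the gate stays differentiable.
out = np.zeros_like(tokens)
for e in range(n_experts):
    idx = np.where(chosen == e)[0]
    if idx.size:
        out[idx] = (tokens[idx] @ experts[e]) * probs[idx, e:e + 1]

print("tokens per expert:", np.bincount(chosen, minlength=n_experts))
```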

LLM Inference (Serving) Systems

Single-GPU Systems

  • TurboTransformers: An Efficient GPU Serving System For Transformer Models (PPoPP'21) link to paper (see the batching sketch after this list)
  • PetS: A Unified Framework for Parameter-Efficient Transformers Serving (ATC'22) link to paper
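
Single-GPU serving systems such as TurboTransformers put much of their effort into batching variable-length requests without wasting computation on padding. The sketch below shows one simple length-aware batching idea: sort queued requests by length so each padded batch wastes few tokens. The request format, batch size, and policy are illustrative assumptions, not TurboTransformers' actual scheduler.

```python
# Minimal sketch of length-aware batching for variable-length requests.
# Request tuples, batch size, and the sort-by-length policy are assumptions
# for illustration only.
queue = [("r0", 12), ("r1", 87), ("r2", 15), ("r3", 90), ("r4", 10)]  # (id, length)
batch_size = 2

ordered = sorted(queue, key=lambda req: req[1])          # group similar lengths together
for i in range(0, len(ordered), batch_size):
    group = ordered[i:i + batch_size]
    pad_to = max(length for _, length in group)          # pad the batch to its longest request
    waste = sum(pad_to - length for _, length in group)  # padding tokens computed for nothing
    print([rid for rid, _ in group], "padded to", pad_to, "wasted tokens:", waste)
```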

Distributed Systems

  • Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI'22) link to paper
  • DeepSpeed-inference: enabling efficient inference of transformer models at unprecedented scale (SC'22) link to paper
  • EnergonAI: An Inference System for 10-100 Billion Parameter Transformer Models (arXiv'22) link to paper
  • PETALS: Collaborative Inference and Fine-tuning of Large Models (NeurIPS'22 Workshop WBRC) link to paper
  • SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification (preprint'23) link to paper
  • Fast Distributed Inference Serving for Large Language Models (arXiv'23) link to paper
  • An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs (arXiv'23) link to paper
  • Accelerating LLM Inference with Staged Speculative Decoding (arXiv'23) link to paper
  • Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP'23) link to paper (see the block-table sketch after this list)
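
PagedAttention manages the KV cache in fixed-size blocks addressed through a per-sequence block table, so memory is allocated on demand rather than reserved up front for the maximum sequence length. The sketch below shows only the block-table bookkeeping (no attention math); the class names, block size, and pool size are illustrative assumptions, not the vLLM API.

```python
# Minimal sketch of a paged KV cache with a per-sequence block table, in the
# spirit of PagedAttention. Names and sizes are illustrative assumptions.
BLOCK_SIZE = 16   # tokens per KV block

class BlockAllocator:
    """Pool of physical KV-cache blocks shared by all sequences."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        if not self.free:
            raise MemoryError("KV cache exhausted")
        return self.free.pop()

    def release(self, blocks):
        self.free.extend(blocks)

class Sequence:
    """Maps a logical token position to a (physical block, offset) slot."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []        # logical block index -> physical block id
        self.length = 0

    def append_token(self):
        if self.length % BLOCK_SIZE == 0:                  # current block is full
            self.block_table.append(self.allocator.allocate())
        self.length += 1

    def slot(self, pos):
        return self.block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

allocator = BlockAllocator(num_blocks=64)
seq = Sequence(allocator)
for _ in range(40):                                        # decode 40 tokens
    seq.append_token()
print(seq.block_table, seq.slot(39))                       # 3 blocks used; token 39 -> (block, offset 7)
```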

LLM Training Systems

Single-GPU Systems

  • CRAMMING: Training a Language Model on a Single GPU in One Day (arXiv'22) link to paper
  • Easy and Efficient Transformer: Scalable Inference Solution For Large NLP Model (arXiv'22) link to paper
  • High-throughput Generative Inference of Large Language Models with a Single GPU (arXiv'23) link to paper (see the offloading sketch after this list)
  • ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs (arXiv'23) link to paper
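
The single-GPU generative-inference work above relies heavily on offloading: weights live in CPU or disk memory and are streamed to the GPU one layer at a time. The NumPy simulation below illustrates only that streaming pattern; the toy layer, sizes, and memory labels are assumptions, and real systems such as the one in "High-throughput Generative Inference of Large Language Models with a Single GPU" also schedule activations and the KV cache across the memory hierarchy.

```python
# Minimal sketch of layer-by-layer weight offloading for single-GPU inference.
# The toy layer, sizes, and "fast"/"slow" memory labels are illustrative
# assumptions, not any system's actual offloading policy.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers = 32, 6

# All layer weights start "offloaded" (think CPU/NVMe); only one layer's
# weights are resident in "fast" memory (think GPU) at any time.
offloaded = [rng.standard_normal((d_model, d_model)) for _ in range(n_layers)]

def run_layer(x, w):
    return np.maximum(x @ w, 0.0)               # toy layer: linear + ReLU

x = rng.standard_normal((4, d_model))           # a small batch of hidden states
for i in range(n_layers):
    w_fast = np.array(offloaded[i], copy=True)  # "upload" layer i weights
    x = run_layer(x, w_fast)
    del w_fast                                  # free fast memory before the next layer
print(x.shape)
```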

Distributed Systems

  • ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (SC'20) link to paper (see the sharding sketch after this list)
  • Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism (arXiv'20) link to paper
  • PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models (ICML'21) link to paper
  • Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (SC'21) link to paper
  • TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models (ICML'21) link to paper
  • FastMoE: A Fast Mixture-of-Expert Training System (arXiv'21) link to paper
  • Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model (arXiv'22) link to paper
  • Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning (OSDI'22) link to paper
  • LightSeq2: Accelerated Training for Transformer-Based Models on GPUs (SC'22) link to paper
  • Pathways: Asynchronous Distributed Dataflow for ML (arXiv'22) link to paper
  • FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (NeurIPS'22) link to paper
  • Varuna: Scalable, Low-cost Training of Massive Deep Learning Models (EuroSys'22) link to paper
  • FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models (PPoPP'22) link to paper
  • PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing (arXiv'23) link to paper
  • Mobius: Fine Tuning Large-Scale Models on Commodity GPU Servers (ASPLOS'23) link to paper
  • Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression (ASPLOS'23) link to paper
  • ZeRO++: Extremely Efficient Collective Communication for Giant Model Training (arXiv'23) link to paper
  • A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training (ICS'23) link to paper
  • BPIPE: Memory-Balanced Pipeline Parallelism for Training Large Language Models (ICML'23) link to paper
  • Optimized Network Architectures for Large Language Model Training with Billions of Parameters (arXiv'23) link to paper
  • SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient (arXiv'23) link to paper
  • Blockwise Parallel Transformer for Large Context Models (NeurIPS'23) link to paper
  • Ring Attention with Blockwise Transformers for Near-Infinite Context (arXiv'23) link to paper
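
A recurring idea in the distributed training papers above is ZeRO-style partitioning: instead of every data-parallel rank holding a full copy of the optimizer state, each rank owns 1/N of it, updates only its shard, and the updated parameters are then all-gathered. The single-process NumPy simulation below illustrates that partitioning only; the Adam constants, shard layout, and the concatenate-as-all-gather step are illustrative assumptions, not the DeepSpeed implementation.

```python
# Minimal single-process sketch of ZeRO-style optimizer-state sharding.
# Constants, shard layout, and the simulated all-gather are assumptions.
import numpy as np

world_size = 4
params = np.zeros(16)
grads = np.ones_like(params)                     # pretend gradients were already all-reduced
shards = np.array_split(np.arange(params.size), world_size)

# Each rank keeps Adam moments only for its own 1/N slice of the parameters.
m = {r: np.zeros(len(shards[r])) for r in range(world_size)}
v = {r: np.zeros(len(shards[r])) for r in range(world_size)}
lr, b1, b2, eps = 1e-1, 0.9, 0.999, 1e-8

updated = []
for rank in range(world_size):                   # in a real system these run in parallel
    idx = shards[rank]
    g = grads[idx]
    m[rank] = b1 * m[rank] + (1 - b1) * g
    v[rank] = b2 * v[rank] + (1 - b2) * g * g
    updated.append(params[idx] - lr * m[rank] / (np.sqrt(v[rank]) + eps))

# Simulated all-gather: every rank reassembles the full updated parameter vector.
params = np.concatenate(updated)
print(params.shape, params[:4])
```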

General MLSys-Related Techniques (Not Complete)

  • Efficient GPU Spatial-Temporal Multitasking (TPDS'14) link to paper
  • Enabling preemptive multiprogramming on GPUs (ISCA'14) link to paper
  • Chimera: Collaborative Preemption for Multitasking on a Shared GPU (ASPLOS'15) link to paper
  • Simultaneous Multikernel GPU: Multi-tasking Throughput Processors via Fine-Grained Sharing (HPCA'16) link to paper
  • FLEP: Enabling Flexible and Efficient Preemption on GPUs (ASPLOS'17) link to paper
  • Dynamic Resource Management for Efficient Utilization of Multitasking GPUs (ASPLOS'17) link to paper
  • Mesh-TensorFlow: Deep Learning for Supercomputers (NeurIPS'18) link to paper
  • PipeDream: Fast and Efficient Pipeline Parallel DNN Training (SOSP'19) link to paper
  • GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism (NeurIPS'19) link to paper (see the schedule sketch after this list)
  • PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications (OSDI'20) link to paper
  • Microsecond-scale Preemption for Concurrent GPU-accelerated DNN Inferences (OSDI'22) link to paper
  • Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models (ASPLOS'23) link to paper
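
Several entries above (GPipe, PipeDream) revolve around pipeline schedules: a batch is split into micro-batches so that pipeline stages overlap instead of idling. The sketch below prints a simple GPipe-style forward schedule; the stage and micro-batch counts are illustrative assumptions, and backward passes, pipeline bubbles, and weight updates are omitted.

```python
# Minimal sketch of a GPipe-style forward schedule: micro-batch m reaches
# stage s at time step s + m. Counts are illustrative assumptions.
n_stages, n_micro = 4, 6

schedule = {}
for m in range(n_micro):
    for s in range(n_stages):
        schedule.setdefault(s + m, []).append((s, m))

for t in sorted(schedule):
    busy = ", ".join(f"stage{s}:mb{m}" for s, m in schedule[t])
    print(f"t={t}: {busy}")

# An un-split batch would keep only one stage busy at a time; with n_micro
# micro-batches the forward pass finishes in n_stages + n_micro - 1 steps
# with the stages working concurrently.
```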

LLM Algorithm Papers Recommended for System Researchers

  • Attention Is All You Need (NeurIPS'17) link to paper (see the attention sketch after this list)
  • Language Models are Unsupervised Multitask Learners (preprint from OpenAI) link to paper
  • Improving Language Understanding by Generative Pretraining (preprint from OpenAI) link to paper
  • Language Models are Few-Shot Learners (NeurIPS'20) link to paper
  • Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (JMLR'20) link to paper
  • Multitask Prompted Training Enables Zero-Shot Task Generalization (ICLR'22) link to paper
  • Finetuned Language Models are Zero-Shot Learners (ICLR'22) link to paper
  • GLaM: Efficient Scaling of Language Models with Mixture-of-Experts (ICML'22) link to paper
  • LaMDA: Language Models for Dialog Applications (arXiv'22) link to paper
  • PaLM: Scaling Language Modeling with Pathways (arXiv'22) link to paper
  • OPT: Open Pre-trained Transformer Language Models (arXiv'22) link to paper
  • Holistic Evaluation of Language Models (arXiv'22) link to paper
  • BLOOM: A 176B-Parameter Open-Access Multilingual Language Model (arXiv'23) link to paper
  • LLaMA: Open and Efficient Foundation Language Models (arXiv'23) link to paper
  • Training Compute-Optimal Large Language Models (preprint from DeepMind) link to paper
  • Scaling Laws for Neural Language Models (preprint) link to paper
  • Scaling Language Models: Methods, Analysis & Insights from Training Gopher (preprint from DeepMind) link to paper
  • LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models (arXiv'23) link to paper
  • RWKV: Reinventing RNNs for the Transformer Era (arXiv'23) link to paper
  • LongNet: Scaling Transformers to 1,000,000,000 Tokens (arXiv'23) link to paper
  • SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference (arXiv'23) link to paper
  • FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning (arXiv'23) link to paper
  • Retentive Network: A Successor to Transformers for Large Language Models (arXiv'23) link to paper
  • TransNormer: Scaling TransNormer to 175 Billion Parameters (arXiv'23) link to paper
  • Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding (arXiv'23) link to paper
  • From Sparse to Soft Mixtures of Experts (arXiv'23) link to paper
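
The kernel underlying nearly every paper in this list is scaled dot-product attention from "Attention Is All You Need": softmax(QK^T / sqrt(d_k)) V, usually with a causal mask for autoregressive decoding. The sketch below is a single-head NumPy version with illustrative shapes; multi-head projections, batching, and the memory-efficient tiling of FlashAttention are omitted.

```python
# Minimal single-head scaled dot-product attention with a causal mask.
# Shapes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_k = 5, 8
Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))
V = rng.standard_normal((seq_len, d_k))

scores = Q @ K.T / np.sqrt(d_k)                        # (seq_len, seq_len)

# Causal mask: a token attends only to itself and earlier positions.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V                                      # (seq_len, d_k)
print(out.shape)
```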

Surveys

  • A Survey of Large Language Models (arXiv'23) link to paper
  • Challenges and Applications of Large Language Models (arXiv'23) link to paper

Other Useful Resources
