
ISO: Overlap of Computation and Communication
within Sequence for LLM Inference

Bin Xiao, Lei Su
Baichuan Inc.
{xiaobin, sulei}@baichuan-inc.com

arXiv:2409.11155v1 [cs.DC] 4 Sep 2024

Abstract
In the realm of Large Language Model (LLM) inference, the inherent structure of
transformer models coupled with the multi-GPU tensor parallelism strategy leads
to a sequential execution of computation and communication. This results in sub-
stantial underutilization of computing resources during the communication phase.
To mitigate this inefficiency, various techniques have been developed to optimize
the use of computational power throughout the communication process. These
strategies primarily involve overlapping matrix computations and communications,
as well as interleaving micro-batches across different requests. Nonetheless, these
approaches either fall short of achieving ideal overlap or impose certain limitations
on their application. To overcome these challenges, this paper introduces a novel
strategy for computation-communication overlap that operates at the sequence
level. This method not only enhances the degree of overlap but also minimizes the
constraints on its applicability. Experimental evaluations conducted using 30B and 70B
models have demonstrated significant improvements in efficiency. Specifically, the
proposed technique reduces time consumption by approximately 35% on the 4090 GPU
and by roughly 15% on the A800 GPU during the prefill stage of LLM inference.

1 Introduction
Generative Large Language Models (LLMs), epitomized by groundbreaking architectures like GPT-
2[9], ChatGPT[1], and GPT-3[3], have revolutionized the landscape of artificial intelligence. These
sophisticated models exhibit remarkable versatility, seamlessly transitioning from crafting imaginative
narratives to engaging in lifelike dialogues with humans. Their deep comprehension of natural
language has elevated human-computer interaction, automating complex tasks that demand nuanced
contextual understanding.
As these models have grown in scale and expanded their input sequence lengths, a new challenge
has emerged: traditional hardware limitations prevent a single processing unit from handling the
immense model weights or kv cache sizes. In response, tensor parallelism[10] has been introduced,
a technique that divides both the model and its kv cache across multiple processing units, enabling
them to operate in parallel. Following a phase of parallel computation, a collective communication
step is undertaken to consolidate global information. However, this approach introduces a serial
dependency between computation and communication phases, leading to substantial underutilization
of computational resources during the communication intervals.
To address this issue, there are primarily two mainstream solutions, as illustrated in Figure 1 (b/c).
The first solution involves segmenting the matrix computation adjacent to communication into
multiple blocks to facilitate overlap [6, 7]. However, in numerous cases, the communication duration
exceeds that of the matrix operation, resulting in incomplete overlap. Concurrently overlapping
matrix computations with communication can introduce considerable additional computational costs.
In specific scenarios, these extra costs may even lead to the strategy yielding negative returns.
The second approach entails grouping multiple requests into two micro-batches and executing
computation-communication overlap between these micro-batches[5]. On one hand, it necessitates
waiting for at least two requests; on the other hand, the two requests must be as balanced as possible
to achieve effective overlap. Simultaneously, while this method can enhance overall throughput, it
also results in increased latency for individual requests.

Figure 1: Overview of (a) Original pipeline, (b) Gemm overlap, (c) Request overlap, (d) ISO overlap.

Addressing the previously mentioned challenges, we introduce a strategy that groups micro-batches
within a sequence, facilitating overlap between the two, as depicted in Figure 1 (d). We refer to
this method as ISO (Intra-Sequence Overlap). This approach entails dividing the sequence into
two parts, prioritizing the computation of the sequence-dependent component (the attention part)
before tackling the non-sequence-dependent section. The only requirement is to preserve the order of
attention calculations between the two micro-batches.
As with other computation-communication overlap strategies, the decode phase involves too little
computation and communication for overlap to be beneficial. Therefore, our primary
focus is on the prefill stage. We evaluated models of approximately 30B and 70B on both the 4090
and A800 platforms. The time consumption was reduced by approximately 35% on the 4090 and
around 15% on the A800.
In summary, our main contributions include the following points:

• We propose a novel computation-communication overlap method for LLM inference;

• We tested our strategy on different model sizes and machine types, achieving better results
than existing methods;

• We also discussed the scenarios where this method is applicable and how to optimize it for
different scenarios.

2 Background

2.1 LLM Inference

Autoregressive decoding: LLM inference request processing consists of two distinct phases: a prefill
phase followed by a decode phase. The prefill phase processes the entire user input prompt and
produces the first output token. Subsequently, the decode phase generates output tokens one at a time
wherein the token generated in the previous step is passed through the model to generate the next
token until a special end-of-sequence token is generated. Note that the decode phase requires access
to all the keys and values associated with all the previously processed tokens to perform the attention
operation. To avoid repeated recomputation, contemporary LLM inference systems store activations
in KV-cache[8].
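
The two phases can be summarized in a short sketch. The snippet below is a minimal, illustrative two-phase generation loop; the `model(ids, kv_cache)` interface is an assumption made for illustration and is not taken from any particular framework.

```python
import torch

def generate(model, prompt_ids, eos_id, max_new_tokens=128):
    # Prefill: process the whole prompt once and populate the KV cache.
    logits, kv_cache = model(prompt_ids, kv_cache=None)
    next_token = int(torch.argmax(logits[-1]))
    output = [next_token]

    # Decode: feed one token at a time, reusing cached keys/values instead
    # of recomputing attention over the full prefix.
    for _ in range(max_new_tokens - 1):
        if next_token == eos_id:
            break
        logits, kv_cache = model(torch.tensor([next_token]), kv_cache=kv_cache)
        next_token = int(torch.argmax(logits[-1]))
        output.append(next_token)
    return output
```
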
Tensor Parallelism (TP) is a technique used to improve the efficiency of training and running Large
Language Models (LLMs) by dividing the model’s parameters into smaller segments for parallel
processing across multiple devices. This approach reduces memory consumption and enables the
use of larger models by distributing computational tasks among several GPUs. A key aspect of
tensor parallelism is the integration of inter-card communication mechanisms, which are crucial
for maintaining data consistency and coherence across distributed computational resources. During
the parallel processing of model components such as attention blocks and multi-layer perceptrons,
communication operations are included to ensure consistent results, even when computed across
separate devices. Frameworks like Megatron-LM [10] utilize tensor parallelism, including these
communication protocols, to optimize the performance and scalability of LLMs. By efficiently
managing data exchange between GPUs, tensor parallelism supports the development and deployment
of more powerful language models through improved parallel computation and memory utilization.
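
As a concrete illustration, the sketch below shows a row-parallel linear layer in the Megatron-LM spirit: each rank multiplies its input shard by its weight shard, and a blocking all-reduce consolidates the partial results. It assumes a torch.distributed process group is already initialized; the class name and initialization are illustrative only.

```python
import torch
import torch.distributed as dist

class RowParallelLinear(torch.nn.Module):
    """Each rank holds a slice of the input dimension; partial outputs are
    summed across ranks with an all-reduce."""

    def __init__(self, in_features, out_features):
        super().__init__()
        world_size = dist.get_world_size()
        assert in_features % world_size == 0
        # Only this rank's shard of the weight matrix is stored locally.
        self.weight = torch.nn.Parameter(
            torch.empty(out_features, in_features // world_size))
        torch.nn.init.normal_(self.weight, std=0.02)

    def forward(self, x_shard):
        # Local partial matmul on this rank's input shard ...
        partial = torch.nn.functional.linear(x_shard, self.weight)
        # ... followed by the collective that serializes with computation:
        # no useful compute runs on this stream while the all-reduce is in flight.
        dist.all_reduce(partial, op=dist.ReduceOp.SUM)
        return partial
```
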
Chunked-prefills: The chunked prefill method is engineered to enhance the efficiency of the decoding
phase and mitigate delays associated with prolonged prefill computations. As described in SARATHI
[2], this approach divides a prefill request into equally sized computational chunks, enabling the
prefill phase of a prompt to be processed across several decoding iterations (each utilizing a subset of
the prompt tokens). This strategy facilitates the creation of multiple decode-maximal batches from a
single prefill request, thereby optimizing the extent to which decodings can leverage the prefill.
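
A minimal sketch of this idea is shown below; it reuses the hypothetical `model(ids, kv_cache)` interface from the earlier snippet and simply walks the prompt in fixed-size chunks, so each chunk attends to all previously cached tokens.

```python
def chunked_prefill(model, prompt_ids, chunk_size):
    kv_cache = None
    logits = None
    for start in range(0, len(prompt_ids), chunk_size):
        chunk = prompt_ids[start:start + chunk_size]
        # Earlier chunks have already written their keys/values, so this
        # chunk's attention sees the full prefix without recomputation.
        logits, kv_cache = model(chunk, kv_cache=kv_cache)
    # The logits of the final chunk yield the first output token.
    return logits, kv_cache
```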

2.2 Overlap of Computation and Communication

In large language model (LLM) inference, "Overlap of Computation and Communication" is a crucial
optimization technique aimed at enhancing performance and efficiency. This approach addresses the
inherent latency between computation and data transmission in distributed systems.
gemm overlap[6, 7]: The LLM architecture comprises stacked transformer blocks, each integrating
attention mechanisms (comprising QKV matrices, attention computations, and an o_proj matrix)
along with MLP components (featuring up/gate matrices, activation functions, and down matrices).
Both segments incorporate collective communication processes. The core strategy for optimizing
performance involves aligning the execution of specific matrix operations with communication
tasks: in the attention component, the o_proj matrix computation is overlapped with collective
communication; similarly, in the MLP segment, the down matrix operation is synchronized with
communication activities. This approach aims to enhance efficiency by minimizing idle periods
during data exchange across the distributed system.
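
The sketch below illustrates the general pattern for one such matrix: the o_proj GEMM is split into blocks, and each block's all-reduce is issued asynchronously so it can run while the next block is computed. Stream handling is simplified and the function is illustrative; it is not the implementation used in [6, 7].

```python
import torch
import torch.distributed as dist

def o_proj_with_overlap(x, weight, num_blocks=4):
    """Block-wise GEMM with per-block asynchronous all-reduce (assumes an
    initialized NCCL process group and CUDA tensors)."""
    compute_stream = torch.cuda.current_stream()
    comm_stream = torch.cuda.Stream()
    outputs, handles = [], []
    for x_blk in x.chunk(num_blocks, dim=0):
        y_blk = x_blk @ weight                  # matmul on the compute stream
        outputs.append(y_blk)
        with torch.cuda.stream(comm_stream):
            # Make the collective wait for this block's matmul, then let it
            # overlap with the next block's matmul on the compute stream.
            comm_stream.wait_stream(compute_stream)
            handles.append(dist.all_reduce(y_blk, async_op=True))
    for h in handles:
        h.wait()
    return torch.cat(outputs, dim=0)
```
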
requests overlap[5]: This is an optimization technique tailored for managing multiple requests in parallel
computing settings. It consolidates incoming requests into a unified batch, which is then subdivided
into two smaller micro-batches. These micro-batches are intentionally crafted to share overlapping
content, enabling one batch to perform computational operations while the other handles communica-
tion tasks. By cyclically swapping the functions of these micro-batches between computation and
communication, the method ensures efficient use of system resources. This results in diminished idle
periods and an uplift in the system’s overall throughput. The strategy ingeniously capitalizes on the
synergistic relationship between computational and communicational processes, thereby boosting the
responsiveness and scalability of distributed systems.
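
Schematically, and assuming hypothetical `compute` / `communicate_async` hooks on each layer, the interleaving looks like the following; the real system in [5] implements this at a much finer granularity.

```python
def layer_with_request_overlap(layer, micro_batch_0, micro_batch_1):
    """Sketch only: layer.compute / layer.communicate_async are hypothetical
    hooks, and wait() is assumed to return the communicated tensor."""
    # Micro-batch 0 computes, then its collective is launched asynchronously.
    handle_0 = layer.communicate_async(layer.compute(micro_batch_0))
    # Micro-batch 1's computation overlaps with micro-batch 0's collective.
    handle_1 = layer.communicate_async(layer.compute(micro_batch_1))
    # Both collectives must complete before the next layer consumes the results.
    return handle_0.wait(), handle_1.wait()
```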

3 ISO
One significant challenge in achieving optimal overlap between matrix computations and commu-
nication lies in the fact that, in numerous situations, communication times frequently surpass the
duration of adjacent matrix operations, thereby hindering ideal synchronization. At the request level,
the prerequisite for multiple concurrent requests may not always be met, particularly in scenarios
characterized by low traffic. Furthermore, variations in the computational and communicational
demands across different requests can often result in less than perfect overlap. Such overlaps among
multiple requests may, in some cases, extend the processing time for individual requests. In light of
these challenges, we introduce ISO as a proposed solution.

3.1 Overlap strategy

Figure 1 depicts our strategy for dividing and overlapping tasks, drawing inspiration from the concept
of chunked prefill. This approach involves segmenting a single request into multiple parts along the
sequence dimension. Each prefill operation computes only one segment, leveraging the kv cache
stored from preceding chunk computations without introducing errors. We split the request into two
chunks, facilitating computational and communicational overlap between them. It is essential to
maintain the order of computation for chunks within the attention section of the same layer, allowing
the second half of the sequence to initiate attention calculations once the first half has completed
writing to the kv cache.
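
A sketch of the resulting per-layer schedule follows. The `layer.attention`, `layer.mlp`, and `layer.all_reduce_async` calls are hypothetical stand-ins for the fused kernels and NCCL collectives in a real engine; the one hard constraint encoded here is that chunk 0 must finish writing its keys/values before chunk 1 runs attention.

```python
def transformer_layer_iso(layer, seq, kv_cache):
    half = seq.shape[0] // 2
    chunk0, chunk1 = seq[:half], seq[half:]

    # Chunk 0: attention writes its keys/values to the cache, then its
    # collective is launched asynchronously.
    attn0 = layer.attention(chunk0, kv_cache)
    h_attn0 = layer.all_reduce_async(attn0)

    # Chunk 1's attention overlaps with chunk 0's collective; it may read the
    # cache because chunk 0 has already written its entries.
    attn1 = layer.attention(chunk1, kv_cache)
    h_attn1 = layer.all_reduce_async(attn1)

    # Chunk 0's MLP overlaps with chunk 1's attention collective, and so on.
    mlp0 = layer.mlp(h_attn0.wait())
    h_mlp0 = layer.all_reduce_async(mlp0)
    mlp1 = layer.mlp(h_attn1.wait())
    h_mlp1 = layer.all_reduce_async(mlp1)

    return h_mlp0.wait(), h_mlp1.wait()
```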

3.2 Asymmetric optimization strategy

In practical applications, there may be an imbalance in the proportion of computation and communi-
cation. In response to this phenomenon, we have implemented some optimization strategies:

Figure 2: Optimization Strategy of (a) Communication dominates, (b) Computation dominates.

Communication dominates: As illustrated in Figure 2 (a), this scenario predominantly affects the
4090 platform, characterized by its subpar communication capabilities. To mitigate this issue, we
have optimized the communication data by converting float16 data to int8 through quantization tech-
niques. This approach has significantly reduced the communication proportion from approximately
75% to roughly 50%. Additionally, we are exploring various methods to enhance point-to-point
communication efficiency.
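
The sketch below shows one simple way to realize such quantized communication: quantize the local partial result to int8 with a per-tensor scale, exchange the int8 shards and scales with an all-gather, then dequantize and sum locally so the result matches an fp16 all-reduce up to quantization error. This illustrates the general idea only and is not necessarily the exact scheme used in our system; production implementations may prefer reduce-scatter-based variants.

```python
import torch
import torch.distributed as dist

def all_reduce_via_int8(partial_fp16):
    """Emulate an fp16 all-reduce with int8 traffic (per-tensor symmetric
    scale; assumes an initialized process group)."""
    scale = (partial_fp16.abs().max() / 127.0).clamp(min=1e-8).reshape(1)
    q = torch.clamp((partial_fp16 / scale).round(), -127, 127).to(torch.int8)

    world_size = dist.get_world_size()
    gathered_q = [torch.empty_like(q) for _ in range(world_size)]
    gathered_s = [torch.empty_like(scale) for _ in range(world_size)]
    dist.all_gather(gathered_q, q)      # int8 shards are half the fp16 size
    dist.all_gather(gathered_s, scale)

    # Dequantize each rank's contribution with its own scale and accumulate.
    out = torch.zeros_like(partial_fp16)
    for q_i, s_i in zip(gathered_q, gathered_s):
        out += q_i.to(torch.float16) * s_i
    return out
```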

Computation dominates: As shown in Figure 2 (b), this situation commonly arises on high-
performance GPUs, such as the A or H series, where the computation proportion often exceeds
75%, with longer prompts further increasing this percentage. Concurrently, when accounting for the
overlap between computation and communication, NCCL communication tends to consume extra
SMs, thereby prolonging the computation time. The impact on A800 ranges from 15% to 20%,
whereas it is negligible on the 4090 platform. Given the challenge of reducing the computation
proportion, our focus has been on minimizing the adverse effects of NCCL communication on
computation. Observations reveal that communication seldom surpasses the duration of the initial
matrix computation. In response, we have partitioned matrix computations into several segments
and employed multiple kernel launches. This strategy ensures that matrix computations can fully
exploit the computational power once the communication phase concludes. We are also exploring
more effective methods for overlapping gemm computations with communication, as mentioned in
Flux[4].
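
A minimal sketch of the segmentation is given below: a single large matmul is issued as several smaller kernel launches, so only the first segments contend with NCCL for SMs while the previous collective is still draining, and later segments run at full throughput. The segment count and splitting axis are illustrative.

```python
import torch

def segmented_matmul(x, weight, num_segments=4):
    outputs = []
    for x_seg in x.chunk(num_segments, dim=0):
        # One kernel launch per segment along the token dimension.
        outputs.append(x_seg @ weight)
    return torch.cat(outputs, dim=0)
```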

4 Evaluation
4.1 Experiment Settings

GPU Types, Models, and Baselines In our experiments, we utilized both the 4090 and A800 GPUs
to evaluate dense models with sizes of 30B (MHA) and 70B (GQA). The primary distinction from
the baseline configuration lies in the enablement of ISO, while all other settings remain identical. Our
quantization approach entails utilizing int8 for weights, kv cache, and gemm operations, alongside
float16 for activations. Notably, the 4090 GPU employs int8 for data transmission.

Metrics Our primary focus is on comparing the time consumption for the first token processing,
specifically for a batch size of 1. Given the sensitivity surrounding internal performance data, we
opt not to disclose exact numerical values. Instead, we present the improvements achieved as ratios,
providing insights into the relative enhancements without revealing proprietary details.

4.2 End-to-end Results

The percentage decrease in the prefill stage duration at different prompt lengths

GPU            model   1k     2k     4k     8k     16k    32k    64k    128k
4090 4 cards   30b     38%    42%    43%    44%    47%    48%    -      -
4090 4 cards   70b     43%    44%    45%    46%    47%    46%    -      -
4090 8 cards   30b     11%    10%    18%    21%    30%    33%    36%    -
4090 8 cards   70b     14%    19%    22%    23%    35%    42%    39%    -
A800 4 cards   30b     0%     8%     18%    11%    12%    9%     10%    5%
A800 4 cards   70b     -6%    2%     8%     10%    9%     8%     8%     3%
A800 8 cards   30b     8%     24%    22%    20%    16%    25%    11%    10%
A800 8 cards   70b     3%     9%     14%    15%    16%    15%    14%    7%

Table 1: The speedup ratio for the prefill stage duration.

In real-world operating environments, scenarios involving 1k/2k inputs with a batch size of 1 are
uncommon. Generally, batching is utilized, and our main focus is on prompt lengths that exceed
4k. As shown in Table 1, the average improvement observed on the 4090 platform is approximately
35%, while the A800 platform experiences an improvement of about 15%. These results stem from
our preliminary optimization efforts, and we continue to explore optimizations for less favorable
scenarios. The substantial benefit on the 4090 with four cards is attributed to the balanced distribution
of computation and communication achieved after adopting int8 for communication. However,
when scaling to eight cards on the 4090, the communication ratio escalates, particularly for short
prompt lengths. In contrast, the A800 exhibits relatively modest gains due to the dominant role
of computation. The concurrent execution of computation and communication can hinder gemm
operations, thus limiting the potential improvements. Shorter prompt lengths lead to greater losses
from splitting, whereas longer prompts result in a lower communication proportion. We have also
experimented with overlapping communication and matrix computations on the A800, yielding
marginal gains of 2%-5% and even negative gains on the 4090. In all tested scenarios, ISO surpasses
this approach. Regarding request-level overlap, we have yet to implement it, but theoretically, ISO
would offer superior performance by enabling a more uniform distribution of micro-batches across
the same requests.

5 Conclusion

We introduce a novel method for overlapping computation and communication, which achieves a
reduction of approximately 35% in the duration of the prefill stage on the 4090 platform and about
15% on the A800 platform, thereby showcasing the efficacy of the ISO method. Concurrently, we
conduct an analysis and comparison of existing approaches for computation-communication overlap,
identifying that our methodology outperforms these conventional techniques.

6 Discussion

The Challenge of Imbalanced Computation and Communication Proportions


The ideal scenario for maximizing the benefits of overlapping computation and communication is
when both components have approximately equal shares. To elaborate further, it is essential to achieve
a balance between the attention computation segment and the communication segment, as well as
between the MLP computation segment and the communication segment. If there is an imbalance,
the overall duration will be determined by the longest segment among these, leading to a decrease in
the potential benefits gained from overlapping these processes.
However, in the case of the A800, computation constitutes a significantly larger portion, and reducing
this proportion presents a considerable challenge. It is evident that expanding to 8 cards enhances the
effectiveness of ISO. Conversely, the 4090 exhibits a disproportionately large share of communication,
which is comparatively easier to mitigate. As previously mentioned, strategies such as quantization
or optimizing point-to-point communication at the driver level can be employed. These two scenarios
represent rather extreme cases, and newer chips may lie somewhere in between, generally yielding
positive gains from ISO.
During our testing, we identified a novel form of imbalance in certain situations. For the A800, both
attention and MLP computations far exceed communication, whereas for the 4090, communication
surpasses both attention and MLP computations. However, in some specific instances, the time
consumed by communication falls between that of attention and MLP. Although this issue is not
pronounced in the current context, it is likely to become more significant as the 4090 undergoes
optimization. To address this, we propose several potential solutions.
One approach tackles the imbalance within attention itself, where the computational demand of the
latter half of the sequence markedly exceeds that of the former half. In such cases, we can adjust the
sequence segmentation strategy, for example, by having the first micro-batch compute 60% of the
sequence length and the second compute 40%. Another solution addresses the imbalance between
attention and MLP. Typically, attention computations are more time-consuming, and we can balance
the computations of attention and MLP through more intricate micro-batch segmentation strategies,
as illustrated in Figure 3.

Figure 3: Adaptive Attention and MLP Imbalance Splitting Strategy.
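
As a concrete illustration of the first adjustment, the helper below picks the boundary that equalizes causal-attention cost between the two chunks under a simple quadratic cost model (processing tokens [0, k) costs roughly k^2/2 and tokens [k, s) roughly (s^2 - k^2)/2). Both the model and the function are illustrative; the exact ratio in practice depends on prompt length and kernel behavior, as in the 60%/40% example above, and a production scheduler would also fold in MLP cost and measured kernel times.

```python
def balanced_attention_split(seq_len):
    # Solve k^2 / 2 == (seq_len^2 - k^2) / 2  =>  k = seq_len / sqrt(2),
    # i.e. the first micro-batch takes ~70% of the tokens so that both
    # micro-batches do a comparable amount of attention work.
    k = round(seq_len / 2 ** 0.5)
    return k, seq_len - k

# Example: a 32768-token prompt would split into roughly 23170 + 9598 tokens.
print(balanced_attention_split(32768))
```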

Benefits for the Decode Stage
Given that the computational and communication loads during the decoding phase are relatively
modest, the advantages of overlapping computation and communication are correspondingly limited
and may even result in negative returns. Nevertheless, speculative sampling could potentially offer
benefits on the 4090 with 4 cards, a direction we are actively investigating. The primary rationale
behind this is that speculative sampling involves a greater number of input tokens, thereby increasing
the relative computational volume.

References
[1] ChatGPT: Optimizing Language Models for Dialogue, 2022. https://openai.com/blog/
chatgpt/.
[2] Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, and
Ramachandran Ramjee. SARATHI: Efficient LLM Inference by Piggybacking Decodes with
Chunked Prefills. arXiv preprint arXiv:2308.16369, 2023.
[3] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal,
Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel
Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M.
Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz
Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec
Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In
Advances in Neural Information Processing Systems, pages 1877–1901, 2020.
[4] Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun
Zhang, Zuquan Song, Ziheng Jiang, Haibin Lin, Xin Jin, and Xin Liu. Flux: Fast software-based
communication overlap on gpus through kernel fusion, 2024.
[5] Jiangsu Du, Jinhui Wei, Jiazhi Jiang, Shenggan Cheng, Dan Huang, Zhiguang Chen, and Yutong
Lu. Liger: Interleaving Intra- and Inter-Operator Parallelism for Distributed Large Model
Inference. https://dl.acm.org/doi/10.1145/3627535.3638466, 2024.
[6] Abhinav Jangda, Jun Huang, Guodong Liu, Amir Hossein Nodehi Sabet, Saeed Maleki, Youshan
Miao, Madanlal Musuvathi, Todd Mytkowicz, and Olli Saarikivi. Breaking the Computation
and Communication Abstraction Barrier in Distributed Machine Learning Workloads. arXiv
preprint arXiv:2105.05720, 2022.
[7] Suchita Pati, Shaizeen Aga, Mahzabeen Islam, Nuwan Jayasena, and Matthew D. Sinclair. T3:
Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives. arXiv
preprint arXiv:2401.16677, 2024.
[8] Reiner Pope, Sholto Douglas, et al. Efficiently scaling transformer inference. In Proceedings
of Machine Learning and Systems 5, 2023.
[9] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever.
Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[10] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan
Catanzaro. Megatron-LM: Training multi-billion parameter language models using GPU model
parallelism. arXiv preprint arXiv:1909.08053, 2019.
