
Commit 60e29ec

committed
add lightllm blog 20250216
1 parent a2fc8ac commit 60e29ec

File tree

4 files changed, +30 -4 lines changed


_posts/2025-01-22-cudagraph.md

Lines changed: 1 addition & 4 deletions
@@ -1,12 +1,9 @@
 ---
 title: Reducing Overhead with Cuda Graph
 categories:
-  - Feature
+  - By MTC Team
 excerpt: |
   Cuda Graph is used to reduce overhead in LightLLM.
-feature_text: |
-  ## Reducing Overhead with Cuda Graph
-  Cuda Graph optimizes operations by packaging kernel launches, Tensor allocations, and similar tasks into a computational graph.
 ---

 Cuda Graph optimizes operations by packaging kernel launches, Tensor allocations, and similar tasks into a computational graph. This graph allows for direct replay of operations, significantly reducing the overhead of repeated execution. While such overhead is negligible during the computation-intensive prefill phase, it becomes more pronounced during the decode phase.
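To make the mechanism concrete, here is a minimal sketch of capture and replay using PyTorch's public CUDA Graph API; the linear model, shapes, and warm-up loop are illustrative only, not LightLLM's actual decode path.

```python
# Minimal capture-and-replay sketch with PyTorch's CUDA Graph API.
# The model, shapes, and warm-up loop are illustrative only.
import torch

model = torch.nn.Linear(4096, 4096).cuda().eval()
static_input = torch.zeros(1, 4096, device="cuda")

# Warm up on a side stream so allocations settle before capture.
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(side)

# Capture one forward step into a graph; subsequent replays skip
# per-kernel launch overhead entirely.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = model(static_input)

# New data is written into the captured input buffer, then replayed.
static_input.copy_(torch.randn(1, 4096, device="cuda"))
graph.replay()
print(static_output.shape)
```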
Lines changed: 29 additions & 0 deletions

@@ -0,0 +1,29 @@
---
title: "LightLLM v4.0.0: Minimal Inter-Process Communication Overhead, Fastest DeepSeek-R1 Serving Performance on a Single H200, and Prototype Support for PD Separation"
categories:
  - By MTC Team
excerpt: |
  We are delighted to announce the release of LightLLM v4.0.0.
---
We are delighted to announce the release of LightLLM v4.0.0. After a year of continuous effort, we have comprehensively upgraded the LightLLM architecture. We implemented a request object that can be accessed across processes, significantly reducing inter-process communication overhead, especially in high-concurrency scenarios. We have also carried out in-depth optimization for DeepSeek-R1, achieving state-of-the-art performance among current open-source frameworks on a single H200 machine. Furthermore, we propose a prototype implementation of PD (prefill-decode) separation.
### Framework

The new framework of LightLLM is shown in the figure below. We have retained the previous three-process architecture and designed a request object that can be accessed across processes via ctypes. The metadata of each request is stored in shared memory, ensuring that only a minimal amount of necessary data is communicated between processes. Additionally, we have folded scheduling and model inference together, nearly eliminating the communication overhead between the scheduler and modelrpc that existed in the previous version of the router. We also implemented the CacheTensorManager class, which takes over the allocation and release of Torch tensors within the framework. This maximizes the cross-layer sharing of tensors at runtime, as well as memory sharing between different CUDA graphs. On an 8x80GB H100 machine with the DeepSeek-v2 model, LightLLM can run 200 CUDA graphs concurrently without running out of memory (OOM). We will publish a series of follow-up blog posts introducing the LightLLM architecture.
{% include relative-figure.html image="/assets/images/blogs/02-lightllm250216/lightllm.png" %}
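As a rough illustration of the shared-memory request object described above, the sketch below uses Python's ctypes and multiprocessing.shared_memory. The ReqMeta structure, its fields, and the segment name are hypothetical rather than LightLLM's real classes, and both ends are shown in one process for brevity.

```python
# Hypothetical sketch of request metadata shared across processes via a
# ctypes structure laid out in shared memory; not LightLLM's real classes.
import ctypes
from multiprocessing import shared_memory

class ReqMeta(ctypes.Structure):
    # Illustrative fields; a real request object would carry more metadata.
    _fields_ = [
        ("req_id", ctypes.c_int64),
        ("prompt_len", ctypes.c_int32),
        ("generated_len", ctypes.c_int32),
        ("finished", ctypes.c_bool),
    ]

# "Producer" side: create the segment and write metadata in place.
shm = shared_memory.SharedMemory(create=True, size=ctypes.sizeof(ReqMeta),
                                 name="lightllm_req_demo")
req = ReqMeta.from_buffer(shm.buf)
req.req_id, req.prompt_len = 7, 1024

# "Consumer" side (normally another process): attach by name, so only the
# segment name travels over IPC, never the metadata itself.
peer = shared_memory.SharedMemory(name="lightllm_req_demo")
view = ReqMeta.from_buffer(peer.buf)
print(view.req_id, view.prompt_len)

# Release the exported buffers before closing the segments.
del req, view
peer.close()
shm.close()
shm.unlink()
```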
### Optimization on DeepSeek

Because Prefill (compute-intensive) and Decode (memory-intensive) have different computational characteristics, we apply distinct optimizations to the DeepSeek MLA. During the Prefill stage we decompress the KV cache, while in the Decode stage we compress the query (q) to achieve optimal performance. Additionally, we leveraged OpenAI's Triton to implement high-performance Decode MLA and fused MoE kernels.
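One common reading of "compressing the query" is the weight-absorption identity for MLA: scoring decompressed keys against q gives the same result as scoring the latent cache against q folded through the key up-projection. A toy numerical check of that identity, with illustrative shapes and names rather than LightLLM code, is sketched below.

```python
# Toy check of the weight-absorption identity: (C @ W_uk) @ q == C @ (W_uk @ q).
# Shapes and tensor names are illustrative; this is not LightLLM code.
import torch

d_latent, d_head, seq_len = 512, 128, 16
c_kv = torch.randn(seq_len, d_latent)   # compressed (latent) KV cache entries
w_uk = torch.randn(d_latent, d_head)    # key up-projection for one head
q = torch.randn(d_head)                 # one decode-step query head

# Prefill-style: decompress the cache into full keys, then score.
scores_decompressed = (c_kv @ w_uk) @ q

# Decode-style: absorb the up-projection into the query once, then score
# directly against the small latent cache.
q_absorbed = w_uk @ q                   # shape: (d_latent,)
scores_absorbed = c_kv @ q_absorbed

assert torch.allclose(scores_decompressed, scores_absorbed, rtol=1e-4, atol=1e-3)
```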
### Performance

The figure below shows a performance comparison of LightLLM, sglang==0.4.3, vllm==0.7.2, and trtllm==0.17.0 running DeepSeek-R1 on a single H200 machine (num_clients = 100). The input length of the test data is 1024, and the output length follows a Gaussian distribution with a mean of 128. LightLLM achieves the best performance.
{% include relative-figure.html image="/assets/images/blogs/02-lightllm250216/lightllm-performance.png" %}
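For context on the workload shape, a rough sketch of synthesizing such requests is given below; the benchmark client itself is not shown, and the standard deviation and clamping policy are assumptions, since the post only states the mean output length.

```python
# Rough sketch of a workload matching the description above: fixed 1024-token
# inputs, Gaussian output lengths with mean 128, 100 concurrent clients.
# The standard deviation and clamping policy are assumptions.
import random

NUM_CLIENTS = 100
INPUT_LEN = 1024
OUTPUT_MEAN, OUTPUT_STD = 128, 32  # std not stated in the post; assumed here

def sample_request() -> dict:
    output_len = max(1, int(random.gauss(OUTPUT_MEAN, OUTPUT_STD)))
    return {"prompt_tokens": INPUT_LEN, "max_new_tokens": output_len}

requests = [sample_request() for _ in range(NUM_CLIENTS)]
print(requests[:3])
```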
### Acknowledgment

We learned a lot from the following projects while developing LightLLM: [vLLM](https://github.com/vllm-project/vllm), [sglang](https://github.com/sgl-project/sglang), and [OpenAI Triton](https://github.com/openai/triton). We also warmly welcome the open-source community to help improve LightLLM.
Two binary image assets added under assets/images/blogs/02-lightllm250216/ (12.8 KB and 175 KB).
