ModelTC
diff --git a/_posts/2025-01-22-cudagraph.md b/_posts/2025-01-22-cudagraph.md
Lines changed: 1 addition & 4 deletions
diff --git a/_posts/2025-02-16-lightllm-v0.1.0.md b/_posts/2025-02-16-lightllm-v0.1.0.md
Lines changed: 29 additions & 0 deletions
diff --git a/assets/images/blogs/02-lightllm250216/lightllm-performace.png b/assets/images/blogs/02-lightllm250216/lightllm-performace.png
12.8 KB
diff --git a/assets/images/blogs/02-lightllm250216/lightllm.png b/assets/images/blogs/02-lightllm250216/lightllm.png
175 KB
@@ -1,12 +1,9 @@
 ---
 title: Reducing Overhead with Cuda Graph
 categories:
-- Feature
+- By MTC Team
 excerpt: |
   Cuda Graph is used to reduce overhead in LightLLM.
-feature_text: |
-  ## Reducing Overhead with Cuda Graph
-  Cuda Graph optimizes operations by packaging kernel launches, Tensor allocations, and similar tasks into a computational graph. 
 ---
 
 Cuda Graph optimizes operations by packaging kernel launches, Tensor allocations, and similar tasks into a computational graph. This graph allows for direct replay of operations, significantly reducing the overhead of repeated execution. While such overhead is negligible during the computation-intensive prefill phase, it becomes more pronounced during the decode phase.
 
@@ -0,0 +1,29 @@
+---
+title: "LightLLM v4.0.0: Minimal Inter-Process Communication Overhead, Fastest DeepSeek-R1 Serving Performance on Single H200, and Prototype Support for PD Separation"
+categories:
+- By MTC Team
+excerpt: |
+  We are delighted to announce the release of LightLLM v4.0.0.
+---
+
+We are delighted to announce the release of LightLLM v4.0.0. After a year of continuous effort, we have comprehensively upgraded the LightLLM architecture. We implemented a request object that is accessible across processes, which significantly reduces inter-process communication overhead, especially in high-concurrency scenarios. We have also optimized DeepSeek-R1 in depth, achieving state-of-the-art performance among current open-source frameworks on a single H200 machine. Finally, we introduce a prototype implementation of a PD-separation architecture.
+
+### Framework
+
+The new LightLLM framework is shown in the figure below. We retained the three-process architecture of previous versions and designed a request object that can be accessed across processes via ctypes. Request metadata is stored in shared memory, so only a minimal amount of necessary data is communicated between processes. We also fold scheduling and model inference together, which nearly eliminates the communication overhead between the scheduler and modelrpc that existed in the previous version of the router. In addition, we implemented the CacheTensorManager class, which takes over the allocation and release of Torch tensors within the framework. This maximizes cross-layer tensor sharing at runtime, as well as memory sharing between different CUDA graphs. On an 8x80GB H100 machine with the DeepSeek-v2 model, LightLLM can run 200 CUDA graphs concurrently without running out of memory (OOM). We will publish a series of blog posts introducing the LightLLM architecture in more detail.
+
+{% include relative-figure.html image="/assets/images/blogs/02-lightllm250216/lightllm.png" %}
+
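To make the idea of a request object shared through ctypes more concrete, here is a minimal sketch using only the Python standard library. It is not LightLLM's actual implementation: the `ReqMeta` structure, its field names, and `demo_shared_request` are hypothetical, and both handles live in one process purely to keep the example short. In a real deployment the second handle would be opened by a different process using the segment name.

```python
import ctypes
from multiprocessing import shared_memory


class ReqMeta(ctypes.Structure):
    """Hypothetical fixed-layout request metadata (illustrative fields only)."""
    _fields_ = [
        ("request_id", ctypes.c_int64),
        ("input_len", ctypes.c_int32),
        ("generated_len", ctypes.c_int32),
        ("finished", ctypes.c_bool),
    ]


def demo_shared_request():
    # "Producer" side: create a segment large enough for one ReqMeta.
    owner = shared_memory.SharedMemory(create=True, size=ctypes.sizeof(ReqMeta))
    peer = None
    req = view = None
    try:
        # Map the structure directly onto the shared buffer (no copy).
        req = ReqMeta.from_buffer(owner.buf)
        req.request_id = 7
        req.input_len = 1024
        req.generated_len = 0
        req.finished = False

        # "Consumer" side: attach by name and update the same bytes in place.
        peer = shared_memory.SharedMemory(name=owner.name)
        view = ReqMeta.from_buffer(peer.buf)
        view.generated_len += 1
        view.finished = True

        # The producer's view observes the consumer's writes.
        return (req.request_id, req.input_len, req.generated_len, req.finished)
    finally:
        # Drop the ctypes views first; close() fails while exports exist.
        del req, view
        if peer is not None:
            peer.close()
        owner.close()
        owner.unlink()
```

In this arrangement only the segment name has to travel between processes; the metadata itself is read and updated in place.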
+
+### Optimization for DeepSeek
+Because Prefill is compute-intensive and Decode is memory-intensive, we implemented distinct optimizations for the DeepSeek MLA in each stage. During the Prefill stage, we decompress the KV cache, while in the Decode stage, we compress the query (q) to achieve optimal performance. Additionally, we leveraged OpenAI's Triton to implement high-performance Decode MLA and fused MoE kernels.
+
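The two MLA strategies above produce the same attention scores, which a small pure-Python sketch can illustrate. It assumes the usual MLA formulation in which each key is an up-projection `W_uk` of a cached latent vector; the function names, the tiny dimensions, and the random data are all hypothetical, and this is not the Triton kernel used in LightLLM.

```python
import random


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def scores_decompress_kv(q, w_uk, latents):
    """Prefill-style: expand every cached latent into a full key, then score."""
    return [dot(q, matvec(w_uk, c)) for c in latents]


def scores_compress_q(q, w_uk, latents):
    """Decode-style: project the query into latent space once, then score
    directly against the cached latents without expanding them."""
    q_latent = matvec(transpose(w_uk), q)
    return [dot(q_latent, c) for c in latents]


def demo_mla_scores(seed=0, head_dim=4, latent_dim=3, seq_len=5):
    rng = random.Random(seed)
    # w_uk maps a latent vector (latent_dim) to a key (head_dim).
    w_uk = [[rng.uniform(-1, 1) for _ in range(latent_dim)] for _ in range(head_dim)]
    q = [rng.uniform(-1, 1) for _ in range(head_dim)]
    latents = [[rng.uniform(-1, 1) for _ in range(latent_dim)] for _ in range(seq_len)]
    return scores_decompress_kv(q, w_uk, latents), scores_compress_q(q, w_uk, latents)
```

Both paths compute `q · (W_uk c)`, only in a different order. The decode-style path never materializes full keys, so it reads less data per cached token, which is one plausible reason it suits the memory-bound Decode stage.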
+
+
+### Performance
+The figure below compares the performance of LightLLM, sglang==0.4.3, vllm==0.7.2, and trtllm==0.17.0 on a single H200 machine, using DeepSeek-R1 (num_clients = 100). The input length of the test data is 1024, and the output length follows a Gaussian distribution with a mean of 128. LightLLM achieves better performance than the other three frameworks.
+
+{% include relative-figure.html image="/assets/images/blogs/02-lightllm250216/lightllm-performance.png" %}
+
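For readers who want to approximate the workload described above, the following sketch generates request shapes with a fixed input length of 1024 and Gaussian output lengths with a mean of 128. The standard deviation is not stated in the post, so the value of 32 is an assumption, as are the function names and the dictionary keys; this is not the benchmark script used for the figure.

```python
import random


def sample_output_lengths(num_requests, mean=128.0, stddev=32.0, seed=0):
    """Draw output lengths from a Gaussian; stddev=32 is an assumed value."""
    rng = random.Random(seed)
    lengths = []
    for _ in range(num_requests):
        n = round(rng.gauss(mean, stddev))
        lengths.append(max(1, n))  # a request must generate at least one token
    return lengths


def build_workload(num_requests, input_len=1024, seed=0):
    """Return one dict per request with hypothetical keys."""
    return [
        {"input_len": input_len, "max_new_tokens": n}
        for n in sample_output_lengths(num_requests, seed=seed)
    ]
```

Fixing the seed keeps the sampled lengths identical across runs, so different frameworks can be compared on the same set of requests.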
+### Acknowledgment
+We learned a lot from the following projects while developing LightLLM: [vLLM](https://github.com/vllm-project/vllm), [sglang](https://github.com/sgl-project/sglang), and [OpenAI Triton](https://github.com/openai/triton). We also warmly welcome contributions from the open-source community to help improve LightLLM.