
Commit 8f439c9

fix broken images
1 parent: 1416922

2 files changed: +8 -8 lines changed

_posts/2025-01-22-cudagraph.md

Lines changed: 2 additions & 2 deletions
@@ -50,12 +50,12 @@ In the case of DeepSeek-V2, even with larger batches (average input length of 10
The original kernel allocated thread blocks based on the number of attention heads and the batch size. This design caused load imbalance across thread blocks when processing batches of uneven lengths, hurting performance. A further issue was that request lengths change dynamically during decoding, so the size of the intermediate memory varied as well. To address this, LightLLM redesigned the decoding kernel for CUDA Graph around the concept of Virtual Stream Processors (VSM). In the new design, the number of thread blocks (the grid size) is fixed, and each request's context is divided into fixed-size blocks; every thread block iterates over these blocks, turning dynamically changing lengths into a fixed number of iterations, so intermediate memory usage depends only on the batch size and no longer needs to be pre-allocated. The fixed-size blocks also keep each thread block's load nearly balanced, improving performance on batches of uneven lengths (a minimal sketch of this scheduling idea follows this file's diff). Testing showed that the redesigned DeepSeek-V2 decoding kernel significantly outperforms the previous design in decoding speed for longer inputs, even at the same batch size and sequence lengths.

-![Rate](/assets/images/blogs/01-cudagraph/rate.png)
+<img src="{{ site.baseurl }}/assets/images/blogs/01-cudagraph/rate.png" style="zoom: 100%;" />

We also evaluated the scalability of the new kernel against the original implementation. Each test batch consisted of 128 requests of uniform length, with the length varied from 256 to 8192 across runs and outlier requests set to 8K. The results showed that the new kernel performed better overall and was barely affected by the outlier requests (which are significantly longer than the average length), making it more stable than the original kernel.

-![Scalibility](/assets/images/blogs/01-cudagraph/scalibility.png)
+<img src="{{ site.baseurl }}/assets/images/blogs/01-cudagraph/scalibility.png" style="zoom: 100%;" />

Experimental Environment:
GPU: Single NVIDIA H800
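
To make the scheduling idea concrete, here is a minimal Python sketch of the fixed-grid, fixed-chunk work assignment described in the hunk above. It models only the scheduling, not the attention computation; `GRID_SIZE` and `CHUNK` are assumed values, not LightLLM's actual parameters.

```python
# Illustrative sketch only, not LightLLM's actual CUDA kernel: it models
# the fixed-grid, fixed-chunk work assignment in plain Python.
import math

GRID_SIZE = 8   # fixed number of "thread blocks", independent of sequence length
CHUNK = 256     # fixed-size context block

def plan_work(seq_lens):
    """Split each request's context into fixed-size chunks and deal the
    chunks round-robin across a fixed grid, so every block performs a
    bounded number of iterations regardless of request length."""
    chunks = [
        (req, c)                          # (request index, chunk index)
        for req, n in enumerate(seq_lens)
        for c in range(math.ceil(n / CHUNK))
    ]
    return [chunks[b::GRID_SIZE] for b in range(GRID_SIZE)]

# A batch with one 8K outlier: per-block chunk counts differ by at most
# one, so the long request no longer serializes a single block.
schedule = plan_work([256, 512, 384, 8192])
print([len(block) for block in schedule])   # -> [5, 5, 5, 5, 5, 4, 4, 4]
```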

_posts/2025-06-15-pre3.md

Lines changed: 6 additions & 6 deletions
@@ -27,11 +27,11 @@ Pre$^3$ addresses these efficiency limitations by exploiting the power of **Dete
One of Pre$^3$'s key strengths is its **preprocessing stage**, where it precomputes "prefix-conditioned edges". This anticipatory analysis offers several advantages:

-<img src="/assets/images/blogs/03-pre3/automaton_construction_00.png" style="zoom: 40%;" />
+<img src="{{ site.baseurl }}/assets/images/blogs/03-pre3/automaton_construction_00.png" style="zoom: 40%;" />

* **Ahead-of-Time Analysis:** Unlike reactive validation, Pre$^3$ proactively analyzes all possible grammar transitions before the LLM even begins generating. This is analogous to pre-planning all possible routes and turns on a journey before setting off.

-<img src="/assets/images/blogs/03-pre3/overview_00.png" style="zoom: 100%;" />
+<img src="{{ site.baseurl }}/assets/images/blogs/03-pre3/overview_00.png" style="zoom: 100%;" />

* **Enabling Parallel Transition Processing:** The deterministic nature of DPDAs, combined with precomputation, allows transitions to be processed in parallel. In a non-deterministic PDA, multiple choices might exist for a given state, input, and stack top, forcing sequential exploration. A DPDA's determinism, however, guarantees a unique next step, paving the way for parallel computation (see the toy sketch after this hunk).
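To illustrate the property these bullets rely on, here is a toy Python sketch: in a DPDA, a (state, input symbol, stack top) triple determines at most one transition, so candidate tokens can be validated by independent table lookups. The balanced-parentheses grammar and all names are invented for illustration; this is not Pre$^3$'s code.

```python
# Toy sketch, not Pre^3's implementation: a DPDA transition table for
# balanced parentheses. Because (state, input, stack top) maps to at
# most one move, validating candidate tokens is a set of independent
# dictionary lookups, which is what makes parallel checking possible.

DELTA = {
    # (state, input, stack_top) -> (next_state, symbols replacing the top)
    ("q", "(", "Z"): ("q", ["Z", "A"]),
    ("q", "(", "A"): ("q", ["A", "A"]),
    ("q", ")", "A"): ("q", []),
}

def step(state, stack, ch):
    """Apply the unique transition, or return None if ch is illegal here."""
    move = DELTA.get((state, ch, stack[-1]))
    if move is None:
        return None
    next_state, push = move
    return next_state, stack[:-1] + push

def allowed(state, stack, vocab):
    """Each candidate is an independent lookup; the loop could run one
    worker per candidate token."""
    return {ch for ch in vocab if step(state, stack, ch) is not None}

state, stack = "q", ["Z"]
print(allowed(state, stack, "()"))      # {'('}: ')' is illegal at the start
state, stack = step(state, stack, "(")  # consume '('
print(allowed(state, stack, "()"))      # {'(', ')'}
```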

@@ -41,7 +41,7 @@ One of Pre$^3$'s key strengths is its **preprocessing stage**, where it precompu
These conditions guarantee that at most one transition is available in any situation, making the automaton deterministic. This determinism is precisely what empowers Pre$^3$ to perform effective precomputation and parallel processing.

-<img src="/assets/images/blogs/03-pre3/prefix_conditioned_edge_00.png" style="zoom: 20%;" />
+<img src="{{ site.baseurl }}/assets/images/blogs/03-pre3/prefix_conditioned_edge_00.png" style="zoom: 20%;" />

#### 2. Streamlining Automata Structure

@@ -61,7 +61,7 @@ To assess the improvement in decoding efficiency, the per-step decoding overhead
* **Key Finding:** Pre$^3$ consistently introduces less overhead than previous SOTA systems, outperforming Outlines and llama.cpp and maintaining a consistent advantage over XGrammar. For instance, at batch size 512 on Meta-Llama-3-8B, decoding with XGrammar took 147.64 ms per step versus 92.23 ms unconstrained (unconstrained is 37.5% lower). This performance gap widened with increasing batch sizes.

-<img src="/assets/images/blogs/03-pre3/overhead_00.png" style="zoom: 30%;" />
+<img src="{{ site.baseurl }}/assets/images/blogs/03-pre3/overhead_00.png" style="zoom: 30%;" />

#### Large Batch Inference Efficiency and Real-world Deployment Throughput

@@ -71,9 +71,9 @@ To assess real-world performance, the throughput of Pre$^3$ and XGrammar was com
* **Key Finding:** Pre$^3$ consistently outperformed XGrammar in all scenarios, achieving latency reductions of up to 30%, and its advantage was more pronounced at larger batch sizes, demonstrating its scalability. In real-world serving, Pre$^3$ also improved on XGrammar, achieving up to 20% higher throughput at higher concurrency levels.

-![Result2](/assets/images/blogs/03-pre3/table.png)
+<img src="{{ site.baseurl }}/assets/images/blogs/03-pre3/table.png" style="zoom: 100%;" />

-![Result3](/assets/images/blogs/03-pre3/serving_00.png)
+<img src="{{ site.baseurl }}/assets/images/blogs/03-pre3/serving_00.png" style="zoom: 100%;" />

### Conclusion
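
As a closing illustration, here is a hypothetical timing sketch of the per-step overhead metric used in the findings above; the `sleep()` calls are stand-ins for real forward passes, and nothing here reproduces the paper's benchmark harness.

```python
# Hypothetical timing sketch, not the paper's benchmark harness: it shows
# how per-step constrained-decoding overhead can be computed relative to
# an unconstrained baseline.
import time

def mean_step_ms(step_fn, n_steps=50):
    """Average wall-clock milliseconds per decode step."""
    t0 = time.perf_counter()
    for _ in range(n_steps):
        step_fn()
    return (time.perf_counter() - t0) / n_steps * 1e3

def unconstrained_step():
    time.sleep(0.0010)  # stand-in for one forward pass plus sampling

def constrained_step():
    time.sleep(0.0012)  # stand-in for forward pass, grammar mask, sampling

base = mean_step_ms(unconstrained_step)
masked = mean_step_ms(constrained_step)
print(f"per-step overhead: {masked - base:.3f} ms ({(masked - base) / base:.1%})")
```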
