Commit 84d29f6

optimize blog
1 parent 9c0ae06 commit 84d29f6

8 files changed (+56 −9 lines)


.DS_Store (0 bytes, binary file not shown)

_posts/2025-06-15-pre3.md

Lines changed: 56 additions & 9 deletions
@@ -10,6 +10,11 @@ mathjax: true
The rapid rise of Large Language Models (LLMs) has amplified the demand for efficient structured content generation, from code to structured data formats like JSON. While existing methods for generating outputs that conform to specific grammars (such as LR(1) grammars) have enabled impressive capabilities, they often introduce significant runtime overhead, particularly in large-batch inference scenarios. This post introduces **Pre$^3$**, a novel approach that leverages Deterministic Pushdown Automata (DPDA) to revolutionize the speed and efficiency of structured LLM generation.

<div style="text-align: center;">
<img src="{{ site.baseurl }}/assets/images/blogs/03-pre3/constrained_decoding_00.png" style="zoom: 30%;" />
<p style="font-family: sans-serif; font-size: 0.9em; color: #555;">Structured generation means the LLM produces formatted data (JSON, SQL, code) instead of free text, enforcing strict syntax rules for machine-readable output.</p>
</div>

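
As background, constrained decoding is commonly implemented by masking the logits of grammar-violating tokens before sampling. A minimal sketch of the idea (the helper names and toy grammar state are invented for illustration, not Pre$^3$'s actual API):

```python
import math

def mask_logits(logits, vocab, is_valid_next):
    """Set logits of grammar-violating tokens to -inf so they can't be sampled."""
    return [
        logit if is_valid_next(token) else -math.inf
        for token, logit in zip(vocab, logits)
    ]

# Toy grammar state: right after '{', a JSON object allows only '"' or '}'.
vocab = ['{', '}', '"', 'x']
logits = [1.0, 2.0, 3.0, 4.0]
masked = mask_logits(logits, vocab, lambda t: t in {'"', '}'})
```

The hard part, as the rest of this post discusses, is computing the set of valid next tokens quickly enough that this masking does not dominate decoding time.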
### The Bottleneck of Current Structured LLM Generation
Current state-of-the-art methods for constrained LLM decoding typically parse LR(1) grammars into Pushdown Automata (PDAs). While effective, this process incurs substantial runtime overhead for context-dependent token processing and is especially inefficient under large inference batches. This sequential, context-aware validation becomes a significant bottleneck, akin to navigating a complex maze one step at a time, checking every turn.
@@ -21,17 +26,48 @@ Current state-of-the-art methods for constrained LLM decoding typically involve
### Pre$^3$: A Paradigm Shift with Deterministic Pushdown Automata
Pre$^3$ addresses these efficiency limitations by exploiting the power of **Deterministic Pushdown Automata (DPDA)**. The core innovation lies in transforming the traditional, often non-deterministic, PDA-based decoding into a highly optimized, deterministic process. The workflow of Pre$^3$ is illustrated in the figure below.

<div style="text-align: center;">
<img src="{{ site.baseurl }}/assets/images/blogs/03-pre3/workflow_00.png" style="zoom: 60%;" />
<p style="font-family: sans-serif; font-size: 0.9em; color: #555;">The Pre³ workflow: LR(1) grammars are converted into DPDAs by precomputing all transitions, replacing runtime exploration with efficient table lookup.</p>
</div>
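
To make the conversion step concrete, here is a heavily simplified sketch (the toy action table, symbols, and helper names are invented for illustration and are not Pre$^3$'s actual data structures): an LR(1) action table is flattened into DPDA edges, so each runtime transition becomes a single table lookup instead of a search.

```python
# Toy LR(1) action table: (state, lookahead symbol) -> parser action.
lr1_actions = {
    (0, '('): ('shift', 1),
    (1, 'x'): ('shift', 2),
    (2, ')'): ('reduce', 'E', 3),  # reduce 3 symbols to nonterminal E
}

def build_edges(actions):
    """Precompute DPDA edges: shifts become acceptance edges that push a
    state; reduces become reduction edges that pop a fixed-length prefix."""
    edges = {}
    for (state, symbol), action in actions.items():
        if action[0] == 'shift':
            edges[(state, symbol)] = ('push', action[1])
        else:  # ('reduce', nonterminal, pop_count)
            edges[(state, symbol)] = ('pop', action[2], action[1])
    return edges

edges = build_edges(lr1_actions)
```

The key property is that `edges` is computed once per grammar, before any generation happens, so the per-token cost at decode time is a dictionary lookup.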

#### 1. Precomputation of Prefix-Conditioned Edges
<div style="display: flex; flex-direction: row;">
<div style="width: 50%; text-align: center;">
<img src="{{ site.baseurl }}/assets/images/blogs/03-pre3/prefix_conditioned_edge_00.png" style="zoom: 23%;" />
<p style="font-family: sans-serif; font-size: 0.9em; color: #555;">Precomputed transitions keyed on stack prefixes and input symbols eliminate runtime ambiguity during parsing.</p>
</div>
<div style="width: 50%; text-align: center;">
<img src="{{ site.baseurl }}/assets/images/blogs/03-pre3/automaton_construction_00.png" style="zoom: 33%;" />
<p style="font-family: sans-serif; font-size: 0.9em; color: #555;">The DPDA is generated with two types of edges (acceptance edges and reduction edges) through graph traversal.</p>
</div>
</div>

One of Pre$^3$'s key strengths is its **preprocessing stage**, where it precomputes "prefix-conditioned edges". This anticipatory analysis offers several advantages:
* **Ahead-of-Time Analysis:** Unlike reactive validation, Pre$^3$ proactively analyzes all possible grammar transitions before the LLM even begins generating. This is analogous to pre-planning all possible routes and turns on a journey before setting off.

<div style="text-align: center;">
<img src="{{ site.baseurl }}/assets/images/blogs/03-pre3/break_cycles_00.png" style="zoom: 40%;" />
<p style="font-family: sans-serif; font-size: 0.9em; color: #555;">Pre³ uses a back-edge to break cycles, so only one reduction edge is required.</p>
</div>

* **Break Cycles:** A raw DPDA would repeatedly reduce the same sequence of symbols, creating an infinite number of reduction edges. By adding a back-edge that uses match and pop operations, only one reduction edge is required.
* **Reduce-reduce and shift-reduce conflicts (collectively known as cycle issues) are a common phenomenon when parsing LR(1) grammars.** A concrete example of this phenomenon is depicted in the figure below.

<div style="text-align: center;">
<img src="{{ site.baseurl }}/assets/images/blogs/03-pre3/cycle_issue_00.png" style="zoom: 20%;" />
<p style="font-family: sans-serif; font-size: 0.9em; color: #555;">An example of the cycle issue when an LR(1) item-set automaton receives repeated "ac" input.</p>
</div>

* **Our solution can be illustrated with the example in the figure above:** the pushdown automaton has an infinite cycle between States 1, 2, 3, and 4, leading to an unbounded number of possible paths and indeterminable transition paths when reduction edges are added at State 5. The back-edge from State 4 to State 1 is therefore modified to check the stack for complete cycle-traversal information (e.g., [1, 2, 3, 4]); if detected, it pops the redundant states, ensuring that reduction edges at State 5 only need to account for cycle-free traversals.
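
The match-and-pop back-edge can be sketched as follows (a toy model using the state numbers from the figure, not the paper's implementation): whenever the states pushed onto the stack end with one complete cycle traversal, the back-edge pops that traversal, so the stack never accumulates repeated cycles.

```python
CYCLE = [1, 2, 3, 4]  # states forming the cycle in the toy automaton

def push_state(stack, state):
    """Push a state; at the back-edge, match a full cycle on top and pop it."""
    stack = stack + [state]
    if stack[-len(CYCLE):] == CYCLE:  # match: one complete traversal recorded
        stack = stack[:-len(CYCLE)]   # pop: discard the redundant traversal
    return stack

stack = [0]
for s in CYCLE * 2:  # traverse the cycle twice
    stack = push_state(stack, s)
# stack returns to [0]: reduction edges never see cycle repetitions
```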

<div style="text-align: center;">
<img src="{{ site.baseurl }}/assets/images/blogs/03-pre3/parallel_check_00.png" style="zoom: 40%;" />
<p style="font-family: sans-serif; font-size: 0.9em; color: #555;">Enabling parallel transition processing with LightLLM.</p>
</div>


* **Enabling Parallel Transition Processing:** The deterministic nature of DPDAs, combined with precomputation, allows for parallel processing of transitions. In a non-deterministic PDA, multiple choices might exist for a given state, input, and stack top, necessitating sequential exploration. DPDA's determinism, however, ensures a unique next step, paving the way for parallel computations.
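
Because each configuration has exactly one successor, a whole batch can be advanced with independent table lookups and no backtracking. A minimal sketch (a hypothetical transition table, not the LightLLM integration itself):

```python
# Precomputed deterministic transition table: (state, symbol) -> next state.
edges = {(0, 'a'): 1, (1, 'b'): 2, (0, 'b'): 3}

def step_batch(states, symbols):
    """Advance every sequence in the batch by one transition; each lookup is
    independent of the others, so this loop is trivially parallelizable."""
    return [edges[(state, sym)] for state, sym in zip(states, symbols)]

next_states = step_batch([0, 1, 0], ['a', 'b', 'b'])
```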

@@ -41,7 +77,10 @@ One of Pre$^3$'s key strengths is its **preprocessing stage**, where it precompu

These conditions guarantee that at most one transition is available in any situation, making the automaton deterministic. This determinism is precisely what empowers Pre$^3$ to perform effective precomputation and parallel processing.

<div style="text-align: center;">
<img src="{{ site.baseurl }}/assets/images/blogs/03-pre3/preprocess_00.png" style="zoom: 40%;" />
<p style="font-family: sans-serif; font-size: 0.9em; color: #555;">Pre³ converts an LR(1) grammar into a DPDA by precomputing transitions, replacing runtime exploration with efficient table lookup.</p>
</div>
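
The determinism condition itself reduces to a simple check over the transition relation (a sketch using an invented tuple encoding, for illustration only): no (state, input symbol, stack top) key may admit more than one successor.

```python
from collections import Counter

def is_deterministic(transitions):
    """transitions: iterable of (state, input_symbol, stack_top, next_state)."""
    keys = Counter((s, a, top) for s, a, top, _ in transitions)
    return all(count == 1 for count in keys.values())

dpda = [(0, 'a', 'Z', 1), (0, 'b', 'Z', 2)]   # unique successor everywhere
npda = [(0, 'a', 'Z', 1), (0, 'a', 'Z', 2)]   # two choices for one key
```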

#### 2. Streamlining Automata Structure

@@ -60,8 +99,10 @@ Pre$^3$ was evaluated against several state-of-the-art and popular structure gen
To assess the improvement in decoding efficiency, the per-step decoding overhead was measured, defined as the difference between grammar-based decoding time and original decoding time. Experiments were conducted using Meta-Llama-3-8B and Meta-Llama-2-70B models with JSON and chain-of-thought grammars.
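
The metric can be sketched as follows (a toy harness with stand-in workloads, not the paper's benchmark code): time an average decoding step with and without the grammar check, and report the difference.

```python
import time

def avg_step_time(step_fn, steps=200):
    """Average wall-clock time of one decoding step."""
    start = time.perf_counter()
    for _ in range(steps):
        step_fn()
    return (time.perf_counter() - start) / steps

def per_step_overhead(constrained_step, plain_step):
    """Per-step decoding overhead: constrained time minus original time."""
    return avg_step_time(constrained_step) - avg_step_time(plain_step)

# Stand-ins: the constrained step does strictly more work than the plain one.
overhead = per_step_overhead(lambda: sum(range(5000)), lambda: sum(range(100)))
```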

* **Key Finding:** Pre$^3$ consistently introduces less overhead than previous SOTA systems, outperforming Outlines and llama.cpp and maintaining a consistent advantage over XGrammar. For instance, at batch size 512 on Meta-Llama-3-8B, XGrammar's per-step latency reached 147.64 ms, versus 92.23 ms for unconstrained decoding. This performance gap widened with increasing batch sizes.

<div style="text-align: center;">
<img src="{{ site.baseurl }}/assets/images/blogs/03-pre3/overhead_00.png" style="zoom: 30%;" />
<p style="font-family: sans-serif; font-size: 0.9em; color: #555;">Evaluation of per-step decoding efficiency.</p>
</div>

#### Large Batch Inference Efficiency and Real-world Deployment Throughput

@@ -71,9 +112,15 @@ To assess real-world performance, the throughput of Pre$^3$ and XGrammar was com

* **Key Finding:** Pre$^3$ consistently outperformed XGrammar in all scenarios, achieving latency reductions of up to 30%. The advantage was more pronounced at larger batch sizes, demonstrating Pre$^3$'s scalability. In real-world serving, Pre$^3$ also achieved up to 20% higher throughput than XGrammar at higher concurrency levels.

<div style="text-align: center;">
<img src="{{ site.baseurl }}/assets/images/blogs/03-pre3/table.png" style="zoom: 100%;" />
<p style="font-family: sans-serif; font-size: 0.9em; color: #555;">Evaluation of batch decoding efficiency.</p>
</div>


<div style="text-align: center;">
<img src="{{ site.baseurl }}/assets/images/blogs/03-pre3/serving_00.png" style="zoom: 100%;" />
<p style="font-family: sans-serif; font-size: 0.9em; color: #555;">Benchmark under production-like serving conditions.</p>
</div>

### Conclusion

