Commit 84d29f6

optimize blog
1 parent 9c0ae06 commit 84d29f6

8 files changed (+56 −9 lines)


.DS_Store (0 bytes, binary file not shown)

_posts/2025-06-15-pre3.md

Lines changed: 56 additions & 9 deletions
@@ -10,6 +10,11 @@ mathjax: true
The rapid rise of Large Language Models (LLMs) has amplified the demand for efficient structured content generation, from code to structured data formats like JSON. While existing methods for generating outputs that conform to specific grammars (such as LR(1) grammars) have enabled impressive capabilities, they often introduce significant runtime overhead, particularly in large-batch inference scenarios. This post introduces **Pre$^3$**, a novel approach that leverages Deterministic Pushdown Automata (DPDA) to revolutionize the speed and efficiency of structured LLM generation.

<div style="text-align: center;">
<img src="{{ site.baseurl }}/assets/images/blogs/03-pre3/constrained_decoding_00.png" style="zoom: 30%;" />
<p style="font-family: sans-serif; font-size: 0.9em; color: #555;">Structured generation means the LLM produces formatted data (JSON, SQL, code) instead of free text, enforcing strict syntax rules for machine-readable output.</p>
</div>

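
As background, constrained decoding is commonly implemented by masking the logits of grammar-violating tokens before sampling. A minimal sketch of the idea (the helper names and toy grammar state are invented for illustration, not Pre$^3$'s actual API):

```python
import math

def mask_logits(logits, vocab, is_valid_next):
    """Set logits of grammar-violating tokens to -inf so they can't be sampled."""
    return [
        logit if is_valid_next(token) else -math.inf
        for token, logit in zip(vocab, logits)
    ]

# Toy grammar state: right after '{', a JSON object allows only '"' or '}'.
vocab = ['{', '}', '"', 'x']
logits = [1.0, 2.0, 3.0, 4.0]
masked = mask_logits(logits, vocab, lambda t: t in {'"', '}'})
```

The hard part, as the rest of this post discusses, is computing the set of valid next tokens quickly enough that this masking does not dominate decoding time.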
### The Bottleneck of Current Structured LLM Generation
Current state-of-the-art methods for constrained LLM decoding typically parse LR(1) grammars into Pushdown Automata (PDAs). While effective, this process incurs substantial runtime overhead for context-dependent token processing and is especially inefficient under large inference batches. This sequential, context-aware validation becomes a significant bottleneck, akin to navigating a complex maze one step at a time, checking every turn.
@@ -21,17 +26,48 @@ Current state-of-the-art methods for constrained LLM decoding typically involve
### Pre$^3$: A Paradigm Shift with Deterministic Pushdown Automata
Pre$^3$ addresses these efficiency limitations by exploiting the power of **Deterministic Pushdown Automata (DPDA)**. The core innovation lies in transforming the traditional, often non-deterministic, PDA-based decoding into a highly optimized, deterministic process. The workflow of Pre$^3$ is illustrated in the figure below.

<div style="text-align: center;">
<img src="{{ site.baseurl }}/assets/images/blogs/03-pre3/workflow_00.png" style="zoom: 60%;" />
<p style="font-family: sans-serif; font-size: 0.9em; color: #555;">The Pre³ workflow: LR(1) grammars are converted into DPDAs by precomputing all transitions, replacing runtime exploration with efficient table lookup.</p>
</div>
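
To make the conversion step concrete, here is a heavily simplified sketch (the toy action table, symbols, and helper names are invented for illustration and are not Pre$^3$'s actual data structures): an LR(1) action table is flattened into DPDA edges, so each runtime transition becomes a single table lookup instead of a search.

```python
# Toy LR(1) action table: (state, lookahead symbol) -> parser action.
lr1_actions = {
    (0, '('): ('shift', 1),
    (1, 'x'): ('shift', 2),
    (2, ')'): ('reduce', 'E', 3),  # reduce 3 symbols to nonterminal E
}

def build_edges(actions):
    """Precompute DPDA edges: shifts become acceptance edges that push a
    state; reduces become reduction edges that pop a fixed-length prefix."""
    edges = {}
    for (state, symbol), action in actions.items():
        if action[0] == 'shift':
            edges[(state, symbol)] = ('push', action[1])
        else:  # ('reduce', nonterminal, pop_count)
            edges[(state, symbol)] = ('pop', action[2], action[1])
    return edges

edges = build_edges(lr1_actions)
```

The key property is that `edges` is computed once per grammar, before any generation happens, so the per-token cost at decode time is a dictionary lookup.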

#### 1. Precomputation of Prefix-Conditioned Edges
<div style="display: flex; flex-direction: row;">
<div style="width: 50%; text-align: center;">
<img src="{{ site.baseurl }}/assets/images/blogs/03-pre3/prefix_conditioned_edge_00.png" style="zoom: 23%;" />
<p style="font-family: sans-serif; font-size: 0.9em; color: #555;">Precomputed transitions keyed on stack prefixes and input symbols eliminate runtime ambiguity during parsing.</p>
</div>
<div style="width: 50%; text-align: center;">
<img src="{{ site.baseurl }}/assets/images/blogs/03-pre3/automaton_construction_00.png" style="zoom: 33%;" />
<p style="font-family: sans-serif; font-size: 0.9em; color: #555;">The DPDA is generated with two types of edges (acceptance edges and reduction edges) through graph traversal.</p>
</div>
</div>

One of Pre$^3$'s key strengths is its **preprocessing stage**, where it precomputes "prefix-conditioned edges". This anticipatory analysis offers several advantages:
* **Ahead-of-Time Analysis:** Unlike reactive validation, Pre$^3$ proactively analyzes all possible grammar transitions before the LLM even begins generating. This is analogous to pre-planning all possible routes and turns on a journey before setting off.

<div style="text-align: center;">
<img src="{{ site.baseurl }}/assets/images/blogs/03-pre3/break_cycles_00.png" style="zoom: 40%;" />
<p style="font-family: sans-serif; font-size: 0.9em; color: #555;">Pre³ uses a back-edge to break cycles, so only one reduction edge is required.</p>
</div>

* **Break Cycles:** A raw DPDA would repeatedly reduce the same sequence of symbols, creating an infinite number of reduction edges. By adding a back-edge that uses match and pop operations, only one reduction edge is required.
* **Reduce-reduce and shift-reduce conflicts (collectively known as cycle issues) are a common phenomenon when parsing LR(1) grammars.** A concrete example of this phenomenon is depicted in the figure below.

<div style="text-align: center;">
<img src="{{ site.baseurl }}/assets/images/blogs/03-pre3/cycle_issue_00.png" style="zoom: 20%;" />
<p style="font-family: sans-serif; font-size: 0.9em; color: #555;">An example of the cycle issue when an LR(1) item-set automaton receives repeated "ac" input.</p>
</div>

* **Our solution can be illustrated with the example in the figure above:** the pushdown automaton has an infinite cycle between States 1, 2, 3, and 4, leading to an unbounded number of possible paths and indeterminable transition paths when reduction edges are added at State 5. The back-edge from State 4 to State 1 is therefore modified to check the stack for complete cycle-traversal information (e.g., [1, 2, 3, 4]); if detected, it pops the redundant states, ensuring that reduction edges at State 5 only need to account for cycle-free traversals.
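
The match-and-pop back-edge can be sketched as follows (a toy model using the state numbers from the figure, not the paper's implementation): whenever the states pushed onto the stack end with one complete cycle traversal, the back-edge pops that traversal, so the stack never accumulates repeated cycles.

```python
CYCLE = [1, 2, 3, 4]  # states forming the cycle in the toy automaton

def push_state(stack, state):
    """Push a state; at the back-edge, match a full cycle on top and pop it."""
    stack = stack + [state]
    if stack[-len(CYCLE):] == CYCLE:  # match: one complete traversal recorded
        stack = stack[:-len(CYCLE)]   # pop: discard the redundant traversal
    return stack

stack = [0]
for s in CYCLE * 2:  # traverse the cycle twice
    stack = push_state(stack, s)
# stack returns to [0]: reduction edges never see cycle repetitions
```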

<div style="text-align: center;">
<img src="{{ site.baseurl }}/assets/images/blogs/03-pre3/parallel_check_00.png" style="zoom: 40%;" />
<p style="font-family: sans-serif; font-size: 0.9em; color: #555;">Enabling parallel transition processing with LightLLM.</p>
</div>


* **Enabling Parallel Transition Processing:** The deterministic nature of DPDAs, combined with precomputation, allows for parallel processing of transitions. In a non-deterministic PDA, multiple choices might exist for a given state, input, and stack top, necessitating sequential exploration. DPDA's determinism, however, ensures a unique next step, paving the way for parallel computations.
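
Because each configuration has exactly one successor, a whole batch can be advanced with independent table lookups and no backtracking. A minimal sketch (a hypothetical transition table, not the LightLLM integration itself):

```python
# Precomputed deterministic transition table: (state, symbol) -> next state.
edges = {(0, 'a'): 1, (1, 'b'): 2, (0, 'b'): 3}

def step_batch(states, symbols):
    """Advance every sequence in the batch by one transition; each lookup is
    independent of the others, so this loop is trivially parallelizable."""
    return [edges[(state, sym)] for state, sym in zip(states, symbols)]

next_states = step_batch([0, 1, 0], ['a', 'b', 'b'])
```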

@@ -41,7 +77,10 @@ One of Pre$^3$'s key strengths is its **preprocessing stage**, where it precompu

These conditions guarantee that at most one transition is available in any situation, making the automaton deterministic. This determinism is precisely what empowers Pre$^3$ to perform effective precomputation and parallel processing.

<div style="text-align: center;">
<img src="{{ site.baseurl }}/assets/images/blogs/03-pre3/preprocess_00.png" style="zoom: 40%;" />
<p style="font-family: sans-serif; font-size: 0.9em; color: #555;">Pre³ converts an LR(1) grammar into a DPDA by precomputing transitions, replacing runtime exploration with efficient table lookup.</p>
</div>
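
The determinism condition itself reduces to a simple check over the transition relation (a sketch using an invented tuple encoding, for illustration only): no (state, input symbol, stack top) key may admit more than one successor.

```python
from collections import Counter

def is_deterministic(transitions):
    """transitions: iterable of (state, input_symbol, stack_top, next_state)."""
    keys = Counter((s, a, top) for s, a, top, _ in transitions)
    return all(count == 1 for count in keys.values())

dpda = [(0, 'a', 'Z', 1), (0, 'b', 'Z', 2)]   # unique successor everywhere
npda = [(0, 'a', 'Z', 1), (0, 'a', 'Z', 2)]   # two choices for one key
```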

#### 2. Streamlining Automata Structure

@@ -60,8 +99,10 @@ Pre$^3$ was evaluated against several state-of-the-art and popular structure gen
To assess the improvement in decoding efficiency, the per-step decoding overhead was measured, defined as the difference between grammar-based decoding time and original decoding time. Experiments were conducted using Meta-Llama-3-8B and Meta-Llama-2-70B models with JSON and chain-of-thought grammars.
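
The metric can be sketched as follows (a toy harness with stand-in workloads, not the paper's benchmark code): time an average decoding step with and without the grammar check, and report the difference.

```python
import time

def avg_step_time(step_fn, steps=200):
    """Average wall-clock time of one decoding step."""
    start = time.perf_counter()
    for _ in range(steps):
        step_fn()
    return (time.perf_counter() - start) / steps

def per_step_overhead(constrained_step, plain_step):
    """Per-step decoding overhead: constrained time minus original time."""
    return avg_step_time(constrained_step) - avg_step_time(plain_step)

# Stand-ins: the constrained step does strictly more work than the plain one.
overhead = per_step_overhead(lambda: sum(range(5000)), lambda: sum(range(100)))
```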

* **Key Finding:** Pre$^3$ consistently introduces less overhead than previous SOTA systems, outperforming Outlines and llama.cpp and maintaining a consistent advantage over XGrammar. For instance, at batch size 512 on Meta-Llama-3-8B, XGrammar's per-step latency reached 147.64 ms, versus 92.23 ms for unconstrained decoding. This performance gap widened with increasing batch sizes.

<div style="text-align: center;">
<img src="{{ site.baseurl }}/assets/images/blogs/03-pre3/overhead_00.png" style="zoom: 30%;" />
<p style="font-family: sans-serif; font-size: 0.9em; color: #555;">Evaluation of per-step decoding efficiency.</p>
</div>

#### Large Batch Inference Efficiency and Real-world Deployment Throughput

@@ -71,9 +112,15 @@ To assess real-world performance, the throughput of Pre$^3$ and XGrammar was com

* **Key Finding:** Pre$^3$ consistently outperformed XGrammar in all scenarios, achieving latency reductions of up to 30%. The advantage was more pronounced at larger batch sizes, demonstrating Pre$^3$'s scalability. In real-world serving, Pre$^3$ also achieved up to 20% higher throughput than XGrammar at higher concurrency levels.

<div style="text-align: center;">
<img src="{{ site.baseurl }}/assets/images/blogs/03-pre3/table.png" style="zoom: 100%;" />
<p style="font-family: sans-serif; font-size: 0.9em; color: #555;">Evaluation of batch decoding efficiency.</p>
</div>


<div style="text-align: center;">
<img src="{{ site.baseurl }}/assets/images/blogs/03-pre3/serving_00.png" style="zoom: 100%;" />
<p style="font-family: sans-serif; font-size: 0.9em; color: #555;">Benchmark under production-like serving conditions.</p>
</div>

### Conclusion

