[TPU] Fix tpu structured decoding in mixed batches #24458
Conversation
Signed-off-by: Chenyaaang <[email protected]>
Code Review
This pull request fixes an indexing issue for structured decoding on TPUs when mixed batches of guided and non-guided requests are present. The change correctly maps the compact `grammar_bitmask` to the requests in the batch. However, I've identified a critical issue with how this function handles chunked prefills (sub-batching), which can lead to incorrect behavior.
+ sorted_struct_requests = sorted(
+     scheduler_output.structured_output_request_ids.items(),
+     key=lambda item: item[1])
+ cumulative_mask_idx = 0
+ for req_id, _ in sorted_struct_requests:
+     if req_id not in self.input_batch.req_id_to_index:
+         continue
      batch_index = self.input_batch.req_id_to_index[req_id]
-     struct_out_indices.append(batch_index)
-     mask_indices.append(mask_index)
- self.grammar_bitmask_cpu[struct_out_indices] = torch.from_numpy(
-     grammar_bitmask[mask_indices])
- # It's not guaranteed that all requests in this batch require
- # structured output, so create a bool tensor to represent
- # the requests that need structured output.
- struct_out_indices = torch.tensor(struct_out_indices, dtype=torch.long)
- self.require_structured_out_cpu[struct_out_indices] = True
+     self.grammar_bitmask_cpu[batch_index] = torch.from_numpy(
+         grammar_bitmask[cumulative_mask_idx])
+     # It's not guaranteed that all requests in this batch require
+     # structured output, so create a bool tensor to represent
+     # the requests that need structured output.
+     self.require_structured_out_cpu[batch_index] = True
+     cumulative_mask_idx += 1
This implementation does not correctly handle chunked prefills (sub-batching). The `execute_model` method processes requests in chunks, but this function seems to assume it's always processing the first chunk.

The `batch_index` is an absolute index within the full `self.input_batch`, but the `logits` tensor and the returned bitmask tensors are relative to the current sub-batch. When processing a sub-batch that is not the first one (i.e., when `start_index > 0` in `execute_model`), `batch_index` will be larger than the sub-batch size. However, the function returns `self.grammar_bitmask_cpu[:num_reqs]`, which is a slice from the beginning of the buffer. This will result in an incorrect (likely all-zero) bitmask for requests in subsequent chunks, leading to incorrect structured decoding.

To fix this, the function needs to be aware of the current sub-batch's `start_index` and `end_index` and use relative indexing for the output tensors.
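A minimal sketch of that suggestion (the free-function form, the `start_index` parameter, and the argument names are assumptions for illustration; the real method lives on the model runner and works on `self.` buffers):

```python
import torch

def prepare_structured_decoding_input(logits, grammar_bitmask,
                                      structured_output_request_ids,
                                      req_id_to_index,
                                      grammar_bitmask_cpu,
                                      require_structured_out_cpu,
                                      start_index):
    """Sub-batch-aware sketch: `start_index` is the offset of the current
    chunk within the full input batch; `logits` covers only the chunk."""
    num_reqs, _ = logits.shape
    end_index = start_index + num_reqs

    # Reset the pre-allocated buffers before filling them.
    grammar_bitmask_cpu.zero_()
    require_structured_out_cpu.fill_(False)

    sorted_struct_requests = sorted(structured_output_request_ids.items(),
                                    key=lambda item: item[1])
    cumulative_mask_idx = 0
    for req_id, _ in sorted_struct_requests:
        if req_id not in req_id_to_index:
            continue
        batch_index = req_id_to_index[req_id]
        # Each guided request in the batch owns one row of the compact
        # bitmask, whether or not it falls inside the current chunk.
        mask_idx = cumulative_mask_idx
        cumulative_mask_idx += 1
        if not (start_index <= batch_index < end_index):
            continue  # belongs to another chunk
        # Index relative to the chunk so the [:num_reqs] slice returned
        # below lines up with `logits`.
        rel_index = batch_index - start_index
        grammar_bitmask_cpu[rel_index] = torch.from_numpy(
            grammar_bitmask[mask_idx])
        require_structured_out_cpu[rel_index] = True

    return (grammar_bitmask_cpu[:num_reqs].to(logits.device),
            require_structured_out_cpu[:num_reqs].to(logits.device))
```

The key design point is that the cumulative index into the compact bitmask still advances for every guided request in the batch, while the write only happens for requests that fall inside the current chunk.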
LGTM, thanks for the fix!
[TPU] Fix structured decoding in mixed batches. This should fix the TPU v1 test part 2 soft fail.

When there are guided and non-guided requests in the same batch, `scheduler_output.grammar_bitmask` contains entries only for the guided requests. Sort the guided requests and use a cumulative index into the compact bitmask to avoid an index-out-of-bounds error.
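As a toy illustration of why the cumulative index is needed (a self-contained sketch; the request ids, batch layout, and mask width are made up):

```python
import numpy as np
import torch

# Mixed batch: requests at batch indices 0 and 3 are guided, indices
# 1 and 2 are not. The scheduler's compact bitmask therefore has only
# 2 rows, one per guided request.
structured_output_request_ids = {"req-a": 0, "req-b": 1}  # req_id -> mask order
req_id_to_index = {"req-a": 0, "plain-1": 1, "plain-2": 2, "req-b": 3}
vocab_mask_words = 4  # vocab_size / 32, toy value
grammar_bitmask = np.arange(2 * vocab_mask_words,
                            dtype=np.int32).reshape(2, vocab_mask_words)

num_reqs = 4
grammar_bitmask_cpu = torch.zeros(num_reqs, vocab_mask_words,
                                  dtype=torch.int32)
require_structured_out_cpu = torch.zeros(num_reqs, dtype=torch.bool)

# Indexing the compact bitmask with the batch index would try to read
# grammar_bitmask[3] for "req-b" -> index out of bounds. Instead, walk
# the guided requests in mask order and use a cumulative index.
sorted_struct_requests = sorted(structured_output_request_ids.items(),
                                key=lambda item: item[1])
cumulative_mask_idx = 0
for req_id, _ in sorted_struct_requests:
    if req_id not in req_id_to_index:
        continue
    batch_index = req_id_to_index[req_id]
    grammar_bitmask_cpu[batch_index] = torch.from_numpy(
        grammar_bitmask[cumulative_mask_idx])
    require_structured_out_cpu[batch_index] = True
    cumulative_mask_idx += 1

print(require_structured_out_cpu)  # tensor([ True, False, False,  True])
```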