[Kernel] [Mamba] Remove BLOCK_H=1 from list of tuneable configurations for `_chunk_cumsum_fwd_kernel` #25197

tdoublep · 2025-09-18T18:54:22Z

Purpose

Resolves #25194

The call to tl.cumsum in _chunk_cumsum_fwd_kernel produces numerically different output depending on whether BLOCK_H=1 or BLOCK_H>1. We suspect that a different implementation is used when BLOCK_H=1 (e.g., a parallel reduction along the sequence dimension), whereas for BLOCK_H>1 it may instead rely on parallelism across the head dimension.
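To illustrate why two cumulative-sum implementations can disagree at all, here is a minimal pure-Python sketch. It does not reproduce what tl.cumsum does internally (that part is only a suspicion above); it just compares a left-to-right accumulation with a Hillis-Steele style scan, which is a swapped-in stand-in for "some parallel implementation". The function names are hypothetical. Because floating-point addition is not associative, the two orders of operations can round differently.

```python
def sequential_cumsum(xs):
    """Inclusive prefix sum, accumulating strictly left to right."""
    out, total = [], 0.0
    for x in xs:
        total += x
        out.append(total)
    return out


def scan_cumsum(xs):
    """Inclusive prefix sum via a Hillis-Steele style scan (doubling offsets).

    Illustrative stand-in for a parallel scan; not Triton's implementation.
    """
    cur = list(xs)
    offset = 1
    while offset < len(cur):
        prev = cur
        # Each element adds the partial sum located `offset` positions to its left.
        cur = [
            prev[i] if i < offset else prev[i - offset] + prev[i]
            for i in range(len(prev))
        ]
        offset *= 2
    return cur


xs = [0.1, 0.2, 0.3]
seq = sequential_cumsum(xs)  # last element computed as (0.1 + 0.2) + 0.3
par = scan_cumsum(xs)        # last element computed as 0.1 + (0.2 + 0.3)

print(seq)
print(par)
print(seq == par)
```

Both results are valid roundings of the same mathematical sum; they differ only in the last bits, which is the kind of discrepancy that becomes visible when the chosen configuration changes between runs.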

The auto-tuner can naturally pick a different configuration on each run because of fluctuations in measured runtime, and if it happens to choose BLOCK_H=1 the user will see numerically different output.

We propose removing this configuration from the list to remove this source of non-determinism for mamba-based models.
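The effect of the proposal can be sketched without Triton. The snippet below is a toy model, not the vLLM code: the candidate list, the timings, and the `pick_fastest` helper are all made up for illustration. It mimics an autotuner that keeps whichever candidate measured fastest, and shows that two nearly tied candidates can swap places between runs.

```python
# Hypothetical candidate head-block sizes; not copied from the vLLM source.
CANDIDATE_BLOCK_H = [1, 2, 4, 8]


def pick_fastest(candidates, measured_us):
    """Mimic an autotuner: keep the candidate with the lowest measured time."""
    return min(candidates, key=lambda c: measured_us[c])


# Two made-up benchmark runs of the same kernel (microseconds per launch).
# BLOCK_H=1 and BLOCK_H=2 are nearly tied, so timing noise decides the winner.
run_a = {1: 10.0, 2: 10.3, 4: 12.0, 8: 14.0}
run_b = {1: 10.4, 2: 10.1, 4: 12.1, 8: 13.9}

picked_a = pick_fastest(CANDIDATE_BLOCK_H, run_a)
picked_b = pick_fastest(CANDIDATE_BLOCK_H, run_b)

# Dropping BLOCK_H=1 means it can never be selected, whatever the timings are.
without_one = [c for c in CANDIDATE_BLOCK_H if c != 1]
stable_a = pick_fastest(without_one, run_a)
stable_b = pick_fastest(without_one, run_b)

print(picked_a, picked_b)
print(stable_a, stable_b)
```

Note that removing the candidate does not force the tuner to pick the same configuration every time. It only guarantees that BLOCK_H=1 is never selected, which is sufficient if, as observed in this PR, all BLOCK_H>1 configurations give matching output.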

Test Plan

Script from #25194

Test Result

Now always produces the same output.

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Co-authored-by: Chih-Chieh-Yang <[email protected]> Signed-off-by: Thomas Parnell <[email protected]>

gemini-code-assist

Code Review

This pull request addresses a numerical non-determinism issue in the _chunk_cumsum_fwd_kernel Triton kernel. The problem arises because tl.cumsum produces different results when BLOCK_SIZE_H=1 compared to when it's greater than 1. By removing triton.Config({'BLOCK_SIZE_H': 1}) from the autotuner's configuration list, this change effectively prevents the selection of the problematic configuration, thus ensuring deterministic output. This is a correct and well-justified fix for the issue.

…s for `_chunk_cumsum_fwd_kernel` (vllm-project#25197) Signed-off-by: Thomas Parnell <[email protected]> Co-authored-by: Chih-Chieh-Yang <[email protected]>

…s for `_chunk_cumsum_fwd_kernel` (#25197) Signed-off-by: Thomas Parnell <[email protected]> Co-authored-by: Chih-Chieh-Yang <[email protected]> Signed-off-by: yewentao256 <[email protected]>

…s for `_chunk_cumsum_fwd_kernel` (vllm-project#25197) Signed-off-by: Thomas Parnell <[email protected]> Co-authored-by: Chih-Chieh-Yang <[email protected]> Signed-off-by: gaojc <[email protected]>

…s for `_chunk_cumsum_fwd_kernel` (vllm-project#25197) Signed-off-by: Thomas Parnell <[email protected]> Co-authored-by: Chih-Chieh-Yang <[email protected]> Signed-off-by: xuebwang-amd <[email protected]>


Remove BLOCK_H=1 from list of tuneable configurations.

f7fbaf2

Co-authored-by: Chih-Chieh-Yang <[email protected]> Signed-off-by: Thomas Parnell <[email protected]>

tdoublep changed the title ~~Remove BLOCK_H=1 from list of tuneable configurations.~~ [Kernel] [Mamba] Remove BLOCK_H=1 from list of tuneable configurations for _chunk_cumsum_fwd_kernel Sep 18, 2025

gemini-code-assist bot reviewed Sep 18, 2025


mgoin added the kernel and ready labels Sep 23, 2025

mgoin enabled auto-merge (squash) September 23, 2025 21:58

mgoin approved these changes Sep 23, 2025


mgoin merged commit 5e25b12 into vllm-project:main Sep 23, 2025
56 of 57 checks passed
