Conversation

tdoublep
Member

@tdoublep tdoublep commented Sep 18, 2025

Purpose

Resolves #25194

The call to tl.cumsum in _chunk_cumsum_fwd_kernel produces numerically different output depending on whether BLOCK_H=1 or BLOCK_H>1. We suspect it uses a different implementation in the former case (e.g., a parallel reduction along the sequence dimension), whereas in the latter case it may instead rely on parallelism across the head dimension.
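
To illustrate why two implementations of the same cumulative sum can disagree, here is a minimal NumPy sketch. It is not Triton's implementation of tl.cumsum (the description above only suspects what differs); the function names and input values are hypothetical, chosen so that float32 rounding makes a sequential scan and a tree-style parallel scan return different results.

```python
import numpy as np


def sequential_cumsum(x):
    """Inclusive prefix sum, accumulating left to right in float32."""
    out = np.empty_like(x)
    acc = np.float32(0.0)
    for i, v in enumerate(x):
        acc = np.float32(acc + v)
        out[i] = acc
    return out


def tree_cumsum(x):
    """Inclusive prefix sum using a Hillis-Steele style scan.

    Partial sums are combined in a different order than in the
    sequential version, as a parallel implementation might do.
    """
    out = x.copy()
    shift = 1
    while shift < len(out):
        prev = out.copy()
        out[shift:] = prev[shift:] + prev[:-shift]
        shift *= 2
    return out


# Hypothetical input: large values that cancel, mixed with small ones.
x = np.array([1e8, 1.0, -1e8, 1.0], dtype=np.float32)

seq = sequential_cumsum(x)
par = tree_cumsum(x)

print(seq.tolist())  # [100000000.0, 100000000.0, 0.0, 1.0]
print(par.tolist())  # [100000000.0, 100000000.0, 0.0, 0.0]
```

Both are legitimate float32 evaluations of the same prefix sum; the last element differs only because floating-point addition is not associative, so the order in which partial sums are combined changes the rounding.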

The auto-tuner can pick a different configuration on each run because of fluctuations in measured runtime; whenever it happens to choose BLOCK_H=1, the user sees numerically different output.
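
The selection effect can be sketched with a toy model in plain Python. This is not Triton's autotuner: pick_fastest, without_block_h_1, and the timing numbers are hypothetical, and the sketch only assumes that the tuner keeps whichever configuration measured fastest.

```python
def pick_fastest(timings_ms):
    """Return the config key with the lowest measured time."""
    return min(timings_ms, key=timings_ms.get)


def without_block_h_1(timings_ms):
    """Drop the BLOCK_H=1 entry, mirroring the change proposed in this PR."""
    return {h: t for h, t in timings_ms.items() if h != 1}


# Hypothetical timings (ms) keyed by BLOCK_H, from two separate tuning runs.
# The small differences stand in for run-to-run measurement noise.
run_a = {1: 0.101, 2: 0.103, 4: 0.110}
run_b = {1: 0.104, 2: 0.102, 4: 0.110}

choice_a = pick_fastest(run_a)  # picks BLOCK_H=1
choice_b = pick_fastest(run_b)  # picks BLOCK_H=2

fixed_a = pick_fastest(without_block_h_1(run_a))
fixed_b = pick_fastest(without_block_h_1(run_b))

print(choice_a, choice_b)  # 1 2
print(fixed_a, fixed_b)    # 2 2
```

With BLOCK_H=1 removed, the tuner may still switch among the remaining configurations from run to run, but according to the description above the numerical difference appears only between BLOCK_H=1 and BLOCK_H>1.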

We propose removing this configuration from the list to eliminate this source of non-determinism for Mamba-based models.

Test Plan

Script from #25194

Test Result

The script now produces the same output on every run.



Co-authored-by: Chih-Chieh-Yang <[email protected]>
Signed-off-by: Thomas Parnell <[email protected]>
@tdoublep tdoublep changed the title from "Remove BLOCK_H=1 from list of tuneable configurations." to "[Kernel] [Mamba] Remove BLOCK_H=1 from list of tuneable configurations for _chunk_cumsum_fwd_kernel" Sep 18, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses a numerical non-determinism issue in the _chunk_cumsum_fwd_kernel Triton kernel. The problem arises because tl.cumsum produces different results when BLOCK_SIZE_H=1 compared to when it's greater than 1. By removing triton.Config({'BLOCK_SIZE_H': 1}) from the autotuner's configuration list, this change effectively prevents the selection of the problematic configuration, thus ensuring deterministic output. This is a correct and well-justified fix for the issue.

@mgoin mgoin added the "kernel" and "ready" (ONLY add when PR is ready to merge/full CI is needed) labels Sep 23, 2025
@mgoin mgoin enabled auto-merge (squash) September 23, 2025 21:58
@mgoin mgoin merged commit 5e25b12 into vllm-project:main Sep 23, 2025
56 of 57 checks passed
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
…s for `_chunk_cumsum_fwd_kernel` (vllm-project#25197)

Signed-off-by: Thomas Parnell <[email protected]>
Co-authored-by: Chih-Chieh-Yang <[email protected]>
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
…s for `_chunk_cumsum_fwd_kernel` (#25197)

Signed-off-by: Thomas Parnell <[email protected]>
Co-authored-by: Chih-Chieh-Yang <[email protected]>
Signed-off-by: yewentao256 <[email protected]>
gjc0824 pushed a commit to gjc0824/vllm that referenced this pull request Oct 10, 2025
…s for `_chunk_cumsum_fwd_kernel` (vllm-project#25197)

Signed-off-by: Thomas Parnell <[email protected]>
Co-authored-by: Chih-Chieh-Yang <[email protected]>
Signed-off-by: gaojc <[email protected]>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
…s for `_chunk_cumsum_fwd_kernel` (vllm-project#25197)

Signed-off-by: Thomas Parnell <[email protected]>
Co-authored-by: Chih-Chieh-Yang <[email protected]>
Signed-off-by: xuebwang-amd <[email protected]>
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025
…s for `_chunk_cumsum_fwd_kernel` (vllm-project#25197)

Signed-off-by: Thomas Parnell <[email protected]>
Co-authored-by: Chih-Chieh-Yang <[email protected]>
Labels

kernel, ready (ONLY add when PR is ready to merge/full CI is needed)

Development

Successfully merging this pull request may close these issues.

[Bug]: vLLM produces non-deterministic output due to Triton autotuner

2 participants