[Kernel] [Mamba] Remove BLOCK_H=1 from list of tuneable configurations for _chunk_cumsum_fwd_kernel
#25197
Purpose
Resolves #25194
The call to `tl.cumsum` in `_chunk_cumsum_fwd_kernel` produces numerically different output depending on whether `BLOCK_H=1` or `BLOCK_H>1`. We suspect it uses a different implementation in the former case (e.g., a parallel reduction in the sequence dimension), whereas in the latter case it may instead rely on parallelism in the head dimension.

The auto-tuner can naturally pick a different configuration each time due to fluctuations in runtime, but if it happens to choose `BLOCK_H=1`, the user will see numerically different output.

We propose removing this configuration from the list to remove this source of non-determinism for Mamba-based models.
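As a rough illustration of why two cumulative-sum implementations can disagree, the sketch below compares a left-to-right running sum with a parallel-style (Hillis-Steele) scan in plain Python. It is not the Triton code path: the function names `sequential_cumsum` and `scan_cumsum` are hypothetical, and the assumption that `tl.cumsum` switches between strategies like these is only the suspicion stated above. The point is just that floating-point addition is not associative, so a different summation order can change the low-order bits.

```python
def sequential_cumsum(xs):
    """Left-to-right running sum: ((x0 + x1) + x2) + ..."""
    out, total = [], 0.0
    for x in xs:
        total = total + x
        out.append(total)
    return out


def scan_cumsum(xs):
    """Hillis-Steele inclusive scan: combines partial sums at strides 1, 2, 4, ...

    Mathematically the same prefix sums as above, but the additions are
    grouped differently, which matters in floating point.
    """
    ys = list(xs)
    d = 1
    while d < len(ys):
        prev = list(ys)  # read from the previous step, as a parallel scan would
        for i in range(d, len(ys)):
            ys[i] = prev[i - d] + prev[i]
        d *= 2
    return ys


values = [0.1, 0.2, 0.3]
seq = sequential_cumsum(values)  # last element computed as (0.1 + 0.2) + 0.3
par = scan_cumsum(values)        # last element computed as 0.1 + (0.2 + 0.3)

print(seq)
print(par)
print(seq == par)  # False: the last elements differ in the final bit
```

Both results are valid to within rounding error, which is why a difference like this only becomes visible when a run is compared bit-for-bit against another run that used a different configuration.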
Test Plan
Script from #25194
Test Result
Now always produces the same output.