Mamba2 mixer produces different outputs between fast, slow, training, and inference paths

### System Info

```
- huggingface_hub version: 0.31.4
- Platform: Linux-5.15.0-140-generic-x86_64-with-glibc2.35
- Python version: 3.13.0
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Running in Google Colab Enterprise ?: No
- Token path ?: /home/alberttseng/.cache/huggingface/token
- Has saved token ?: True
- Who am I ?: at676
- Configured git credential helpers: 
- FastAI: N/A
- Tensorflow: N/A
- Torch: 2.9.0.dev20250803+cu129
- Jinja2: 3.1.6
- Graphviz: N/A
- keras: N/A
- Pydot: N/A
- Pillow: 11.2.1
- hf_transfer: N/A
- gradio: N/A
- tensorboard: N/A
- numpy: 2.2.6
- pydantic: 2.10.3
- aiohttp: 3.11.18
- hf_xet: N/A
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: /home/alberttseng/.cache/huggingface/hub
- HF_ASSETS_CACHE: /home/alberttseng/.cache/huggingface/assets
- HF_TOKEN_PATH: /home/alberttseng/.cache/huggingface/token
- HF_STORED_TOKENS_PATH: /home/alberttseng/.cache/huggingface/stored_tokens
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10

```

### Who can help?

@ArthurZucker @Cyrilvallez 

### Information

- [ ] The official example scripts
- [x] My own modified scripts

### Tasks

- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [x] My own task or dataset (give details below)

### Reproduction

In https://github.com/huggingface/transformers/blob/0720e206c6ba28887e4d60ef60a6a089f6c1cc76/src/transformers/models/mamba2/modeling_mamba2.py#L656, when I pass the same input as `hidden_states` and leave all other arguments as `None`, I get different outputs from the fast path (`self.cuda_kernels_forward`) with `self.training = True`, the fast path with `self.training = False`, and the slow path.
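To make the mismatch easier to quantify, here is a minimal sketch of a pairwise comparison between the three outputs. It is not taken from the transformers code base: `max_abs_diff`, `compare_paths`, the path labels, and the `atol` default are all hypothetical names and values chosen for illustration, and the sketch assumes each output has already been flattened to Python floats (for example with `output.flatten().tolist()`).

```python
from itertools import combinations
from typing import Dict, Sequence, Tuple


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError(f"length mismatch: {len(a)} vs {len(b)}")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def compare_paths(
    outputs: Dict[str, Sequence[float]], atol: float = 1e-5
) -> Dict[Tuple[str, str], Tuple[float, bool]]:
    """Compare every pair of paths.

    Returns, per pair, the max absolute difference and whether it is within atol.
    The atol default is an arbitrary illustrative tolerance, not a library value.
    """
    report = {}
    for (name_a, out_a), (name_b, out_b) in combinations(outputs.items(), 2):
        diff = max_abs_diff(out_a, out_b)
        report[(name_a, name_b)] = (diff, diff <= atol)
    return report


# Placeholder numbers only, NOT real mixer outputs. In practice each entry
# would hold the flattened output of one path for the same hidden_states.
outputs = {
    "fast_train": [0.5, 1.0, -2.0],
    "fast_eval": [0.5, 1.0, -2.0],
    "slow": [0.5, 1.25, -2.0],
}

for pair, (diff, ok) in compare_paths(outputs).items():
    print(pair, f"max_abs_diff={diff:.3e}", "OK" if ok else "MISMATCH")
```

Reporting the maximum absolute difference per pair, rather than a single pass/fail, helps show whether the paths disagree at the level of floating-point noise or by a larger margin.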

### Expected behavior

All three paths should produce equivalent outputs.

mamba2 mixer different outputs between fast, slow, training, and inference paths. #41498

System Info

Who can help?

Information

Tasks

Reproduction

Expected behavior

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

mamba2 mixer different outputs between fast, slow, training, and inference paths. #41498

Description

System Info

Who can help?

Information

Tasks

Reproduction

Expected behavior

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions