
support flash-attn at torch backend #2257


Open
wants to merge 12 commits into master

Conversation

pass-lin
Contributor

Restarted from #2189.
Let's try to make the torch backend run flash-attn.

@sachinprasadhs sachinprasadhs added the kokoro:force-run Runs Tests on GPU label May 19, 2025
@kokoro-team kokoro-team removed the kokoro:force-run Runs Tests on GPU label May 19, 2025
@pass-lin
Contributor Author

This bug doesn't seem related to my PR, since I haven't made any relevant changes.
@sachinprasadhs

@mattdangerw
Member

@divyashreepathihalli can you take a look at this one?

self._num_key_value_heads = num_key_value_heads
self._sliding_window = sliding_window
self._dropout = dropout
self.num_query_heads = num_query_heads
Collaborator

what is the reason behind the renaming?

Contributor Author

@pass-lin pass-lin May 25, 2025

> what is the reason behind the renaming?

https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/mixtral/mixtral_attention.py
I'm just synchronizing this with the current code in the repository.

Collaborator

Oh okay, can you rebase your branch on master so that these don't show up as new changes?


def _use_fused_attention_op(self):
    if not fused_attention_op_available():
        return False
    if self.dropout > 0.0:
        return False
    if running_on_gpu():
        # GPU never supports softcap in the fused op.
        if self.logit_soft_cap is not None:
Collaborator

This needs to return False on the JAX backend.

Contributor Author

> This needs to return False on the JAX backend.

Mixtral never uses self.logit_soft_cap, so I don't understand what you mean.

Collaborator

I see! okay
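For context on the hunk above, here is a hedged sketch of how this kind of guard typically completes; everything past the truncated softcap check, including the final fallback, is an assumption for illustration, not this PR's actual code.

```python
def _use_fused_attention_op(self):
    # Fused (flash) attention only applies when the backend exposes it.
    if not fused_attention_op_available():
        return False
    # The fused op does not handle attention dropout in this path.
    if self.dropout > 0.0:
        return False
    if running_on_gpu():
        # GPU never supports softcap in the fused op.
        if self.logit_soft_cap is not None:
            return False
        return True
    # Assumed fallback: use the unfused attention path elsewhere.
    return False
```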

@@ -71,6 +71,23 @@ def fused_attention_op_available():
            )
            return False
        return True
    elif (
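The new `elif (` branch is truncated above. Purely as a hedged illustration of what a torch-backend availability check of this shape can look like (the PR's actual conditions are not visible here, and the helper name below is hypothetical):

```python
import keras


def torch_flash_attention_available():
    # Hypothetical helper, not keras-hub code: report whether the torch
    # backend can dispatch to a flash-attention kernel through
    # torch.nn.functional.scaled_dot_product_attention.
    if keras.config.backend() != "torch":
        return False
    import torch

    if not torch.cuda.is_available():
        return False
    # flash_sdp_enabled() reports whether PyTorch's flash SDP backend is
    # enabled for scaled_dot_product_attention.
    return torch.backends.cuda.flash_sdp_enabled()
```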
Collaborator

This looks good! Can you please enable this test
https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/gemma/gemma_causal_lm_test.py#L101
on the PyTorch backend and make sure it passes on a supported GPU? (This may not be supported on the T4 our CI uses, so a demo Colab showing the tests passing on a supported GPU would be great.)
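As an illustration of what "enable this on the PyTorch backend" usually amounts to, a backend-gated test skeleton is sketched below; the real test is the one linked above, and both the import path and the skip condition here are assumptions.

```python
import pytest
import keras

from keras_hub.src.utils.keras_utils import fused_attention_op_available


@pytest.mark.skipif(
    keras.config.backend() not in ("jax", "torch")
    or not fused_attention_op_available(),
    reason="Flash attention is unavailable on this backend/device.",
)
def test_flash_attention_call():
    # Build a small model and run a forward pass that should take the
    # fused attention path (details elided).
    ...
```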

Contributor Author

@pass-lin pass-lin May 25, 2025

> This looks good! Can you please enable this test https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/gemma/gemma_causal_lm_test.py#L101 on the PyTorch backend and make sure it passes on a supported GPU? (This may not be supported on the T4 our CI uses, so a demo Colab showing the tests passing on a supported GPU would be great.)

[screenshot]
These are the models that reference the fused_attention_op_available() function.
Here are the test results on an A100:
[screenshot]

Collaborator

@pass-lin the test has not been enabled on the PyTorch backend. Please refer to the comment above about enabling it.

Contributor Author

> @pass-lin the test has not been enabled on the PyTorch backend. Please refer to the comment above about enabling it.

I don't know whether you have tested this on an A100. At present the Gemma and Gemma3 flash-attn tests fail, on both the JAX and torch backends.
Could you instead design the tests around models like Qwen and Llama, which are better suited to flash-attn?

Collaborator

@pctablet505 - have you tested this? Can you please take a look?

Collaborator

I'm not sure about it; I'll have to look into it.

Contributor Author

@pass-lin pass-lin May 28, 2025

> @pctablet505 - have you tested this? Can you please take a look?

@pctablet505 @divyashreepathihalli
I can confirm this test is wrong: it is testing Gemma2, and Gemma2 does not support flash-attn.

Collaborator

@pctablet505 pctablet505 May 30, 2025

@pass-lin
I just verified that Gemma2 and Gemma3 can't use flash attention on an A100 GPU.
Gemma3 can use flash attention on TPU, or on GPUs with CUDA compute capability >= 9.0, i.e. the H series or later (for example, the H100).

#21333
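For anyone reproducing this locally, a quick way to check the compute-capability requirement with PyTorch (the >= 9.0 threshold comes from the comment above, not from this snippet):

```python
import torch

if torch.cuda.is_available():
    # get_device_capability() returns (major, minor), e.g. (8, 0) on A100
    # and (9, 0) on H100.
    major, minor = torch.cuda.get_device_capability()
    print(f"CUDA compute capability: {major}.{minor}")
    print("Meets the >= 9.0 requirement:", (major, minor) >= (9, 0))
else:
    print("No CUDA device available.")
```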

@divyashreepathihalli
Collaborator

Thanks for the PR, I left some comments.
