Speed improvement with new ROCm attention #8137
- Can you try a bigger resolution to see if it OOMs?
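If you want to probe that without running a full generation, here is a minimal sketch using naive attention (explicit score matrix, the memory-hungry case) as a stand-in for the model; the shapes are illustrative assumptions, not ComfyUI's actual code path:

```python
# Sketch: probe growing latent sizes until naive attention runs out of memory.
# Batch/head/dim values below are assumptions, not measured from the thread.
import torch

for side in (64, 96, 128, 152):          # latent side length; pixels = side * 8 for SDXL
    tokens = side * side
    try:
        q = torch.randn(2, 10, tokens, 64, device="cuda", dtype=torch.float16)
        scores = q @ q.transpose(-2, -1)  # (2, 10, tokens, tokens) attention score matrix
        torch.cuda.synchronize()
        print(f"{side * 8}x{side * 8}: ok, peak {torch.cuda.max_memory_allocated() / 2**30:.1f} GiB")
        del q, scores
        torch.cuda.empty_cache()
    except torch.cuda.OutOfMemoryError:
        print(f"{side * 8}x{side * 8}: OOM")
        break
```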
- While we're at it: quad attention is significantly faster on a 1216x832 SDXL (DPM++ sampler) than PyTorch cross attention and Flash Attention 2 (Triton backend) for my 6800 XT. Interestingly, I was getting better speeds with Forge than with Comfy for some reason (but Forge broke recently and I couldn't be bothered to fix it). This is on ROCm 6.4.1, the latest Ubuntu (with amdgpu-dkms modified to compile on the 6.14 kernel), and today's PyTorch nightly for ROCm 6.4. A rough standalone timing sketch follows below.
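For anyone who wants to reproduce this kind of backend comparison outside a UI, here is a rough sketch that times PyTorch's `scaled_dot_product_attention` per backend; the head count and head dim are assumptions approximating an SDXL attention layer at 1216x832, not the exact workload above:

```python
# Sketch: compare PyTorch SDPA backends at roughly SDXL-at-1216x832 scale.
import time

import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel  # PyTorch 2.3+

tokens = (1216 // 8) * (832 // 8)  # latent tokens at this resolution
q = torch.randn(2, 10, tokens, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

for backend in (SDPBackend.MATH, SDPBackend.EFFICIENT_ATTENTION, SDPBackend.FLASH_ATTENTION):
    try:
        with sdpa_kernel(backend):
            F.scaled_dot_product_attention(q, k, v)   # warm-up
            torch.cuda.synchronize()                  # the "cuda" namespace also covers ROCm builds
            t0 = time.perf_counter()
            for _ in range(20):
                F.scaled_dot_product_attention(q, k, v)
            torch.cuda.synchronize()
        print(f"{backend.name}: {(time.perf_counter() - t0) / 20 * 1e3:.2f} ms/call")
    except RuntimeError as err:                       # backend unavailable (or OOM) on this build
        print(f"{backend.name}: unavailable ({err})")
```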
- Since commit 08368f8 asks us to share speed improvements, here I am.
  I was somewhat surprised to see there's an actual speed improvement, considering I am using a Radeon VII, which doesn't even have tensor cores.
  512x512, batch size 2, 20 steps, SDXL:
  Original sub-quadratic attention: 21.0 seconds
  New PyTorch cross attention: 19.2 seconds
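If anyone else wants to confirm which fused SDPA kernels their ROCm build exposes before benchmarking, a small sketch (standard PyTorch introspection calls, nothing specific to this thread):

```python
# Sketch: report the PyTorch build and which SDPA backends are currently enabled.
import torch

print(torch.__version__, "HIP:", torch.version.hip)  # torch.version.hip is None on non-ROCm builds
print("flash_sdp enabled:", torch.backends.cuda.flash_sdp_enabled())
print("mem_efficient_sdp enabled:", torch.backends.cuda.mem_efficient_sdp_enabled())
print("math_sdp enabled:", torch.backends.cuda.math_sdp_enabled())
```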