
Conversation

admitric

This PR adds a FlexAttention autotuner config that can show better performance on one of the shapes.
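
For reviewers who have not looked at this path before: the backend keeps a list of FlexConfig candidates (block sizes, pipeline stages, warps) that the FlexAttention autotuner benchmarks per shape, and this PR extends that list with one more candidate. The sketch below is illustrative only; the stand-in class, the accessor name, and the assumption that the added entry is FlexConfig(64, 32, 2, 4) come from the discussion in this thread, not from the literal diff.

```python
from dataclasses import dataclass

# Stand-in for the FlexConfig used by the FlexAttention templates; the field
# order (block_m, block_n, num_stages, num_warps) is assumed from the values
# quoted later in this thread.
@dataclass(frozen=True)
class FlexConfig:
    block_m: int      # query-tile rows per program
    block_n: int      # key/value-tile columns per program
    num_stages: int   # software-pipelining stages
    num_warps: int    # warps (sub-groups) per work-group

# Hypothetical accessor; the real candidate list lives in the backend's
# template heuristics for XPU.
def xpu_flex_attention_fwd_configs() -> list[FlexConfig]:
    return [
        FlexConfig(128, 64, 2, 8),  # baseline config compared against below
        FlexConfig(64, 32, 2, 4),   # additional candidate discussed in this thread
    ]
```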

@anmyachev
Contributor

@chengjunlu @whitneywhtsang could you take a look?

@whitneywhtsang
Contributor

whitneywhtsang commented Oct 15, 2025

@admitric I tried measuring performance with the additional config on PVC and BMG, but saw no difference on the default runners. I suspect the improvement can only be observed with an updated driver; I will kick off a run to confirm once runners with updated drivers are available. Where do you expect the performance improvement? On BMG?

whitneywhtsang self-requested a review October 15, 2025 14:50
@admitric
Author

PVC on shape [1, 128, 128, 1024, 1024, 192, 128]
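
(For readers scanning the tables below, the shape list follows the column order [Z, H_q, H_kv, N_CTX_q, N_CTX_kv, D_HEAD_qk, D_HEAD_v]; the tensor-layout mapping in this small sketch is my assumption, not quoted from the benchmark source.)

```python
# Assumed mapping of the benchmark shape tuple onto FlexAttention inputs
# (the batch, heads, sequence, head_dim layout is an assumption):
Z, H_q, H_kv, N_CTX_q, N_CTX_kv, D_HEAD_qk, D_HEAD_v = 1, 128, 128, 1024, 1024, 192, 128

q_shape = (Z, H_q,  N_CTX_q,  D_HEAD_qk)   # query
k_shape = (Z, H_kv, N_CTX_kv, D_HEAD_qk)   # key
v_shape = (Z, H_kv, N_CTX_kv, D_HEAD_v)    # value
```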

@whitneywhtsang
Contributor

whitneywhtsang commented Oct 15, 2025

PVC on shape [1, 128, 128, 1024, 1024, 192, 128]

@admitric No performance improvement is observed with agama 1188; do we also need vectorization enabled?
On PVC:
Before this PR: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/18537552569
After this PR: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/18536568349

| Z | H_q | H_kv | N_CTX_q | N_CTX_kv | D_HEAD_qk | D_HEAD_v | MODE | Before | After | Ratio |
|---|-----|------|---------|----------|-----------|----------|------|--------|-------|-------|
| 1 | 32  | 32   | 1024    | 1024     | 96        | 96       | fwd  | 40.02019 | 39.65073 | 0.990768 |
| 1 | 128 | 128  | 1024    | 1024     | 192       | 128      | fwd  | 68.04017 | 67.29814 | 0.989094 |
| 1 | 32  | 32   | 512     | 1664     | 96        | 96       | fwd  | 93.87538 | 93.33971 | 0.994294 |
| 1 | 128 | 1    | 512     | 1664     | 64        | 512      | fwd  | 69.27902 | 61.22093 | 0.883686 |
| 1 | 32  | 8    | 1024    | 1024     | 128       | 128      | fwd  | 50.84005 | 50.93652 | 1.001897 |
| 1 | 28  | 4    | 1024    | 1024     | 128       | 128      | fwd  | 46.50534 | 46.96446 | 1.009873 |
| 1 | 32  | 8    | 512     | 1664     | 128       | 128      | fwd  | 122.187  | 121.591  | 0.995122 |
| 1 | 28  | 4    | 512     | 1664     | 128       | 128      | fwd  | 113.9348 | 112.7984 | 0.990026 |
| 1 | 32  | 8    | 1       | 1088     | 128       | 128      | fwd  | 0.371991 | 0.36983  | 0.994191 |

(Ratio is the After column divided by the Before column.)

@chengjunlu
Contributor

Changes look good to me.

@admitric
Author

I have different results on PVC, and they reproduce stably. I am using the Triton main branch (at commit 2908846).
My data (machine DUT1005-PVC):
hardcoded FlexConfig(128, 64, 2, 8), agama-1188, spill size 1344

| Z | H_q | H_kv | N_CTX_q | N_CTX_kv | D_HEAD_qk | D_HEAD_v | MODE | Triton-GB/s | Torch-GB/s | Triton-GB/s-min | Torch-GB/s-min | Triton-GB/s-max | Torch-GB/s-max | Triton-TFlops | Torch-TFlops | Triton-TFlops-min | Torch-TFlops-min | Triton-TFlops-max | Torch-TFlops-max | Triton-CV | Torch-CV |
|---|-----|------|---------|----------|-----------|----------|------|-------------|------------|-----------------|----------------|-----------------|----------------|---------------|--------------|-------------------|------------------|-------------------|------------------|-----------|----------|
| 1 | 128 | 128  | 1024    | 1024     | 192       | 128      | fwd  | 133.007351  | 10.326847  | 129.975336      | 10.23494       | 135.321954      | 10.51606       | 42.562352     | 3.304591     | 41.592107         | 3.275181         | 43.303025         | 3.365139         | 0.011254  | 0.007157 |

hardcoded [FlexConfig(64, 32, 2, 4)], agama-1188, spill size 0

| Z | H_q | H_kv | N_CTX_q | N_CTX_kv | D_HEAD_qk | D_HEAD_v | MODE | Triton-GB/s | Torch-GB/s | Triton-GB/s-min | Torch-GB/s-min | Triton-GB/s-max | Torch-GB/s-max | Triton-TFlops | Torch-TFlops | Triton-TFlops-min | Torch-TFlops-min | Triton-TFlops-max | Torch-TFlops-max | Triton-CV | Torch-CV |
|---|-----|------|---------|----------|-----------|----------|------|-------------|------------|-----------------|----------------|-----------------|----------------|---------------|--------------|-------------------|------------------|-------------------|------------------|-----------|----------|
| 1 | 128 | 128  | 1024    | 1024     | 192       | 128      | fwd  | 172.529671  | 10.363806  | 166.308649      | 10.237813      | 175.824937      | 10.468552      | 55.209495     | 3.316418     | 53.218768         | 3.2761           | 56.26398          | 3.349937         | 0.014669  | 0.005706 |
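
Putting the two runs side by side, the smaller-tile config comes out roughly 30% faster on this shape. The quick check below only recomputes the ratio from the Triton-GB/s and Triton-TFlops values quoted above:

```python
# Derived from the two tables above: FlexConfig(64, 32, 2, 4) run vs.
# FlexConfig(128, 64, 2, 8) run on shape [1, 128, 128, 1024, 1024, 192, 128].
gbps_baseline, gbps_proposed = 133.007351, 172.529671
tflops_baseline, tflops_proposed = 42.562352, 55.209495

print(gbps_proposed / gbps_baseline)      # ~1.297x throughput
print(tflops_proposed / tflops_baseline)  # ~1.297x compute rate
```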
