Use Cache Hinting for fused_moe kernel #15511

wrmedford · 2025-03-26T01:40:52Z

This change is forward compatible with triton-lang/triton#6278. The "After" benchmark below shows the results with that Triton PR applied.

Before

============ Serving Benchmark Result ============
Successful requests:                     500
Benchmark duration (s):                  116.83
Total input tokens:                      50000
Total generated tokens:                  500000
Request throughput (req/s):              4.28
Output token throughput (tok/s):         4279.83
Total Token throughput (tok/s):          4707.81
---------------Time to First Token----------------
Mean TTFT (ms):                          161.36
Median TTFT (ms):                        133.79
P99 TTFT (ms):                           1117.93
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          72.26
Median TPOT (ms):                        72.86
P99 TPOT (ms):                           75.91
---------------Inter-token Latency----------------
Mean ITL (ms):                           72.19
Median ITL (ms):                         72.33
P99 ITL (ms):                            116.49
==================================================

After

============ Serving Benchmark Result ============
Successful requests:                     500
Benchmark duration (s):                  116.62
Total input tokens:                      50000
Total generated tokens:                  500000
Request throughput (req/s):              4.29
Output token throughput (tok/s):         4287.33
Total Token throughput (tok/s):          4716.07
---------------Time to First Token----------------
Mean TTFT (ms):                          141.31
Median TTFT (ms):                        131.75
P99 TTFT (ms):                           469.73
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          71.41
Median TPOT (ms):                        72.11
P99 TPOT (ms):                           74.47
---------------Inter-token Latency----------------
Mean ITL (ms):                           71.34
Median ITL (ms):                         71.13
P99 ITL (ms):                            111.78
==================================================

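Slight improvement, mostly in latency metrics: mean TTFT drops from 161.36 ms to 141.31 ms and P99 TTFT from 1117.93 ms to 469.73 ms, while throughput is essentially unchanged.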
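To put the two runs side by side, here is a minimal sketch that computes the relative change for a few of the metrics. The values are copied from the Before and After outputs above; the dictionary layout and the `percent_change` helper are illustrative only and are not part of vLLM's benchmark tooling.

```python
# Values copied from the "Before" and "After" benchmark outputs in this PR.
before = {
    "Output token throughput (tok/s)": 4279.83,
    "Mean TTFT (ms)": 161.36,
    "P99 TTFT (ms)": 1117.93,
    "Mean TPOT (ms)": 72.26,
    "P99 ITL (ms)": 116.49,
}
after = {
    "Output token throughput (tok/s)": 4287.33,
    "Mean TTFT (ms)": 141.31,
    "P99 TTFT (ms)": 469.73,
    "Mean TPOT (ms)": 71.41,
    "P99 ITL (ms)": 111.78,
}


def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent (negative means a decrease)."""
    return (new - old) / old * 100.0


# Hypothetical helper structure: metric name -> percent change between runs.
changes = {name: percent_change(before[name], after[name]) for name in before}

for name, pct in changes.items():
    print(f"{name:<34} {before[name]:>9.2f} -> {after[name]:>9.2f}  ({pct:+.1f}%)")
```

For the latency rows a negative percentage is an improvement; for throughput a positive one is. These are single runs, so small differences may be within run-to-run noise.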

Interface already exists, so this is safe to merge now.

Big thanks to @Apsu on all of the help here!

github-actions · 2025-03-26T01:41:00Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

youkaichao · 2025-03-26T14:08:02Z

@LucasWilkinson @bnellnm can you help review?

LucasWilkinson · 2025-03-26T16:45:53Z

Seems straightforward enough. This shouldn't cause any issues on non-Nvidia hardware, i.e. these parameters are just ignored, right?

bnellnm

lgtm

LucasWilkinson

LGTM assuming:

> This shouldn't cause any issues on non-Nvidia hardware, i.e. these parameters are just ignored, right?

wrmedford · 2025-03-26T18:16:10Z

> Seems straightforward enough. This shouldn't cause any issues on non-Nvidia hardware, i.e. these parameters are just ignored, right?

Afaik these bindings have existed in triton for a while and it's up to the backend to implement them. Otherwise they're ignored.

DefTruth · 2025-03-27T13:02:13Z

The latest stable version of Triton doesn't contain triton-lang/triton#6278.
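One possible way to avoid depending on an unreleased Triton change is to gate the feature on the installed Triton version. The sketch below is not what this PR or its revert does. `MIN_TRITON_FOR_CACHE_HINTS`, `parse_version`, and `cache_hints_enabled` are hypothetical names, and the threshold is a placeholder because the thread does not say which Triton release first includes triton-lang/triton#6278.

```python
from importlib import metadata
from typing import Optional, Tuple

# Placeholder threshold (assumption): the thread does not state which Triton
# release first contains triton-lang/triton#6278, so this must be replaced.
MIN_TRITON_FOR_CACHE_HINTS: Tuple[int, ...] = (99, 0, 0)


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '3.2.0', '3.2.0+git1234' or '3.3.0rc1' into a tuple of leading integers."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def installed_triton_version() -> Optional[str]:
    """Return the installed Triton version string, or None if Triton is absent."""
    try:
        return metadata.version("triton")
    except metadata.PackageNotFoundError:
        return None


def cache_hints_enabled(installed: Optional[str],
                        minimum: Tuple[int, ...] = MIN_TRITON_FOR_CACHE_HINTS) -> bool:
    """True only when the installed version is known and meets the minimum."""
    if installed is None:
        return False
    return parse_version(installed) >= minimum


if __name__ == "__main__":
    print("cache hints enabled:", cache_hints_enabled(installed_triton_version()))
```

A flag like this could then select between a kernel variant with cache hints and one without. How that selection would be wired into the fused_moe kernel is left open here.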

wrmedford added 2 commits March 25, 2025 19:55

(feat) add cache hints to experts in fused_moe kernel

8a45c2b

Signed-off-by: Wes Medford <[email protected]>

(lint) fix formatting

b7a2b72

Signed-off-by: Wes Medford <[email protected]>

wrmedford force-pushed the main branch from 77a57e8 to b7a2b72 Compare March 26, 2025 01:55

bnellnm approved these changes Mar 26, 2025

LucasWilkinson approved these changes Mar 26, 2025

LucasWilkinson enabled auto-merge (squash) March 26, 2025 19:58

github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Mar 26, 2025

LucasWilkinson merged commit 7a88827 into vllm-project:main Mar 26, 2025
41 of 42 checks passed

Qubitium mentioned this pull request Mar 27, 2025

[Bug]: Triton JIT Compile Regression from PR 15511 #15619

Closed

1 task

wrmedford added a commit to wrmedford/vllm that referenced this pull request Mar 27, 2025

Revert "Use Cache Hinting for fused_moe kernel (vllm-project#15511)"

2192356

This reverts commit 7a88827. Signed-off-by: Wes Medford <[email protected]>

vllm-bot pushed a commit that referenced this pull request Mar 28, 2025

Revert "Use Cache Hinting for fused_moe kernel (#15511)" (#15645)

4ae17bf

Signed-off-by: Wes Medford <[email protected]>

Alex4210987 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Apr 5, 2025

Use Cache Hinting for fused_moe kernel (vllm-project#15511)

463c71a

Signed-off-by: xinyuxiao <[email protected]>

lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025

Use Cache Hinting for fused_moe kernel (vllm-project#15511)

0a9f2e4

Signed-off-by: Louis Ulmer <[email protected]>

lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025

Revert "Use Cache Hinting for fused_moe kernel (vllm-project#15511)" (v…

4771a27

…llm-project#15645) Signed-off-by: Wes Medford <[email protected]> Signed-off-by: Louis Ulmer <[email protected]>

ckhordiasma mentioned this pull request Apr 17, 2025

[do not merge] pr test for nm changes into 2.20 red-hat-data-services/vllm#107

Closed

lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025

Use Cache Hinting for fused_moe kernel (vllm-project#15511)

56a2eca

lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025

Revert "Use Cache Hinting for fused_moe kernel (vllm-project#15511)" (v…

93b1eac

…llm-project#15645) Signed-off-by: Wes Medford <[email protected]>

shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025

Use Cache Hinting for fused_moe kernel (vllm-project#15511)

7156dad

shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025

Revert "Use Cache Hinting for fused_moe kernel (vllm-project#15511)" (v…

bf69a2f

…llm-project#15645) Signed-off-by: Wes Medford <[email protected]>

RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025

Use Cache Hinting for fused_moe kernel (vllm-project#15511)

3ead902

Signed-off-by: Mu Huai <[email protected]>

RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025

Revert "Use Cache Hinting for fused_moe kernel (vllm-project#15511)" (v…

2558499

…llm-project#15645) Signed-off-by: Wes Medford <[email protected]> Signed-off-by: Mu Huai <[email protected]>
