[FEAT][ROCm]: Support AITER MLA on V1 Engine #17523
Conversation
POLLING_TIMEOUT_S = POLLING_TIMEOUT_MS // 1000

-EXECUTE_MODEL_TIMEOUT_S = 40
+EXECUTE_MODEL_TIMEOUT_S = (envs.VLLM_ROCM_EXECUTE_MODEL_TIMEOUT
Wondering why ROCm needs a much larger timeout here?
On the first run, when the graph is being built, it can take between 100 and 250 seconds depending on how many AITER kernels are enabled. That is why we set the default timeout to 250 s.
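A minimal sketch of the direction this thread converged on (a larger hardcoded timeout on ROCm rather than a user-facing environment variable); the constant name follows the diff above, while the platform check, import path, and the 250 s value are assumptions for illustration, not the merged code:

    # Sketch only: EXECUTE_MODEL_TIMEOUT_S mirrors the constant in the diff above.
    # The ROCm check and the 250 s value (covering first-run AITER JIT compilation
    # and graph capture) are assumptions based on this discussion.
    from vllm.platforms import current_platform  # assumed import path

    EXECUTE_MODEL_TIMEOUT_S = 250 if current_platform.is_rocm() else 40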
I'm not crazy about requiring another environment variable when running AITER. Can you just set the timeout to 250 here instead of asking the user to increase the timeout? Feel free to give a "safe" timeout.
Will these very long runs only happen during the profile and graph capture runs? Or can they happen while processing real requests?
Could you check the failed tests?
Approve with comment.
To make DeepSeek on the V1 engine performant, additional work is needed.
Based on my testing, TTFT improves if additional AITER environment variables are used; otherwise TTFT is not as good as on V0. Throughput is also not yet on par with V0.
    compute_slot_mapping_start_idx,
    is_block_tables_empty)
-from vllm.attention.ops.rocm_aiter_mla import (aiter_mla_decode_fwd,
+from vllm.attention.ops.rocm_aiter_mla import (aiter_mla_decode_forward,
nit: the rename from fwd -> forward seems unnecessary; dropping it would minimize the number of files changed in this PR.
it has been addressed in the latest commit.
vllm/attention/ops/rocm_aiter_mla.py

-def aiter_mla_decode_fwd(
+def aiter_mla_decode_forward(
We can keep the name as _fwd (see below line 37 decode_fwd)
@houseroad I found that using the environment variables below fixes the huge TTFT issue described in the PR.
My command is below:
For input-len/output-len/concurrency/prompts of 1000/1000/1/2, TTFT dropped from 57419.91 to 154.8.
Looks reasonable. Just a few nits and questions
        from aiter import flash_attn_varlen_func
        self.flash_attn_varlen_func = flash_attn_varlen_func

    def _flash_attn_varlen_diff_headdims(self,
Where is this used?
@SageMoore you may want to check common.py: the method _flash_attn_varlen_diff_headdims is defined there and overridden in this class.
I see now. I must have mistyped the string when I searched for it :).
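For context, a rough sketch of the override pattern being discussed; the method name and the AITER import come from the diff above, but the class name, signature, and body are simplified assumptions rather than the PR's actual code:

    # Simplified sketch, not the PR's actual implementation.
    class AiterMLAImpl(MLACommonImpl):  # base-class name assumed
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            from aiter import flash_attn_varlen_func  # as in the diff above
            self.flash_attn_varlen_func = flash_attn_varlen_func

        def _flash_attn_varlen_diff_headdims(self, q, k, v, **kwargs):
            # Override of the hook defined in mla/common.py: route the varlen
            # prefill attention call to the AITER kernel instead of vLLM's
            # default flash-attention path.
            return self.flash_attn_varlen_func(q, k, v, **kwargs)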
assert max_model_len == 32768, \
    "AITER MLA requires max_model_len=32768"
assert self.runner.block_size == 1, \
    "AITER MLA requires only block size 1."
Nit: "only supports block size 1."
vllm/envs.py

    lambda: int(os.getenv("VLLM_RPC_TIMEOUT", "10000")),

    # Time in seconds for the model execution in ROCm platforms.
    "VLLM_ROCM_EXECUTE_MODEL_TIMEOUT":
Let's remove this. See below comment.
@SageMoore At the moment we can't determine a "safe" timeout: it depends on how many AITER kernels are enabled, tracing the AITER JIT files can be time-consuming at execution time, and the cost may change as AITER ops evolve across future versions. So rather than hardcoding a timeout that users would have to hunt down in the code, we let them control the value through an environment variable.
@SageMoore The environment variable has been removed.
@SageMoore: the env change has been removed as we discussed. Please merge this ASAP if there are no other blockers.
# `context_chunk_starts` that are not aligned to page_size
max_context_chunk = round_down(max_context_chunk,
                               self.page_size)
if self.aot_schedule:
Could you explain this a bit? Why was this change necessary?
@SageMoore self.page_size is only defined in __init__ under the self.aot_schedule condition; on ROCm that condition is false, so the code hits an error that self.page_size is not defined.

vllm/vllm/v1/attention/backends/mla/common.py, lines 355 to 359 in ba7703e:

self.aot_schedule = is_vllm_fa and (get_flash_attn_version() == 3)
# Dont try to access the runner on AMD
if self.aot_schedule:
    self.page_size = self.runner.block_size

You may want to ask the author about this, since these line changes were added in this PR. In any case, if self.page_size is defined without the self.aot_schedule condition, it has no effect on ROCm, at least for AITER MLA, which is currently the only MLA backend in V1.
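A minimal sketch of what this change amounts to in the chunked-context setup (identifiers taken from the diff and the quoted snippet above; simplified relative to the actual file):

    # Sketch of the guard added in this PR: self.page_size is only set in
    # __init__ when self.aot_schedule is True, so the page-size rounding is
    # skipped on ROCm, where aot_schedule is False and page_size is undefined.
    if self.aot_schedule:
        # round down so `context_chunk_starts` stay aligned to page_size
        max_context_chunk = round_down(max_context_chunk, self.page_size)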
This pull request has merge conflicts that must be resolved before it can be merged.
Looks reasonable. Thanks for taking out the timeout changes!
It seems that this PR introduced a static check error. https://github.com/vllm-project/vllm/actions/runs/14923384538/job/41922811885?pr=17845
Fixed by #17880.
AITER MLA Support for V1 Engine
This PR implements AITER MLA attention backend support for the V1 engine. The implementation mirrors the V0 engine's established approach from PR #15893.
This PR also introduces a new environment variable, VLLM_ROCM_EXECUTE_MODEL_TIMEOUT, which specifies the model execution timeout in seconds. This allows flexible adjustment of the timeout, which is helpful because a timeout error was encountered during graph building when enabling AITER MLA ops on the V1 engine.

Accuracy Validation
Validated using the command below:
VLLM_ATTENTION_BACKEND=ROCM_AITER_MLA VLLM_USE_V1=1 lm_eval \
  --model vllm \
  --model_args pretrained=deepseek-ai/DeepSeek-V3,tensor_parallel_size=8,trust_remote_code=True,max_model_len=32768,block_size=1,enforce_eager=False \
  --tasks gsm8k --num_fewshot 5 --batch_size auto
Results:
Performance:
The results of benchmarks/benchmark_serving.py using the commands below:
v0 engine =
VLLM_ATTENTION_BACKEND=ROCM_AITER_MLA VLLM_USE_V1=0 python benchmarks/benchmark_serving.py --model deepseek-ai/DeepSeek-V3 --trust-remote-code --dataset-name random
v1 engine =
VLLM_ATTENTION_BACKEND=ROCM_AITER_MLA VLLM_USE_V1=1 python benchmarks/benchmark_serving.py --model deepseek-ai/DeepSeek-V3 --trust-remote-code --dataset-name random
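For reference, a hypothetical offline-inference sketch mirroring the flags used in the accuracy run above (backend selection via environment variables, tensor_parallel_size=8, block_size=1, max_model_len=32768); it is illustrative only and not taken from the PR:

    # Hypothetical usage sketch based on the flags shown above; not from the PR.
    import os
    os.environ["VLLM_USE_V1"] = "1"
    os.environ["VLLM_ATTENTION_BACKEND"] = "ROCM_AITER_MLA"

    from vllm import LLM, SamplingParams

    llm = LLM(model="deepseek-ai/DeepSeek-V3",
              tensor_parallel_size=8,
              trust_remote_code=True,
              max_model_len=32768,
              block_size=1)
    outputs = llm.generate(["What is 2 + 2?"],
                           SamplingParams(max_tokens=32))
    print(outputs[0].outputs[0].text)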