[CI][Spec Decode] Adjust threshold for flaky ngram spec decoding test again #24771
Conversation
Code Review
This pull request adjusts the threshold for a flaky test related to n-gram speculative decoding to fix CI failures. My review highlights a concern that this is a temporary fix for a potentially deeper issue. The test uses deterministic sampling parameters (temperature=0) but shows non-deterministic behavior, which could point to a bug. I've added a comment recommending an investigation into the root cause of the flakiness rather than repeatedly lowering the test's threshold, as this could hide future regressions.
tests/v1/e2e/test_spec_decode.py
Outdated
While lowering the threshold might fix the immediate flakiness in CI, this is the second time this has been done for this test. This pattern of repeatedly reducing the threshold is concerning as it can mask underlying bugs or performance regressions in the speculative decoding implementation.
Given that temperature=0 is used, the model's output should be deterministic. The fact that there are discrepancies between the reference and speculative outputs suggests a potential issue.
Instead of just lowering the threshold, it would be better to investigate the root cause of this non-determinism. Possible areas to investigate:
- Are there any non-deterministic operations in the model or CUDA kernels that are not properly seeded or configured for deterministic execution?
- Could there be floating-point precision differences that are causing the outputs to diverge?
- Is there a subtle bug in the n-gram speculative decoding logic that causes it to not be equivalent to the reference implementation in some cases?
Addressing the root cause would lead to a more robust and reliable test.
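For context, the check under discussion is essentially a match-count assertion between the reference and speculative outputs. The sketch below is only illustrative: the function name, variable names, and the 0.66 ratio are assumptions, not values copied from tests/v1/e2e/test_spec_decode.py.

```python
# Simplified sketch of the kind of threshold check being adjusted in this PR.
# Names and the 0.66 ratio are illustrative assumptions, not the real test.
def check_ngram_correctness(ref_outputs: list[str], spec_outputs: list[str]) -> None:
    # With temperature=0 (greedy decoding), both runs should be deterministic,
    # so ideally every output pair would match exactly.
    matches = sum(ref == spec for ref, spec in zip(ref_outputs, spec_outputs))
    # In practice a tolerance is allowed; lowering this ratio is what this PR
    # (and the previous one) does to keep CI green.
    assert matches > int(0.66 * len(ref_outputs)), (
        f"only {matches}/{len(ref_outputs)} outputs matched the reference")
```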
Signed-off-by: wwl2755 <[email protected]>
Totally agreed with you!
But actually I think the ngram proposer only runs on the CPU and should not introduce any randomness? I can open up an issue to track it. (#24777)
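For reference, a minimal sketch of what an n-gram draft proposer does: it is a pure lookup over the existing token ids with no sampling involved, which is why it should not introduce randomness on its own. This is an illustration only, not vLLM's implementation.

```python
# Minimal, deterministic n-gram draft proposer sketch (illustrative only).
from typing import Optional


def propose_ngram_draft(token_ids: list[int], n: int, k: int) -> Optional[list[int]]:
    """Return up to k draft tokens by matching the last n tokens against
    earlier context; return None if no prior occurrence is found."""
    if len(token_ids) < n + 1:
        return None
    suffix = token_ids[-n:]
    # Scan backwards for the most recent earlier occurrence of the suffix.
    for start in range(len(token_ids) - n - 1, -1, -1):
        if token_ids[start:start + n] == suffix:
            follow = token_ids[start + n:start + n + k]
            return follow if follow else None
    return None


# Example: the suffix [3, 4] occurred earlier, followed by 5 and 6.
print(propose_ngram_draft([1, 2, 3, 4, 5, 6, 3, 4], n=2, k=2))  # [5, 6]
```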
Retrying the flaky test. This test didn't fail that much before; I wonder what changed?
It passed this time.
Thanks for retrying that! I have no idea what exactly caused this. It seems something randomly went wrong during/after loading the weights. Anyway, I think it shouldn't be related to the change in this PR. 😂
tests/v1/e2e/test_spec_decode.py::test_ngram_correctness is still flaky after #24528. Example failures at main:
https://buildkite.com/vllm/ci/builds/30544#01993eac-288c-4699-af07-f991b95918c0,
https://buildkite.com/vllm/ci/builds/30529#01993e6c-3c5f-4de7-a7b2-cdd4ca680d6d,
https://buildkite.com/vllm/ci/builds/30507#01993dac-fe3a-4f99-b106-97c65b59a783,
https://buildkite.com/vllm/ci/builds/30434#01993aeb-c38f-4277-b736-f740bf4e548a
I ran it locally multiple times (>5) and the lowest match count I have seen was 66.
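A rough sketch of how one might repeat the run locally to gauge the flakiness (assumes a local vLLM dev checkout with the test dependencies and pytest installed; purely illustrative):

```python
# Hypothetical helper: rerun the ngram correctness test several times via
# pytest and tally failures. Assumes a local vLLM checkout with test deps.
import subprocess

TEST = "tests/v1/e2e/test_spec_decode.py::test_ngram_correctness"
RUNS = 10

failures = 0
for i in range(RUNS):
    result = subprocess.run(["pytest", "-q", TEST])
    if result.returncode != 0:
        failures += 1
    print(f"run {i + 1}/{RUNS}: {'passed' if result.returncode == 0 else 'FAILED'}")

print(f"{failures}/{RUNS} runs failed")
```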
Related: #24314
CC: @njhill @markmc @ekagra-ranjan @WoosukKwon @LiuXiaoxuanPKU