[CI][Spec Decode] Adjust threshold for flaky ngram spec decoding test again #24771
Conversation
Code Review
This pull request adjusts the threshold for a flaky test related to n-gram speculative decoding to fix CI failures. My review highlights a concern that this is a temporary fix for a potentially deeper issue. The test uses deterministic sampling parameters (temperature=0) but shows non-deterministic behavior, which could point to a bug. I've added a comment recommending an investigation into the root cause of the flakiness rather than repeatedly lowering the test's threshold, as this could hide future regressions.
tests/v1/e2e/test_spec_decode.py
Outdated
While lowering the threshold might fix the immediate flakiness in CI, this is the second time this has been done for this test. This pattern of repeatedly reducing the threshold is concerning as it can mask underlying bugs or performance regressions in the speculative decoding implementation.
Given that temperature=0 is used, the model's output should be deterministic. The fact that there are discrepancies between the reference and speculative outputs suggests a potential issue.
Instead of just lowering the threshold, it would be better to investigate the root cause of this non-determinism. Possible areas to investigate:
- Are there any non-deterministic operations in the model or CUDA kernels that are not properly seeded or configured for deterministic execution?
- Could there be floating-point precision differences that are causing the outputs to diverge?
- Is there a subtle bug in the n-gram speculative decoding logic that causes it to not be equivalent to the reference implementation in some cases?
Addressing the root cause would lead to a more robust and reliable test.
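For context, the check under discussion is essentially a match-count assertion between the reference and speculative outputs. The sketch below is only illustrative: the function name, variable names, and the 0.66 ratio are assumptions, not values copied from tests/v1/e2e/test_spec_decode.py.

```python
# Simplified sketch of the kind of threshold check being adjusted in this PR.
# Names and the 0.66 ratio are illustrative assumptions, not the real test.
def check_ngram_correctness(ref_outputs: list[str], spec_outputs: list[str]) -> None:
    # With temperature=0 (greedy decoding), both runs should be deterministic,
    # so ideally every output pair would match exactly.
    matches = sum(ref == spec for ref, spec in zip(ref_outputs, spec_outputs))
    # In practice a tolerance is allowed; lowering this ratio is what this PR
    # (and the previous one) does to keep CI green.
    assert matches > int(0.66 * len(ref_outputs)), (
        f"only {matches}/{len(ref_outputs)} outputs matched the reference")
```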
Signed-off-by: wwl2755 <[email protected]>
Totally agreed with you!
But actually I think the ngram proposer only runs on the CPU and should not introduce any randomness? I can open up an issue to track it. (#24777)
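For reference, a minimal sketch of what an n-gram draft proposer does: it is a pure lookup over the existing token ids with no sampling involved, which is why it should not introduce randomness on its own. This is an illustration only, not vLLM's implementation.

```python
# Minimal, deterministic n-gram draft proposer sketch (illustrative only).
from typing import Optional


def propose_ngram_draft(token_ids: list[int], n: int, k: int) -> Optional[list[int]]:
    """Return up to k draft tokens by matching the last n tokens against
    earlier context; return None if no prior occurrence is found."""
    if len(token_ids) < n + 1:
        return None
    suffix = token_ids[-n:]
    # Scan backwards for the most recent earlier occurrence of the suffix.
    for start in range(len(token_ids) - n - 1, -1, -1):
        if token_ids[start:start + n] == suffix:
            follow = token_ids[start + n:start + n + k]
            return follow if follow else None
    return None


# Example: the suffix [3, 4] occurred earlier, followed by 5 and 6.
print(propose_ngram_draft([1, 2, 3, 4, 5, 6, 3, 4], n=2, k=2))  # [5, 6]
```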
Retrying the flaky test. This test didn't fail that much before; I wonder what changed?
It passed this time.
Thanks for retrying that! I have no idea what exactly caused this. It seems something randomly went wrong during/after loading the weights. Anyway, I think it shouldn't be related to the change in this PR. 😂
tests/v1/e2e/test_spec_decode.py::test_ngram_correctness is still flaky after #24528. Example failures at main:
https://buildkite.com/vllm/ci/builds/30544#01993eac-288c-4699-af07-f991b95918c0,
https://buildkite.com/vllm/ci/builds/30529#01993e6c-3c5f-4de7-a7b2-cdd4ca680d6d,
https://buildkite.com/vllm/ci/builds/30507#01993dac-fe3a-4f99-b106-97c65b59a783,
https://buildkite.com/vllm/ci/builds/30434#01993aeb-c38f-4277-b736-f740bf4e548a
I ran it locally multiple times (>5) and the lowest match count I have seen was 66.
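A rough sketch of how one might repeat the run locally to gauge the flakiness (assumes a local vLLM dev checkout with the test dependencies and pytest installed; purely illustrative):

```python
# Hypothetical helper: rerun the ngram correctness test several times via
# pytest and tally failures. Assumes a local vLLM checkout with test deps.
import subprocess

TEST = "tests/v1/e2e/test_spec_decode.py::test_ngram_correctness"
RUNS = 10

failures = 0
for i in range(RUNS):
    result = subprocess.run(["pytest", "-q", TEST])
    if result.returncode != 0:
        failures += 1
    print(f"run {i + 1}/{RUNS}: {'passed' if result.returncode == 0 else 'FAILED'}")

print(f"{failures}/{RUNS} runs failed")
```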
Related: #24314
CC: @njhill @markmc @ekagra-ranjan @WoosukKwon @LiuXiaoxuanPKU