Conversation

@wwl2755 (Contributor) commented Sep 12, 2025

@mergify mergify bot added the v1 label Sep 12, 2025
@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request adjusts the threshold for a flaky test related to n-gram speculative decoding to fix CI failures. My review highlights a concern that this is a temporary fix for a potentially deeper issue. The test uses deterministic sampling parameters (temperature=0) but shows non-deterministic behavior, which could point to a bug. I've added a comment recommending an investigation into the root cause of the flakiness rather than repeatedly lowering the test's threshold, as this could hide future regressions.
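
For context, the pattern at issue looks roughly like the sketch below. This is illustrative only: the model name, the speculative_config values, and the 0.7 threshold are assumptions, not the actual test code.

```python
# Rough sketch of the flaky-test pattern (illustrative; not the actual vLLM test).
from vllm import LLM, SamplingParams

prompts = ["The capital of France is", "Speculative decoding works by"]
greedy = SamplingParams(temperature=0, max_tokens=64)  # temperature=0 => greedy sampling

# Reference run without speculation.
ref_llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct")  # model name is an assumption
ref_texts = [out.outputs[0].text for out in ref_llm.generate(prompts, greedy)]

# Run with n-gram speculative decoding enabled.
spec_llm = LLM(
    model="meta-llama/Llama-3.2-1B-Instruct",
    speculative_config={  # ngram-method config; the values are illustrative
        "method": "ngram",
        "num_speculative_tokens": 3,
        "prompt_lookup_max": 5,
        "prompt_lookup_min": 1,
    },
)
spec_texts = [out.outputs[0].text for out in spec_llm.generate(prompts, greedy)]

# With true greedy determinism every output should match; the test instead
# asserts that at least some fraction does, and this PR lowers that fraction.
match_ratio = sum(r == s for r, s in zip(ref_texts, spec_texts)) / len(prompts)
assert match_ratio >= 0.7  # the 0.7 threshold is illustrative
```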

gemini-code-assist bot commented (inline review comment, severity: high)

While lowering the threshold might fix the immediate flakiness in CI, this is the second time this has been done for this test. This pattern of repeatedly reducing the threshold is concerning as it can mask underlying bugs or performance regressions in the speculative decoding implementation.

Given that temperature=0 is used, the model's output should be deterministic. The fact that there are discrepancies between the reference and speculative outputs suggests a potential issue.

Instead of just lowering the threshold, it would be better to investigate the root cause of this non-determinism. Possible areas to investigate:

  • Are there any non-deterministic operations in the model or CUDA kernels that are not properly seeded or configured for deterministic execution?
  • Could there be floating-point precision differences that are causing the outputs to diverge?
  • Is there a subtle bug in the n-gram speculative decoding logic that causes it to not be equivalent to the reference implementation in some cases?

Addressing the root cause would lead to a more robust and reliable test.
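
As a starting point for the first bullet above, here is a minimal sketch of how one might force deterministic execution while reproducing the failure locally. These are standard PyTorch knobs; whether they cover vLLM's custom and fused CUDA kernels is exactly the open question.

```python
# Minimal sketch: force PyTorch's deterministic execution paths before the repro.
# Custom/fused kernels in vLLM may still be nondeterministic despite these settings.
import os

# cuBLAS needs this set before any CUDA context exists for deterministic GEMMs.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import torch

torch.manual_seed(0)
torch.use_deterministic_algorithms(True)   # raise an error on nondeterministic ops
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# ...then run the reference and speculative generations and diff the first
# position where the token ids diverge.
```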

Signed-off-by: wwl2755 <[email protected]>
@njhill (Member) left a comment


Thanks @wwl2755.

We should really fix the nondeterminism here though!

Perhaps we could look at running this particular test with fp32 (and flex attention if that works with spec decode?)

Or otherwise hopefully #24583 could help us!
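
If the fp32 route is taken, the change would presumably look something like the sketch below. The dtype parameter is a real LLM argument; the VLLM_ATTENTION_BACKEND value and whether FlexAttention composes with spec decode are assumptions, which is njhill's open question.

```python
# Sketch of the suggestion: run this one test in full precision, optionally
# forcing the FlexAttention backend (backend name assumed; compatibility with
# speculative decoding is unverified).
import os

os.environ["VLLM_ATTENTION_BACKEND"] = "FLEX_ATTENTION"  # assumed backend name

from vllm import LLM

spec_llm = LLM(
    model="meta-llama/Llama-3.2-1B-Instruct",  # illustrative model
    dtype="float32",                           # avoid bf16/fp16 rounding divergence
    speculative_config={"method": "ngram", "num_speculative_tokens": 3,
                        "prompt_lookup_max": 5, "prompt_lookup_min": 1},
)
```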

@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 12, 2025
@wwl2755 (Contributor Author) commented Sep 12, 2025

We should really fix the nondeterminism here though!

Totally agree with you!

Perhaps we could look at running this particular test with fp32 (and flex attention if that works with spec decode?)

Or otherwise hopefully #24583 could help us!

But actually I think the ngram proposer only runs on the CPU and shouldn't introduce any randomness. The biggest suspect I have right now is the rejection sampler.

I can open an issue to track it (#24777).
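
To illustrate why the rejection sampler is a plausible suspect: with temperature=0, acceptance should reduce to an exact argmax comparison, so any accepted token that differs from the target model's greedy choice indicates either a sampler bug or a precision effect upstream of it. The following is a hypothetical, self-contained check, not vLLM's actual sampler code:

```python
# Hypothetical sketch of greedy acceptance: with temperature=0 a draft token is
# acceptable iff it equals the target model's argmax at that position.
# This is NOT vLLM's rejection sampler implementation.
import torch

def greedy_accept(draft_tokens: torch.Tensor, target_logits: torch.Tensor) -> int:
    """Return how many leading draft tokens match the target's greedy choice.

    draft_tokens:  [num_spec_tokens] proposed by the ngram proposer
    target_logits: [num_spec_tokens, vocab_size] from the target model
    """
    target_greedy = target_logits.argmax(dim=-1)
    accepted = 0
    for draft, target in zip(draft_tokens.tolist(), target_greedy.tolist()):
        if draft != target:
            break
        accepted += 1
    return accepted

# If the real sampler accepts a token where draft != argmax(target_logits),
# either the sampler or the logits themselves (e.g. bf16 ties resolved
# differently across runs) are the source of the flakiness.
```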

@DarkLight1337 (Member) commented Sep 13, 2025

Retrying flaky v1/entrypoints/openai/test_multi_api_servers.py::test_single_completion[ibm-research/PowerMoE-3b] - RuntimeError: Server failed to start in time

This test didn't fail that much before; I wonder what changed?

@DarkLight1337 (Member): It passed this time.

@DarkLight1337 DarkLight1337 merged commit cfa3234 into vllm-project:main Sep 13, 2025
29 checks passed
@wwl2755 (Contributor Author) commented Sep 13, 2025

Retrying flaky v1/entrypoints/openai/test_multi_api_servers.py::test_single_completion[ibm-research/PowerMoE-3b] - RuntimeError: Server failed to start in time

This test didn't fail that much before; I wonder what changed?

Thanks for retrying that! I have no idea what exactly caused this. It seems something randomly went wrong during or after loading the weights. Anyway, I think it shouldn't be related to the change in this PR. 😂

@wwl2755 wwl2755 deleted the fix_ngram branch September 13, 2025 19:29
dsxsteven pushed a commit to dsxsteven/vllm_splitPR that referenced this pull request Sep 15, 2025
bbartels pushed a commit to bbartels/vllm that referenced this pull request Sep 15, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025