Conversation


@SageMoore SageMoore commented Sep 24, 2025

Purpose

Currently, vLLM outputs the following warning when capturing DBO cudagraphs:

/home/sage/git/nm-vllm/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:1166: UserWarning: Attempting to run cuBLAS, but there was no current CUDA context! Attempting to set the primary context... (Triggered internally at /pytorch/aten/src/ATen/cuda/CublasHandlePool.cpp:179.)
(EngineCore_DP0 pid=3571065)   return torch._C._cuda_getCurrentBlasHandle()
(EngineCore_DP1 pid=3571066) /home/sage/git/nm-vllm/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:1166: UserWarning: Attempting to run cuBLAS, but there was no current CUDA context! Attempting to set the primary context... (Triggered internally at /pytorch/aten/src/ATen/cuda/CublasHandlePool.cpp:179.)
(EngineCore_DP1 pid=3571066)   return torch._C._cuda_getCurrentBlasHandle()

This warning is completely benign, so we should suppress it.
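
For context, the warning comes from `torch.cuda.current_blas_handle()` (the call visible in the traceback above): a freshly spawned Python thread has no current CUDA context, so the first cuBLAS handle lookup on that thread warns and falls back to the primary context. Below is a minimal sketch of the behavior, assuming a CUDA-capable machine; it is illustrative only and not part of this PR.

```python
import threading
import torch

torch.cuda.init()  # the primary context now exists on the main thread

def grab_handle(set_device_first: bool) -> None:
    if set_device_first:
        # Makes the primary context current on this thread.
        torch.cuda.set_device(0)
    # This is the call from the traceback above; without a current context
    # on this thread it emits the "no current CUDA context" UserWarning.
    torch.cuda.current_blas_handle()

# Emits the warning: the fresh thread has no current CUDA context.
t1 = threading.Thread(target=grab_handle, args=(False,))
t1.start(); t1.join()

# Silent: set_device() established a current context first.
t2 = threading.Thread(target=grab_handle, args=(True,))
t2.start(); t2.join()
```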

Test Plan

Spun up a vLLM server with DBO enabled and confirmed that the warning no longer appears.

Test Result

VLLM_ALL2ALL_BACKEND=deepep_low_latency vllm serve --model="deepseek-ai/DeepSeek-V2-Lite" --data-parallel-size 2 --enable-expert-parallel --gpu-memory-utilization 0.75 --enable-dbo

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3567|±  |0.0277|
|     |       |strict-match    |     5|exact_match|↑  |0.3533|±  |0.0276|

Signed-off-by: Sage Moore <[email protected]>

@gemini-code-assist bot left a comment


Code Review

This pull request addresses a benign cuBLAS warning that occurs during CUDA graph capture with DBO. The root cause is correctly identified as a missing CUDA context in the worker threads. The fix involves storing the device in the UBatchWrapper and explicitly setting it at the beginning of the _capture_ubatch_thread using torch.cuda.set_device(). This ensures a CUDA context is established before any cuBLAS operations are attempted, effectively resolving the warning. The change is clean, well-targeted, and correctly implemented. I have no further suggestions for improvement.
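
As a rough illustration of the pattern the review describes: `UBatchWrapper` and `_capture_ubatch_thread` are the identifiers named above, but the body below is a simplified sketch under those assumptions, not the PR's actual code.

```python
import threading
import torch

class UBatchWrapperSketch:
    """Illustrative stand-in for vLLM's UBatchWrapper (simplified)."""

    def __init__(self, device: torch.device):
        # Store the device so capture threads can bind a CUDA context
        # to it later, as the fix described above does.
        self.device = device

    def _capture_ubatch_thread(self) -> None:
        # A newly spawned thread has no current CUDA context; setting the
        # device first ensures any cuBLAS call made during cudagraph
        # capture finds one, so the UserWarning is never triggered.
        torch.cuda.set_device(self.device)
        # ... cudagraph capture work happens here in the real code ...

wrapper = UBatchWrapperSketch(torch.device("cuda:0"))
t = threading.Thread(target=wrapper._capture_ubatch_thread)
t.start()
t.join()
```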

@tlrmchlsmth tlrmchlsmth added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 24, 2025
@tlrmchlsmth tlrmchlsmth enabled auto-merge (squash) September 24, 2025 16:53
@tlrmchlsmth tlrmchlsmth merged commit f84a472 into vllm-project:main Sep 24, 2025
55 checks passed
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025

Labels

ready ONLY add when PR is ready to merge/full CI is needed v1
