[Usage]: CUDA 12.8 Docker image, vLLM 0.11.0: NCCL error "unhandled cuda error" when launching the model #26786

@ooodwbooo

Description

With a single GPU, vLLM starts up normally; with two GPUs (--tensor-parallel-size 2) it fails at startup with an NCCL error. I first encountered this when upgrading from v0.10.1.1 to v0.10.2: v0.10.1.1 works correctly, while v0.10.2 fails on launch, and the error persists in the v0.11.0 image used here.

The Docker Compose file, error log, and environment information are below.

Your current environment

# vllm collect-env   
INFO 10-14 19:07:58 [__init__.py:216] Automatically detected platform cuda.
Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version                : Could not collect
CMake version                : version 4.1.0
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.8.0+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.11 (main, Jun  4 2025, 08:56:18) [GCC 11.4.0] (64-bit runtime)
Python platform              : Linux-6.6.87.2-microsoft-standard-WSL2-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.8.93
CUDA_MODULE_LOADING set to   : LAZY
GPU models and configuration : 
GPU 0: NVIDIA GeForce RTX 4090
GPU 1: NVIDIA GeForce RTX 4090

Nvidia driver version        : 571.96
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        46 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               40
On-line CPU(s) list:                  0-39
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
CPU family:                           6
Model:                                85
Thread(s) per core:                   2
Core(s) per socket:                   20
Socket(s):                            1
Stepping:                             4
BogoMIPS:                             4788.75
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch pti ssbd ibrs ibpb stibp fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves md_clear flush_l1d arch_capabilities
Hypervisor vendor:                    Microsoft
Virtualization type:                  full
L1d cache:                            640 KiB (20 instances)
L1i cache:                            640 KiB (20 instances)
L2 cache:                             20 MiB (20 instances)
L3 cache:                             27.5 MiB (1 instance)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-39
Vulnerability Gather data sampling:   Unknown: Dependent on hypervisor status
Vulnerability Itlb multihit:          KVM: Mitigation: VMX unsupported
Vulnerability L1tf:                   Mitigation; PTE Inversion
Vulnerability Mds:                    Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown:               Mitigation; PTI
Vulnerability Mmio stale data:        Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Mitigation; IBRS
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; IBRS; IBPB conditional; STIBP conditional; RSB filling; PBRSB-eIBRS Not affected; BHI SW loop, KVM SW loop
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Mitigation; Clear CPU buffers; SMT Host state unknown

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.3.1
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.14.1
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-ml-py==12.575.51
[pip3] nvidia-nccl-cu12==2.27.3
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvshmem-cu12==3.4.5
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] pynvml==12.0.0
[pip3] pyzmq==27.1.0
[pip3] torch==2.8.0+cu128
[pip3] torchaudio==2.8.0+cu128
[pip3] torchvision==0.23.0+cu128
[pip3] transformers==4.57.0
[pip3] triton==3.4.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.11.0
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS                             N/A
GPU1    SYS      X                              N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

==============================
     Environment Variables
==============================
LD_LIBRARY_PATH=/usr/local/cuda/lib64
CUDA_VERSION=12.8.1
NVIDIA_REQUIRE_CUDA=cuda>=12.8 brand=unknown,driver>=470,driver<471 brand=grid,driver>=470,driver<471 brand=tesla,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=vapps,driver>=470,driver<471 brand=vpc,driver>=470,driver<471 brand=vcs,driver>=470,driver<471 brand=vws,driver>=470,driver<471 brand=cloudgaming,driver>=470,driver<471 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551 brand=unknown,driver>=560,driver<561 brand=grid,driver>=560,driver<561 brand=tesla,driver>=560,driver<561 brand=nvidia,driver>=560,driver<561 brand=quadro,driver>=560,driver<561 brand=quadrortx,driver>=560,driver<561 brand=nvidiartx,driver>=560,driver<561 brand=vapps,driver>=560,driver<561 brand=vpc,driver>=560,driver<561 brand=vcs,driver>=560,driver<561 brand=vws,driver>=560,driver<561 brand=cloudgaming,driver>=560,driver<561 brand=unknown,driver>=565,driver<566 brand=grid,driver>=565,driver<566 brand=tesla,driver>=565,driver<566 brand=nvidia,driver>=565,driver<566 brand=quadro,driver>=565,driver<566 brand=quadrortx,driver>=565,driver<566 brand=nvidiartx,driver>=565,driver<566 brand=vapps,driver>=565,driver<566 brand=vpc,driver>=565,driver<566 brand=vcs,driver>=565,driver<566 brand=vws,driver>=565,driver<566 brand=cloudgaming,driver>=565,driver<566
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NVIDIA_PRODUCT_NAME=CUDA
VLLM_USAGE_SOURCE=production-docker-image
CUDA_HOME=/usr/local/cuda
NVIDIA_VISIBLE_DEVICES=all
NCCL_VERSION=2.25.1-1
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_root
VLLM_WORKER_MULTIPROC_METHOD=spawn
CUDA_MODULE_LOADING=LAZY
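
Note that two NCCL builds are present in the container: the CUDA base image's 2.25.1-1 (NCCL_VERSION above) and pip's nvidia-nccl-cu12==2.27.3, which is the one vLLM reports using in the log below ("Found nccl from library libnccl.so.2"). A small sketch (illustrative, Linux-only, not part of vLLM) to confirm which libnccl.so.2 the dynamic loader actually resolves:

# which_nccl.py -- show which libnccl.so.2 gets mapped into the process
import ctypes

try:
    ctypes.CDLL("libnccl.so.2")  # same soname vLLM's pynccl wrapper dlopens
except OSError as e:
    print(f"loader could not resolve libnccl.so.2: {e}")

# /proc/self/maps lists every file mapped into this process
with open("/proc/self/maps") as maps:
    paths = sorted({line.split()[-1] for line in maps if "libnccl" in line})
print("\n".join(paths) or "no libnccl.so.2 mapped")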

How would you like to use vllm

services:

  vllm-openai-8000:
    runtime: nvidia

    # use all GPUs
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities:
                - gpu
    command: >
      --model /models/safetensors/cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit
      --served-model-name Qwen/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit
      --gpu-memory-utilization 0.8 
      --kv-cache-dtype auto 
      --enable-expert-parallel  
      --tensor-parallel-size 2
      --enable-auto-tool-choice 
      --tool-call-parser hermes
    environment:
      - NCCL_DEBUG=INFO

    volumes:
      - ./models/.cache/huggingface:/root/.cache/huggingface
      - ./models/safetensors:/models/safetensors
    dns:
      - 8.8.8.8
    ports:
      - 8001:8000
    ipc: host
    image: vllm/vllm-openai:v0.11.0
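
The topology above shows the two 4090s linked only via SYS (PCIe plus the SMP interconnect), and GeForce cards under WSL2 generally lack P2P support. As a diagnostic sketch (these settings are assumptions to narrow down the failing transport, not a confirmed fix), the Compose environment section can be extended to rule out the transports NCCL would otherwise try:

    environment:
      - NCCL_DEBUG=INFO
      # Diagnostic settings (assumed, not confirmed fixes); remove again
      # once the failing transport has been identified.
      - NCCL_P2P_DISABLE=1    # skip peer-to-peer CUDA copies
      - NCCL_SHM_DISABLE=1    # skip the shared-memory transport
      - NCCL_CUMEM_ENABLE=0   # disable cuMem allocation (NCCL >= 2.18)

With NCCL_DEBUG=INFO already set, the worker log should also name the exact CUDA call that fails.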

Error log

INFO 10-14 01:25:52 [__init__.py:216] Automatically detected platform cuda.
(APIServer pid=1) INFO 10-14 01:26:02 [api_server.py:1839] vLLM API server version 0.11.0
(APIServer pid=1) INFO 10-14 01:26:02 [utils.py:233] non-default args: {'enable_auto_tool_choice': True, 'tool_call_parser': 'hermes', 'model': '/models/safetensors/cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit', 'served_model_name': ['Qwen/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit'], 'tensor_parallel_size': 2, 'enable_expert_parallel': True, 'gpu_memory_utilization': 0.8}
(APIServer pid=1) `torch_dtype` is deprecated! Use `dtype` instead!
(APIServer pid=1) INFO 10-14 01:26:02 [model.py:547] Resolved architecture: Qwen3NextForCausalLM
(APIServer pid=1) INFO 10-14 01:26:02 [model.py:1510] Using max model len 262144
(APIServer pid=1) INFO 10-14 01:26:09 [scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=1) INFO 10-14 01:26:09 [config.py:297] Hybrid or mamba-based model detected: disabling prefix caching since it is not yet supported.
(APIServer pid=1) INFO 10-14 01:26:09 [config.py:308] Hybrid or mamba-based model detected: setting cudagraph mode to FULL_AND_PIECEWISE in order to optimize performance.
(APIServer pid=1) INFO 10-14 01:26:10 [config.py:376] Setting attention block size to 544 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=1) INFO 10-14 01:26:10 [config.py:397] Padding mamba page size by 1.49% to ensure that mamba page size and attention page size are exactly equal.
INFO 10-14 01:26:15 [__init__.py:216] Automatically detected platform cuda.
(EngineCore_DP0 pid=112) INFO 10-14 01:26:20 [core.py:644] Waiting for init message from front-end.
(EngineCore_DP0 pid=112) INFO 10-14 01:26:20 [core.py:77] Initializing a V1 LLM engine (v0.11.0) with config: model='/models/safetensors/cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit', speculative_config=None, tokenizer='/models/safetensors/cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit, enable_prefix_caching=False, chunked_prefill_enabled=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention","vllm.sparse_attn_indexer"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":[2,1],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
(EngineCore_DP0 pid=112) WARNING 10-14 01:26:20 [multiproc_executor.py:720] Reducing Torch parallelism from 20 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_DP0 pid=112) INFO 10-14 01:26:20 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 16777216, 10, 'psm_078e52a0'), local_subscribe_addr='ipc:///tmp/1b3233d5-c29c-422a-9411-855ebf41dbc4', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 10-14 01:26:24 [__init__.py:216] Automatically detected platform cuda.
INFO 10-14 01:26:24 [__init__.py:216] Automatically detected platform cuda.
WARNING 10-14 01:26:30 [interface.py:381] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
WARNING 10-14 01:26:30 [interface.py:381] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
W1014 01:26:30.387000 198 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W1014 01:26:30.387000 198 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W1014 01:26:30.387000 199 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W1014 01:26:30.387000 199 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
INFO 10-14 01:26:32 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_577a0fc3'), local_subscribe_addr='ipc:///tmp/cca0ec69-4bd4-4f5d-87a4-5a80fdd2446f', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 10-14 01:26:32 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_470c596a'), local_subscribe_addr='ipc:///tmp/163267cc-b2ab-4529-b91c-6f26a434404a', remote_subscribe_addr=None, remote_addr_ipv6=False)
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
INFO 10-14 01:26:33 [__init__.py:1384] Found nccl from library libnccl.so.2
INFO 10-14 01:26:33 [__init__.py:1384] Found nccl from library libnccl.so.2
INFO 10-14 01:26:33 [pynccl.py:103] vLLM is using nccl==2.27.3
INFO 10-14 01:26:33 [pynccl.py:103] vLLM is using nccl==2.27.3
ERROR 10-14 01:26:34 [multiproc_executor.py:597] WorkerProc failed to start.
ERROR 10-14 01:26:34 [multiproc_executor.py:597] Traceback (most recent call last):
ERROR 10-14 01:26:34 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 571, in worker_main
ERROR 10-14 01:26:34 [multiproc_executor.py:597]     worker = WorkerProc(*args, **kwargs)
ERROR 10-14 01:26:34 [multiproc_executor.py:597]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-14 01:26:34 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 430, in __init__
ERROR 10-14 01:26:34 [multiproc_executor.py:597]     self.worker.init_device()
ERROR 10-14 01:26:34 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 259, in init_device
ERROR 10-14 01:26:34 [multiproc_executor.py:597]     self.worker.init_device()  # type: ignore
ERROR 10-14 01:26:34 [multiproc_executor.py:597]     ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-14 01:26:34 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 169, in init_device
ERROR 10-14 01:26:34 [multiproc_executor.py:597]     init_worker_distributed_environment(self.vllm_config, self.rank,
ERROR 10-14 01:26:34 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 705, in init_worker_distributed_environment
ERROR 10-14 01:26:34 [multiproc_executor.py:597]     ensure_model_parallel_initialized(
ERROR 10-14 01:26:34 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 1228, in ensure_model_parallel_initialized
ERROR 10-14 01:26:34 [multiproc_executor.py:597]     initialize_model_parallel(tensor_model_parallel_size,
ERROR 10-14 01:26:34 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 1152, in initialize_model_parallel
ERROR 10-14 01:26:34 [multiproc_executor.py:597]     _TP = init_model_parallel_group(group_ranks,
ERROR 10-14 01:26:34 [multiproc_executor.py:597]           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 10-14 01:26:34 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 924, in init_model_parallel_group
ERROR 10-14 01:26:34 [multiproc_executor.py:597]     return GroupCoordinator(
ERROR 10-14 01:26:34 [multiproc_executor.py:597]            ^^^^^^^^^^^^^^^^^
ERROR 10-14 01:26:34 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py", line 255, in __init__
ERROR 10-14 01:26:34 [multiproc_executor.py:597]     self.device_communicator = device_comm_cls(
ERROR 10-14 01:26:34 [multiproc_executor.py:597]                                ^^^^^^^^^^^^^^^^
ERROR 10-14 01:26:34 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/cuda_communicator.py", line 57, in __init__
ERROR 10-14 01:26:34 [multiproc_executor.py:597]     self.pynccl_comm = PyNcclCommunicator(
ERROR 10-14 01:26:34 [multiproc_executor.py:597]                        ^^^^^^^^^^^^^^^^^^^
ERROR 10-14 01:26:34 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 139, in __init__
ERROR 10-14 01:26:34 [multiproc_executor.py:597]     self.all_reduce(data)
ERROR 10-14 01:26:34 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/pynccl.py", line 162, in all_reduce
ERROR 10-14 01:26:34 [multiproc_executor.py:597]     self.nccl.ncclAllReduce(buffer_type(in_tensor.data_ptr()),
ERROR 10-14 01:26:34 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 337, in ncclAllReduce
ERROR 10-14 01:26:34 [multiproc_executor.py:597]     self.NCCL_CHECK(self._funcs["ncclAllReduce"](sendbuff, recvbuff, count,
ERROR 10-14 01:26:34 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 291, in NCCL_CHECK
ERROR 10-14 01:26:34 [multiproc_executor.py:597]     raise RuntimeError(f"NCCL error: {error_str}")
ERROR 10-14 01:26:34 [multiproc_executor.py:597] RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
INFO 10-14 01:26:34 [multiproc_executor.py:558] Parent process exited, terminating worker
INFO 10-14 01:26:34 [multiproc_executor.py:558] Parent process exited, terminating worker
(EngineCore_DP0 pid=112) ERROR 10-14 01:26:38 [core.py:708] EngineCore failed to start.
(EngineCore_DP0 pid=112) ERROR 10-14 01:26:38 [core.py:708] Traceback (most recent call last):
(EngineCore_DP0 pid=112) ERROR 10-14 01:26:38 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=112) ERROR 10-14 01:26:38 [core.py:708]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=112) ERROR 10-14 01:26:38 [core.py:708]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=112) ERROR 10-14 01:26:38 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 498, in __init__
(EngineCore_DP0 pid=112) ERROR 10-14 01:26:38 [core.py:708]     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=112) ERROR 10-14 01:26:38 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 83, in __init__
(EngineCore_DP0 pid=112) ERROR 10-14 01:26:38 [core.py:708]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=112) ERROR 10-14 01:26:38 [core.py:708]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=112) ERROR 10-14 01:26:38 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_DP0 pid=112) ERROR 10-14 01:26:38 [core.py:708]     self._init_executor()
(EngineCore_DP0 pid=112) ERROR 10-14 01:26:38 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 106, in _init_executor
(EngineCore_DP0 pid=112) ERROR 10-14 01:26:38 [core.py:708]     self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=112) ERROR 10-14 01:26:38 [core.py:708]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=112) ERROR 10-14 01:26:38 [core.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 509, in wait_for_ready
(EngineCore_DP0 pid=112) ERROR 10-14 01:26:38 [core.py:708]     raise e from None
(EngineCore_DP0 pid=112) ERROR 10-14 01:26:38 [core.py:708] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(EngineCore_DP0 pid=112) Process EngineCore_DP0:
(EngineCore_DP0 pid=112) Traceback (most recent call last):
(EngineCore_DP0 pid=112)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=112)     self.run()
(EngineCore_DP0 pid=112)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=112)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=112)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 712, in run_engine_core
(EngineCore_DP0 pid=112)     raise e
(EngineCore_DP0 pid=112)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=112)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=112)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=112)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 498, in __init__
(EngineCore_DP0 pid=112)     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=112)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 83, in __init__
(EngineCore_DP0 pid=112)     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=112)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=112)   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_DP0 pid=112)     self._init_executor()
(EngineCore_DP0 pid=112)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 106, in _init_executor
(EngineCore_DP0 pid=112)     self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=112)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=112)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 509, in wait_for_ready
(EngineCore_DP0 pid=112)     raise e from None
(EngineCore_DP0 pid=112) Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "<frozen runpy>", line 198, in _run_module_as_main
(APIServer pid=1)   File "<frozen runpy>", line 88, in _run_code
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1953, in <module>
(APIServer pid=1)     uvloop.run(run_server(args))
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
(APIServer pid=1)     return __asyncio.run(
(APIServer pid=1)            ^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1)     return runner.run(main)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1)     return self._loop.run_until_complete(task)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
(APIServer pid=1)     return await main
(APIServer pid=1)            ^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1884, in run_server
(APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1902, in run_server_worker
(APIServer pid=1)     async with build_async_engine_client(
(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 180, in build_async_engine_client
(APIServer pid=1)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 225, in build_async_engine_client_from_engine_args
(APIServer pid=1)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 1572, in inner
(APIServer pid=1)     return fn(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 207, in from_vllm_config
(APIServer pid=1)     return cls(
(APIServer pid=1)            ^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 134, in __init__
(APIServer pid=1)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 102, in make_async_mp_client
(APIServer pid=1)     return AsyncMPClient(*client_args)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 769, in __init__
(APIServer pid=1)     super().__init__(
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 448, in __init__
(APIServer pid=1)     with launch_core_engines(vllm_config, executor_class,
(APIServer pid=1)          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1)     next(self.gen)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 732, in launch_core_engines
(APIServer pid=1)     wait_for_engine_startup(
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 785, in wait_for_engine_startup
(APIServer pid=1)     raise RuntimeError("Engine core initialization failed. "
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
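
To separate a vLLM regression from an NCCL/driver/WSL2 problem, a minimal two-rank all-reduce (a sketch; the filename and launch command are illustrative) can be run inside the same container:

# nccl_smoke_test.py -- run inside the container with:
#   torchrun --nproc-per-node=2 nccl_smoke_test.py
import os

import torch
import torch.distributed as dist

def main() -> None:
    # torchrun supplies RANK/WORLD_SIZE/LOCAL_RANK and the rendezvous env vars.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # One float per rank; after a SUM all-reduce every rank should print 1.0 (0 + 1).
    x = torch.ones(1, device="cuda") * local_rank
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}: all_reduce -> {x.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If this also dies with an unhandled cuda error, the problem sits below vLLM (NCCL, driver, or WSL2); if it passes, the regression is more likely in how vLLM v0.10.2 and later initialize their NCCL communicators.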

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
