
vLLM backend: Model-specific GPU assignment ignored — both models loaded on GPU 0 despite config.pbtxt specifying gpus: [0] and gpus: [1] #8189


Open
shrivats1995 opened this issue May 7, 2025 · 0 comments


Description
I am trying to serve two LLMs (both meta-llama/Llama-3.2-3B) with the Triton vLLM backend, using explicit GPU assignment in config.pbtxt. Model 1 (llama-323b) is assigned gpus: [ 0 ] and model 2 (llama-323bv2) is assigned gpus: [ 1 ]. However, both models appear to be loaded on GPU 0, contrary to the configuration.

Triton Information
Triton container: tritonserver:24.03-vllm-python-py3
vLLM version: 0.5.3 (pinned in the Dockerfile below)

Are you using the Triton container or did you build it yourself?
I built the container myself from a custom Docker image based on the Triton vLLM container. Sharing the Dockerfile below:
FROM nvcr.io/nvidia/tritonserver:24.03-vllm-python-py3
ENV DEBIAN_FRONTEND=noninteractive

RUN pip install --upgrade pip

RUN pip install --ignore-installed blinker zipp

RUN pip install --upgrade \
    vllm==0.5.3 \
    torch \
    transformers \
    safetensors \
    tritonclient[all] \
    sentencepiece \
    accelerate \
    numpy \
    mlflow \
    boto3
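
For completeness, the image was built and the container started roughly as follows; the image tag, port mappings, and shared-memory size here are illustrative rather than copied verbatim from my setup:

# Build the custom image (tag is a placeholder)
docker build -t triton-vllm-custom .

# Start the container with all GPUs visible and the model repository mounted
docker run --gpus all --rm -it \
    --shm-size=2g \
    -p 8000:8000 -p 8001:8001 -p 8002:8002 \
    -v <path_to_models>/models:<path_to_models>/models \
    triton-vllm-custom bash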

To Reproduce

Sharing relevant config files below:

<path_to_models>/models/llama323b/config.pbtxt
backend: "vllm"

instance_group [
  {
    kind: KIND_GPU
    count: 1
    gpus: [ 0 ]
  }
]

model_transaction_policy {
  decoupled: True
}

<path_to_models>/models/llama323bv2/config.pbtxt
backend: "vllm"

instance_group [
  {
    kind: KIND_GPU
    count: 1
    gpus: [ 1 ]
  }
]

model_transaction_policy {
  decoupled: True
}
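
As an extra sanity check (not part of the original setup), the instance groups Triton actually parsed can be queried from the model configuration endpoint once the server is up. This assumes the default HTTP port 8000 and the model names as they appear in the server logs below; adjust to whatever names the repository actually exposes:

curl -s localhost:8000/v2/models/llama-323b/config | python3 -m json.tool
curl -s localhost:8000/v2/models/llama-323bv2/config | python3 -m json.tool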

<path_to_models>/models/llama323b/1/model.json
"model": "meta-llama/Llama-3.2-3B",
"tokenizer": "meta-llama/Llama-3.2-3B",
"trust_remote_code": true,
"dtype": "float16",
"max_model_len": 8192,
"rope_scaling": {
"type": "extended",
"factor": 8.0
},
"gpu_memory_utilization": 0.7,
"enforce_eager": true,
"disable_log_requests": true
}
<path_to_models>/models/llama323bv2/1/model.json
{
  "model": "meta-llama/Llama-3.2-3B",
  "tokenizer": "meta-llama/Llama-3.2-3B",
  "trust_remote_code": true,
  "dtype": "float16",
  "max_model_len": 8192,
  "rope_scaling": {
    "type": "extended",
    "factor": 8.0
  },
  "gpu_memory_utilization": 0.7,
  "enforce_eager": true,
  "disable_log_requests": true
}

Command used to run the Triton server from a shell inside the Docker container:

tritonserver --model-repository=<path_to_model>/models
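
The logs below were captured with default logging; a more detailed view of instance initialization can be obtained with the standard --log-verbose flag, e.g.:

tritonserver --model-repository=<path_to_model>/models --log-verbose=1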

Observed Logs:

The Triton logs seem to indicate that the GPU assignment is correct:
TRITONBACKEND_ModelInstanceInitialize: llama-323bv2_0_0 (GPU device 1)
TRITONBACKEND_ModelInstanceInitialize: llama-323b_0_0 (GPU device 0)

Running nvidia-smi, however, I see both model instances (the two triton_python_backend_stub processes) on GPU 0:

+------------------------------------------------------------------------------------------+
| Processes:                                                                                |
|  GPU   GI   CI        PID   Type   Process name                               GPU Memory |
|        ID   ID                                                                Usage      |
|==========================================================================================|
|    0   N/A  N/A    3104247      C   tritonserver                                  370MiB |
|    0   N/A  N/A    3104798      C   ...s/python/triton_python_backend_stub      14486MiB |
|    0   N/A  N/A    3104887      C   ...s/python/triton_python_backend_stub      14112MiB |
|    1   N/A  N/A    3104247      C   tritonserver                                  370MiB |
|    2   N/A  N/A    3104247      C   tritonserver                                  370MiB |
+------------------------------------------------------------------------------------------+
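
To map each process to a physical GPU more directly, checks like the following can be run on the host. The PIDs are the stub processes from the table above; the exact commands are a suggested diagnostic rather than part of my original run:

# Show which GPU (by UUID) each compute process is attached to
nvidia-smi --query-compute-apps=gpu_uuid,pid,process_name,used_gpu_memory --format=csv

# Check whether CUDA_VISIBLE_DEVICES is set for the two stub processes
for pid in 3104798 3104887; do
  echo "PID $pid:"
  tr '\0' '\n' < /proc/$pid/environ | grep CUDA_VISIBLE_DEVICES || echo "  (not set)"
done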

Expected behavior

I am expecting model llama-323b to run on GPU 0 and model llama-323bv2 to run on GPU 1.
