
vLLM backend: Model-specific GPU assignment ignored — both models loaded on GPU 0 despite config.pbtxt specifying gpus: [0] and gpus: [1] #8189


Open
shrivats1995 opened this issue May 7, 2025 · 0 comments


Description
I am trying to serve two LLMs (both meta-llama/Llama-3.2-3B) with the Triton vLLM backend, using explicit GPU assignment in config.pbtxt. Model 1 (llama-323b) is assigned gpus: [ 0 ] and model 2 (llama-323bv2) is assigned gpus: [ 1 ]. However, both models appear to be loaded on GPU 0, contrary to the configuration.

Triton Information
Triton container: tritonserver:24.03-vllm-python-py3
vLLM version: 0.5.3 (pinned in the Dockerfile below)

Are you using the Triton container or did you build it yourself?
I built the container myself from a custom Docker image based on the Triton vLLM container. Sharing the Dockerfile below:
FROM nvcr.io/nvidia/tritonserver:24.03-vllm-python-py3
ENV DEBIAN_FRONTEND=noninteractive

RUN pip install --upgrade pip

RUN pip install --ignore-installed blinker zipp

RUN pip install --upgrade \
    vllm==0.5.3 \
    torch \
    transformers \
    safetensors \
    tritonclient[all] \
    sentencepiece \
    accelerate \
    numpy \
    mlflow \
    boto3
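
For completeness, the image was built and the container started roughly as follows; the image tag, port mappings, and shared-memory size here are illustrative rather than copied verbatim from my setup:

# Build the custom image (tag is a placeholder)
docker build -t triton-vllm-custom .

# Start the container with all GPUs visible and the model repository mounted
docker run --gpus all --rm -it \
    --shm-size=2g \
    -p 8000:8000 -p 8001:8001 -p 8002:8002 \
    -v <path_to_models>/models:<path_to_models>/models \
    triton-vllm-custom bash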

To Reproduce

Sharing relevant config files below:

<path_to_models>/models/llama323b/config.pbtxt
backend: "vllm"

instance_group [
  {
    kind: KIND_GPU
    count: 1
    gpus: [ 0 ]
  }
]

model_transaction_policy {
  decoupled: True
}

<path_to_models>/models/llama323bv2/config.pbtxt
backend: "vllm"

instance_group [
  {
    kind: KIND_GPU
    count: 1
    gpus: [ 1 ]
  }
]

model_transaction_policy {
  decoupled: True
}
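
As an extra sanity check (not part of the original setup), the instance groups Triton actually parsed can be queried from the model configuration endpoint once the server is up. This assumes the default HTTP port 8000 and the model names as they appear in the server logs below; adjust to whatever names the repository actually exposes:

curl -s localhost:8000/v2/models/llama-323b/config | python3 -m json.tool
curl -s localhost:8000/v2/models/llama-323bv2/config | python3 -m json.tool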

<path_to_models>/models/llama323b/1/model.json
"model": "meta-llama/Llama-3.2-3B",
"tokenizer": "meta-llama/Llama-3.2-3B",
"trust_remote_code": true,
"dtype": "float16",
"max_model_len": 8192,
"rope_scaling": {
"type": "extended",
"factor": 8.0
},
"gpu_memory_utilization": 0.7,
"enforce_eager": true,
"disable_log_requests": true
}
<path_to_models>/models/llama323bv2/1/model.json
{
  "model": "meta-llama/Llama-3.2-3B",
  "tokenizer": "meta-llama/Llama-3.2-3B",
  "trust_remote_code": true,
  "dtype": "float16",
  "max_model_len": 8192,
  "rope_scaling": {
    "type": "extended",
    "factor": 8.0
  },
  "gpu_memory_utilization": 0.7,
  "enforce_eager": true,
  "disable_log_requests": true
}

Command used to run the Triton server from a shell inside the Docker container:

tritonserver --model-repository=<path_to_model>/models
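
The logs below were captured with default logging; a more detailed view of instance initialization can be obtained with the standard --log-verbose flag, e.g.:

tritonserver --model-repository=<path_to_model>/models --log-verbose=1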

Observed Logs:

The Triton logs seem to indicate that the GPU assignment is correct:
TRITONBACKEND_ModelInstanceInitialize: llama-323bv2_0_0 (GPU device 1)
TRITONBACKEND_ModelInstanceInitialize: llama-323b_0_0 (GPU device 0)

Running nvidia-smi, however, I see both model instances (the two triton_python_backend_stub processes) on GPU 0:

+------------------------------------------------------------------------------------------+
| Processes:                                                                                |
|  GPU   GI   CI        PID   Type   Process name                               GPU Memory |
|        ID   ID                                                                Usage      |
|==========================================================================================|
|    0   N/A  N/A    3104247      C   tritonserver                                  370MiB |
|    0   N/A  N/A    3104798      C   ...s/python/triton_python_backend_stub      14486MiB |
|    0   N/A  N/A    3104887      C   ...s/python/triton_python_backend_stub      14112MiB |
|    1   N/A  N/A    3104247      C   tritonserver                                  370MiB |
|    2   N/A  N/A    3104247      C   tritonserver                                  370MiB |
+------------------------------------------------------------------------------------------+
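
To map each process to a physical GPU more directly, checks like the following can be run on the host. The PIDs are the stub processes from the table above; the exact commands are a suggested diagnostic rather than part of my original run:

# Show which GPU (by UUID) each compute process is attached to
nvidia-smi --query-compute-apps=gpu_uuid,pid,process_name,used_gpu_memory --format=csv

# Check whether CUDA_VISIBLE_DEVICES is set for the two stub processes
for pid in 3104798 3104887; do
  echo "PID $pid:"
  tr '\0' '\n' < /proc/$pid/environ | grep CUDA_VISIBLE_DEVICES || echo "  (not set)"
done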

Expected behavior

I am expecting model llama-323b to run on GPU 0 and model llama-323bv2 to run on GPU 1.
