Description
I am trying to serve two LLMs (both meta-llama/Llama-3.2-3B) with the Triton vLLM backend, using explicit GPU assignment in config.pbtxt. Model 1 (llama-323b) is assigned gpus: [0] and model 2 (llama-323bv2) is assigned gpus: [1]. However, both models appear to be loaded on GPU 0, contrary to the configuration.
Triton Information
What version of Triton are you using?
tritonserver:24.03-vllm-python-py3 (vLLM version 0.5.3)
Are you using the Triton container or did you build it yourself?
I built the container myself from a custom Docker image. Sharing the Dockerfile below:
FROM nvcr.io/nvidia/tritonserver:24.03-vllm-python-py3
ENV DEBIAN_FRONTEND=noninteractive
RUN pip install --upgrade pip
RUN pip install --ignore-installed blinker zipp
RUN pip install --upgrade \
    vllm==0.5.3 \
    torch \
    transformers \
    safetensors \
    tritonclient[all] \
    sentencepiece \
    accelerate \
    numpy \
    mlflow \
    boto3
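For reference, a minimal sketch of how such a container can be built and started (image tag and mount path are placeholders, not copied from my setup); --gpus all makes both GPUs visible inside the container so that per-model placement is left to Triton:
docker build -t triton-vllm-custom .    # hypothetical image tag
docker run --gpus all --rm -it \
    -v <path_to_models>:<path_to_models> \
    triton-vllm-custom bash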
To Reproduce
Sharing relevant config files below:
<path_to_models>/models/llama323b/config.pbtxt
backend: "vllm"
instance_group [
{
kind: KIND_GPU
count: 1
gpus: [ 0 ]
}
]
model_transaction_policy {
decoupled: True
}
<path_to_models>/models/llama323bv2/config.pbtxt
backend: "vllm"
instance_group [
{
kind: KIND_GPU
count: 1
gpus: [ 1 ]
}
]
model_transaction_policy {
decoupled: True
}
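Once the server is up, the parsed instance_group can be double-checked through Triton's model configuration endpoint (a generic sanity check, not specific to this issue; adjust the model names to whatever the server log reports):
# Query the configuration Triton actually loaded for each model
curl -s localhost:8000/v2/models/llama323b/config | python3 -m json.tool
curl -s localhost:8000/v2/models/llama323bv2/config | python3 -m json.tool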
<path_to_models>/models/llama323b/1/model.json
"model": "meta-llama/Llama-3.2-3B",
"tokenizer": "meta-llama/Llama-3.2-3B",
"trust_remote_code": true,
"dtype": "float16",
"max_model_len": 8192,
"rope_scaling": {
"type": "extended",
"factor": 8.0
},
"gpu_memory_utilization": 0.7,
"enforce_eager": true,
"disable_log_requests": true
}
<path_to_models>/models/llama323bv2/1/model.json
{
"model": "meta-llama/Llama-3.2-3B",
"tokenizer": "meta-llama/Llama-3.2-3B",
"trust_remote_code": true,
"dtype": "float16",
"max_model_len": 8192,
"rope_scaling": {
"type": "extended",
"factor": 8.0
},
"gpu_memory_utilization": 0.7,
"enforce_eager": true,
"disable_log_requests": true
}
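Both model.json files can be validated quickly before starting the server (plain Python standard library, nothing vLLM-specific):
# Fails loudly if either file is not valid JSON
python3 -m json.tool <path_to_models>/models/llama323b/1/model.json
python3 -m json.tool <path_to_models>/models/llama323bv2/1/model.json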
Command used to run the Triton server from a shell inside the Docker container:
tritonserver --model-repository=<path_to_model>/models
Observed Logs:
The Triton logs seem to indicate the GPU assignment is correct:
TRITONBACKEND_ModelInstanceInitialize: llama-323bv2_0_0 (GPU device 1)
TRITONBACKEND_ModelInstanceInitialize: llama-323b_0_0 (GPU device 0)
Running nvidia-smi, I see both models running on GPU 0:
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 3104247 C tritonserver 370MiB |
| 0 N/A N/A 3104798 C ...s/python/triton_python_backend_stub 14486MiB |
| 0 N/A N/A 3104887 C ...s/python/triton_python_backend_stub 14112MiB |
| 1 N/A N/A 3104247 C tritonserver 370MiB |
| 2 N/A N/A 3104247 C tritonserver 370MiB |
+---------------------------------------------------------------------------------------+
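To cross-check which device each python backend stub was actually given, the per-process view from nvidia-smi and the environment of the stub PIDs listed above can be inspected (generic diagnostics; the PIDs are the ones from the table):
# Per-process GPU mapping as CSV
nvidia-smi --query-compute-apps=pid,process_name,used_memory,gpu_uuid --format=csv
# CUDA_VISIBLE_DEVICES seen by each triton_python_backend_stub process
tr '\0' '\n' < /proc/3104798/environ | grep CUDA_VISIBLE_DEVICES
tr '\0' '\n' < /proc/3104887/environ | grep CUDA_VISIBLE_DEVICES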
Expected behavior
I expect model llama-323b to run on GPU 0 and model llama-323bv2 to run on GPU 1.
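As a temporary workaround (a sketch on my side, not an official recommendation), the placement can be forced by running one Triton instance per GPU and pinning visibility with CUDA_VISIBLE_DEVICES; the split repositories models_gpu0/models_gpu1 and the port numbers are hypothetical:
# One server per GPU; each process only sees the device it is pinned to
CUDA_VISIBLE_DEVICES=0 tritonserver --model-repository=<path_to_model>/models_gpu0 \
    --http-port 8000 --grpc-port 8001 --metrics-port 8002 &
CUDA_VISIBLE_DEVICES=1 tritonserver --model-repository=<path_to_model>/models_gpu1 \
    --http-port 9000 --grpc-port 9001 --metrics-port 9002 &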