Description
1. Issue or feature description
```
ctr run --rm -t --gpus 0 general-vllm-infer-service:0.1.8 test nvidia-smi
ctr: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: ldcache error: process /sbin/ldconfig.real failed with error code: 1: unknown
```
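The failing step is ldconfig. A quick hedged check (my assumption: /sbin/ldconfig.real is a Debian/Ubuntu path, while CentOS 7 only ships /sbin/ldconfig):

```sh
# Does the path the error complains about actually exist on this host?
ls -l /sbin/ldconfig /sbin/ldconfig.real
# Which ldconfig does the toolkit config point at? (a leading "@" means a host path)
grep ldconfig /etc/nvidia-container-runtime/config.toml
```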
Here is my /etc/containerd/config.toml:
```toml
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"
      disable_snapshot_annotations = true
      discard_unpacked_layers = false
      ignore_rdt_not_enabled_errors = false
      no_pivot = false
      systemd_cgroup = false

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          base_runtime_spec = ""
          cni_conf_dir = ""
          cni_max_conf_num = 0
          container_annotations = []
          pod_annotations = []
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_path = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
            CriuImagePath = ""
            CriuPath = ""
            CriuWorkPath = ""
            IoGid = 0
            IoUid = 0
            NoNewKeyring = false
            NoPivotRoot = false
            Root = ""
            ShimCgroup = ""
            SystemdCgroup = false

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          base_runtime_spec = ""
          cni_conf_dir = ""
          cni_max_conf_num = 0
          container_annotations = []
          pod_annotations = []
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_path = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            BinaryName = ""
            CriuImagePath = ""
            CriuPath = ""
            CriuWorkPath = ""
            IoGid = 0
            IoUid = 0
            NoNewKeyring = false
            NoPivotRoot = false
            Root = ""
            ShimCgroup = ""
            SystemdCgroup = false
```
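As far as I understand, ctr talks to containerd's core API directly and bypasses the CRI plugin, so the `default_runtime_name = "nvidia"` above should only apply to CRI clients such as Kubernetes. A hedged workaround sketch (assuming this ctr version supports `--runc-binary` and that the legacy hook mode honors `NVIDIA_VISIBLE_DEVICES`) is to select the NVIDIA runtime explicitly instead of using `--gpus`:

```sh
# Make the runc shim exec nvidia-container-runtime instead of plain runc
ctr run --rm -t \
  --runtime io.containerd.runc.v2 \
  --runc-binary /usr/bin/nvidia-container-runtime \
  --env NVIDIA_VISIBLE_DEVICES=0 \
  general-vllm-infer-service:0.1.8 test nvidia-smi
```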
Using docker, it works:
```
docker run --rm --gpus all general-vllm-infer-service:0.1.8 nvidia-smi
Tue Jun 10 14:57:12 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.12              Driver Version: 550.90.12      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10                     Off |   00000000:05:00.0 Off |                    0 |
|  0%   46C    P0             62W /  150W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```
2. Steps to reproduce the issue
```
ctr run --rm -t --gpus 0 general-vllm-infer-service:0.1.8 test nvidia-smi
```
Environment:
- Host OS: CentOS 7
- containerd: 1.6.33-3.1
- docker: 26.1.4
- nvidia-container-toolkit: 1.17.8
3. Differences between them
Comparing the OCI specs generated by containerd and docker, the significant difference is shown below. Additionally, logs are written to /var/log/nvidia-container-runtime.log and /var/log/nvidia-container-toolkit.log when using docker run, but nothing is written when using ctr run.
/var/log/nvidia-container-toolkit.log:

```
I0610 03:24:34.656249 3245 nvc.c:396] initializing library context (version=1.17.8, build=6eda4d76c8c5f8fc174e4abca83e513fb4dd63b0)
I0610 03:24:34.656307 3245 nvc.c:367] using root /
I0610 03:24:34.656311 3245 nvc.c:368] using ldcache /etc/ld.so.cache
I0610 03:24:34.656313 3245 nvc.c:369] using unprivileged user 65534:65534
I0610 03:24:34.656330 3245 nvc.c:413] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0610 03:24:34.656370 3245 nvc.c:415] dxcore initialization failed, continuing assuming a non-WSL environment
I0610 03:24:34.659852 3260 nvc.c:278] loading kernel module nvidia
I0610 03:24:34.660003 3260 nvc.c:282] running mknod for /dev/nvidiactl
I0610 03:24:34.660093 3260 nvc.c:286] running mknod for /dev/nvidia0
I0610 03:24:34.660132 3260 nvc.c:290] running mknod for all nvcaps in /dev/nvidia-caps
I0610 03:24:34.663391 3260 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabilities/mig/config
I0610 03:24:34.663505 3260 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap2 from /proc/driver/nvidia/capabilities/mig/monitor
I0610 03:24:34.665444 3260 nvc.c:304] loading kernel module nvidia_uvm
I0610 03:24:34.665496 3260 nvc.c:308] running mknod for /dev/nvidia-uvm
I0610 03:24:34.665561 3260 nvc.c:313] loading kernel module nvidia_modeset
I0610 03:24:34.665594 3260 nvc.c:317] running mknod for /dev/nvidia-modeset
```
/var/log/nvidia-container-runtime.log:

```
{"level":"debug","msg":"Checking candidate '/usr/bin/runc'","time":"2025-06-10T03:24:31Z"}
{"level":"debug","msg":"Found 1 candidates; ignoring further candidates","time":"2025-06-10T03:24:31Z"}
{"level":"debug","msg":"Using bundle directory: /run/containerd/io.containerd.runtime.v2.task/moby/2978ebc37fded9a3b952a46eb0c9d3fa1186d3101c2da522fc06c96b35098a3a","time":"2025-06-10T03:24:31Z"}
{"level":"info","msg":"Using OCI specification file path: /run/containerd/io.containerd.runtime.v2.task/moby/2978ebc37fded9a3b952a46eb0c9d3fa1186d3101c2da522fc06c96b35098a3a/config.json","time":"2025-06-10T03:24:31Z"}
{"level":"debug","msg":"Is WSL-based system? false: could not load DXCore library: libdxcore.so: cannot open shared object file: No such file or directory","time":"2025-06-10T03:24:31Z"}
{"level":"debug","msg":"Is Tegra-based system? false: /sys/devices/soc0/family file not found","time":"2025-06-10T03:24:31Z"}
{"level":"debug","msg":"Is NVML-based system? true: found NVML library","time":"2025-06-10T03:24:31Z"}
{"level":"debug","msg":"Has only integrated GPUs? false: device \"NVIDIA A10\" does not use nvgpu module","time":"2025-06-10T03:24:32Z"}
```
4. Two questions to be solved
Question 1: What is wrong with my image `general-vllm-infer-service:0.1.8`? Why does it work with docker but not with containerd?
Question 2: Why are the generated container specs different, and how can I get the nvidia-container-toolkit logs when using containerd?
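For the logging part of question 2, a hedged sketch of what I have pieced together from the toolkit docs: debug output is controlled by /etc/nvidia-container-runtime/config.toml, but it presumably only takes effect when nvidia-container-runtime is actually invoked, which may not happen for `ctr run --gpus`:

```toml
# /etc/nvidia-container-runtime/config.toml (sketch; the log paths below are my
# assumption of the defaults, adjust as needed)
[nvidia-container-cli]
debug = "/var/log/nvidia-container-toolkit.log"

[nvidia-container-runtime]
debug = "/var/log/nvidia-container-runtime.log"
log-level = "debug"
```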