
container run failed when using containerd instead of docker #1138

Open
@williamfpx

Description


1. Issue or feature description
```
$ ctr run --rm -t --gpus 0 general-vllm-infer-service:0.1.8 test nvidia-smi
ctr: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: ldcache error: process /sbin/ldconfig.real failed with error code: 1: unknown
```
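Note the failing path: /sbin/ldconfig.real is the Debian/Ubuntu name for the host's ldconfig, while CentOS 7 only ships /sbin/ldconfig, so any hook that hardcodes the .real suffix fails here. A quick check (assuming a stock CentOS 7 host):

```
# CentOS 7 only ships /sbin/ldconfig; the ".real" name is a Debian/Ubuntu convention
ls -l /sbin/ldconfig /sbin/ldconfig.real

# the ldconfig path the toolkit uses when invoked through nvidia-container-runtime
# (the "@" prefix means "run this binary from the host")
grep ldconfig /etc/nvidia-container-runtime/config.toml
```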
Here is /etc/containerd/config.toml:

```toml
[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "nvidia"
disable_snapshot_annotations = true
discard_unpacked_layers = false
ignore_rdt_not_enabled_errors = false
no_pivot = false
systemd_cgroup = false

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
      base_runtime_spec = ""
      cni_conf_dir = ""
      cni_max_conf_num = 0
      container_annotations = []
      pod_annotations = []
      privileged_without_host_devices = false
      runtime_engine = ""
      runtime_path = ""
      runtime_root = ""
      runtime_type = "io.containerd.runc.v2"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
        BinaryName = "/usr/bin/nvidia-container-runtime"
        CriuImagePath = ""
        CriuPath = ""
        CriuWorkPath = ""
        IoGid = 0
        IoUid = 0
        NoNewKeyring = false
        NoPivotRoot = false
        Root = ""
        ShimCgroup = ""
        SystemdCgroup = false

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
      base_runtime_spec = ""
      cni_conf_dir = ""
      cni_max_conf_num = 0
      container_annotations = []
      pod_annotations = []
      privileged_without_host_devices = false
      runtime_engine = ""
      runtime_path = ""
      runtime_root = ""
      runtime_type = "io.containerd.runc.v2"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
        BinaryName = ""
        CriuImagePath = ""
        CriuPath = ""
        CriuWorkPath = ""
        IoGid = 0
        IoUid = 0
        NoNewKeyring = false
        NoPivotRoot = false
        Root = ""
        ShimCgroup = ""
        SystemdCgroup = false

```
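Note that everything under [plugins."io.containerd.grpc.v1.cri"] only applies to CRI clients (Kubernetes, crictl); plain ctr talks to containerd directly and never reads default_runtime_name. With ctr, --gpus 0 instead injects a prestart hook that calls nvidia-container-cli itself. A possible workaround (a sketch, not verified here) is to point ctr at the nvidia runtime binary explicitly and let its hook do the device injection:

```
# run through the nvidia runtime explicitly, since plain `ctr` never consults
# the CRI plugin's default_runtime_name; NVIDIA_VISIBLE_DEVICES drives the
# legacy hook's injection (drop --gpus so the hook is not added twice)
ctr run --rm -t \
  --runc-binary /usr/bin/nvidia-container-runtime \
  --env NVIDIA_VISIBLE_DEVICES=0 \
  general-vllm-infer-service:0.1.8 test nvidia-smi
```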

Using docker, it works:

```
$ docker run --rm --gpus all general-vllm-infer-service:0.1.8 nvidia-smi
Tue Jun 10 14:57:12 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.12              Driver Version: 550.90.12      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10                     Off |   00000000:05:00.0 Off |                    0 |
|  0%   46C    P0             62W /  150W |       1MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```
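As far as I can tell, the docker path succeeds because --gpus is handled by nvidia-container-runtime-hook (installed by the toolkit), which reads /etc/nvidia-container-runtime/config.toml and therefore picks up the distro-correct ldconfig path. Two quick checks:

```
# which runtimes docker knows about
docker info | grep -i -A3 runtimes

# the hook binary docker's --gpus path relies on
command -v nvidia-container-runtime-hook
```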

2. Steps to reproduce the issue

```
$ ctr run --rm -t --gpus 0 general-vllm-infer-service:0.1.8 test nvidia-smi
```

Environment:
Host OS: CentOS 7
containerd: 1.6.33-3.1
docker: 26.1.4
nvidia-container-toolkit: 1.17.8
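Before comparing engines, the toolkit itself can be sanity-checked on the host; this invocation is from NVIDIA's troubleshooting docs (-k loads kernel modules, -d sends debug output to the terminal):

```
# exercise nvidia-container-cli directly, independent of docker/containerd
nvidia-container-cli -k -d /dev/tty info
```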

3. What differences exist between them

Comparing the specs generated by containerd and docker, the significant difference is below:

[Image: diff of the OCI runtime specs generated by containerd and docker]

Additionally, when using docker run, logs appear in /var/log/nvidia-container-runtime.log and /var/log/nvidia-container-toolkit.log, but nothing is written when using ctr run.
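That matches the code paths: those debug sinks are configured in /etc/nvidia-container-runtime/config.toml and are only honored when nvidia-container-runtime / nvidia-container-runtime-hook actually run, i.e. on the docker path; ctr run --gpus invokes nvidia-container-cli directly and skips both binaries, hence no log files. To confirm what is configured:

```
# the debug destinations only apply when the nvidia runtime/hook is invoked;
# plain `ctr run --gpus` bypasses both, so these files stay empty
grep -n debug /etc/nvidia-container-runtime/config.toml
```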

/var/log/nvidia-container-toolkit.log:

```
I0610 03:24:34.656249 3245 nvc.c:396] initializing library context (version=1.17.8, build=6eda4d76c8c5f8fc174e4abca83e513fb4dd63b0)
I0610 03:24:34.656307 3245 nvc.c:367] using root /
I0610 03:24:34.656311 3245 nvc.c:368] using ldcache /etc/ld.so.cache
I0610 03:24:34.656313 3245 nvc.c:369] using unprivileged user 65534:65534
I0610 03:24:34.656330 3245 nvc.c:413] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0610 03:24:34.656370 3245 nvc.c:415] dxcore initialization failed, continuing assuming a non-WSL environment
I0610 03:24:34.659852 3260 nvc.c:278] loading kernel module nvidia
I0610 03:24:34.660003 3260 nvc.c:282] running mknod for /dev/nvidiactl
I0610 03:24:34.660093 3260 nvc.c:286] running mknod for /dev/nvidia0
I0610 03:24:34.660132 3260 nvc.c:290] running mknod for all nvcaps in /dev/nvidia-caps
I0610 03:24:34.663391 3260 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabilities/mig/config
I0610 03:24:34.663505 3260 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap2 from /proc/driver/nvidia/capabilities/mig/monitor
I0610 03:24:34.665444 3260 nvc.c:304] loading kernel module nvidia_uvm
I0610 03:24:34.665496 3260 nvc.c:308] running mknod for /dev/nvidia-uvm
I0610 03:24:34.665561 3260 nvc.c:313] loading kernel module nvidia_modeset
I0610 03:24:34.665594 3260 nvc.c:317] running mknod for /dev/nvidia-modeset
```

/var/log/nvidia-container-runtime.log:

```
{"level":"debug","msg":"Checking candidate '/usr/bin/runc'","time":"2025-06-10T03:24:31Z"}
{"level":"debug","msg":"Found 1 candidates; ignoring further candidates","time":"2025-06-10T03:24:31Z"}
{"level":"debug","msg":"Using bundle directory: /run/containerd/io.containerd.runtime.v2.task/moby/2978ebc37fded9a3b952a46eb0c9d3fa1186d3101c2da522fc06c96b35098a3a","time":"2025-06-10T03:24:31Z"}
{"level":"info","msg":"Using OCI specification file path: /run/containerd/io.containerd.runtime.v2.task/moby/2978ebc37fded9a3b952a46eb0c9d3fa1186d3101c2da522fc06c96b35098a3a/config.json","time":"2025-06-10T03:24:31Z"}
{"level":"debug","msg":"Is WSL-based system? false: could not load DXCore library: libdxcore.so: cannot open shared object file: No such file or directory","time":"2025-06-10T03:24:31Z"}
{"level":"debug","msg":"Is Tegra-based system? false: /sys/devices/soc0/family file not found","time":"2025-06-10T03:24:31Z"}
{"level":"debug","msg":"Is NVML-based system? true: found NVML library","time":"2025-06-10T03:24:31Z"}
{"level":"debug","msg":"Has only integrated GPUs? false: device "NVIDIA A10" does not use nvgpu module","time":"2025-06-10T03:24:32Z"}

4. Two questions to be solved

Question 1: What's wrong with my image general-vllm-infer-service:0.1.8? Why does it work using docker but not using containerd?

Question 2: The container specs are different. How does that happen? And how can I get the nvidia-container-toolkit log when using containerd?
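On question 2, to make the comparison reproducible, both generated OCI specs can be dumped and their hooks sections diffed. The commands below are a sketch (they assume jq is installed; the moby bundle ID is the one that appears in nvidia-container-runtime.log above):

```
# spec containerd generated for the ctr container named "test"
ctr containers info test | jq .Spec.hooks

# spec docker generated; the bundle path appears in nvidia-container-runtime.log
jq .hooks /run/containerd/io.containerd.runtime.v2.task/moby/<container-id>/config.json
```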
