For containerized MPI applications, the host MPI library and its dependencies are swapped into the container at runtime to enable cross-node communication. However, one critical dependency not brought in is glibc. Due to glibc's backward compatibility, the swapped-in MPI library functions correctly in container images with newer glibc versions. But for images with older glibc, applications often fail to run with the swapped-in MPI library. The glibc version in the image must meet or exceed the highest version required by the MPI library or its dependencies.
The error might look like this:
/app/check-mpi: /lib64/libc.so.6: version `GLIBC_2.27' not found (required by /opt/udiImage/modules/mpich/dep/libfabric.so.1)
/app/check-mpi: /lib64/libc.so.6: version `GLIBC_2.26' not found (required by /opt/udiImage/modules/mpich/dep/libfabric.so.1)
/app/check-mpi: /lib64/libm.so.6: version `GLIBC_2.26' not found (required by /opt/udiImage/modules/mpich/dep/libgfortran.so.5)
/app/check-mpi: /lib64/libc.so.6: version `GLIBC_2.27' not found (required by /opt/udiImage/modules/mpich/dep/libcxi.so.1)
/app/check-mpi: /lib64/libc.so.6: version `GLIBC_2.27' not found (required by /opt/udiImage/modules/mpich/dep/libssh.so.4)
/app/check-mpi: /lib64/libc.so.6: version `GLIBC_2.27' not found (required by /opt/udiImage/modules/mpich/dep/libgssapi_krb5.so.2)
/app/check-mpi: /lib64/libselinux.so.1: no version information available (required by /opt/udiImage/modules/mpich/dep/libkrb5support.so.0)
In this example, the application requires glibc version 2.27 or above, but the image contains glibc version 2.25.
[root@nid200005 app]# /lib64/libc-2.25.so
GNU C Library (GNU libc) stable release version 2.25, by Roland McGrath et al.
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 7.2.1 20170915 (Red Hat 7.2.1-2).
Available extensions:
crypt add-on version 2.1 by Michael Glad and others
GNU Libidn by Simon Josefsson
Native POSIX Threads Library by Ulrich Drepper et al
BIND-8.2.3-T5B
libc ABIs: UNIQUE IFUNC
For bug reporting instructions, please see:
<http://www.gnu.org/software/libc/bugs.html>.
You can verify this using:
objdump -T <shared_lib> | grep GLIBC | sed 's/.*GLIBC_\([.0-9]*\).*/\1/g' | sort -V | tail -n 1This command will reveal the highest glibc version required by the application’s dependencies. In this case, the maximum requirement is GLIBC_2.27 for libfabric.so.
The following commands lists the glibc version requirements for each swapped-in library:
for i in $(ls /usr/lib/shifter/mpich-2.2/*.so*); do
echo $i;
objdump -T $i | grep GLIBC | sed 's/.*GLIBC_\([.0-9]*\).*/\1/g' | sort -V | tail -n 1;
done
for i in $(ls /usr/lib/shifter/mpich-2.2/dep/*.so*); do
echo $i;
objdump -T $i | grep GLIBC | sed 's/.*GLIBC_\([.0-9]*\).*/\1/g' | sort -V | tail -n 1;
done
- Runtime Modifications: Modifying the image at runtime using tools like
patchelfis impractical, especially for large-scale jobs (srun -n <many_num_jobs>). - Image Rebuild: Rebuilding the image with pre-applied fixes is possible but not ideal. A better solution avoids rebuilding.
Adding only libc.so.6 from the host fails because glibc involves multiple interdependent libraries. For example:
[root@nid200005 scratch]# LD_LIBRARY_PATH=/scratch/libc_only:$LD_LIBRARY_PATH /app/check-mpi
/app/check-mpi: /lib64/libm.so.6: version `GLIBC_2.26' not found (required by /opt/udiImage/modules/mpich/dep/libgfortran.so.5)
Here, libgfortran.so.5 depends on libm.so with version GLIBC_2.26.
Even with all related libraries, the following error occurs:
[root@nid200005 scratch]# LD_LIBRARY_PATH=/scratch/libc_libm:$LD_LIBRARY_PATH /app/check-mpi
/app/check-mpi: /lib64/libselinux.so.1: no version information available (required by /opt/udiImage/modules/mpich/dep/libkrb5support.so.0)
/app/check-mpi: relocation error: /scratch/libc_libm/libc.so.6: symbol _dl_exception_create, version GLIBC_PRIVATE not defined in file ld-linux-x86-64.so.2 with link time reference
This happens because ld-linux.so.2 and libc.so.6 are mismatched. The executable uses the hardcoded ld-linux.so.2 path from its link time, ignoring alternatives in LD_LIBRARY_PATH.
Nice try! You cannot.
[root@nid200005 scratch]# cp /scratch/host_lib64/ld-2.31.so /lib64/
[root@nid200005 scratch]# cp /scratch/host_lib64/ld-linux-x86-64.so.2 /lib64/
cp: overwrite '/lib64/ld-linux-x86-64.so.2'? y
cp: cannot create regular file '/lib64/ld-linux-x86-64.so.2': Text file busyThe reason is simple: almost every binary depends on it! Even cp itself relies on it.
[root@nid200005 /]# ldd /bin/cp
linux-vdso.so.1 (0x00007ffd965b1000)
libselinux.so.1 => /lib64/libselinux.so.1 (0x00007f0981336000)
libacl.so.1 => /lib64/libacl.so.1 (0x00007f098112d000)
libattr.so.1 => /lib64/libattr.so.1 (0x00007f0980f28000)
libc.so.6 => /lib64/libc.so.6 (0x00007f0980b53000)
libpcre.so.1 => /lib64/libpcre.so.1 (0x00007f09808e1000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f09806dd000)
/lib64/ld-linux-x86-64.so.2 (0x00007f0981780000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f09804be000)If we cannot change the system ld-linux-x86-64.so.2, how can we use an alternate version?
The path to the interpreter (ld-linux.so.2) is hardcoded into the binary at link time. The obvious solution is to use patchelf to rewrite it. However, we want to avoid modifying the binaries directly.
For reference, here’s how to use patchelf for this purpose. Note that this example includes the full suite of GLIBC libraries:
# Using release-0.13 branch of https://github.com/NixOS/patchelf.git
# ./bootstrap.sh && ./configure && make -j20 && make install
[root@nid200005 app]# patchelf --set-interpreter /scratch/host_lib64/ld-2.31.so check-mpi
[root@nid200005 app]# LD_LIBRARY_PATH=/scratch/host_lib64 ./check-mpi
Hello from rank 0, on nid200005. (core affinity = 0-255)
This brings us to the final solution. What happens if you simply execute /lib64/ld-linux-x86-64.so.2 on the command line?
[root@nid200005 host_lib64]# /lib64/ld-linux-x86-64.so.2
Usage: ld.so [OPTION]... EXECUTABLE-FILE [ARGS-FOR-PROGRAM...]
You have invoked `ld.so', the helper program for shared library executables.
This program usually lives in the file `/lib/ld.so', and special directives
in executable files using ELF shared libraries tell the system's program
loader to load the helper program from this file. This helper program loads
the shared libraries needed by the program executable, prepares the program
to run, and runs it. You may invoke this helper program directly from the
command line to load and run an ELF executable file; this is like executing
that file itself, but always uses this helper program from the file you
specified, instead of the helper program file specified in the executable
file you run. This is mostly of use for maintainers to test new versions
of this helper program; chances are you did not intend to run this program.
--list list all dependencies and how they are resolved
--verify verify that given object really is a dynamically linked
object we can handle
--inhibit-cache Do not use /etc/ld.so.cache
--library-path PATH use given PATH instead of content of the environment
variable LD_LIBRARY_PATH
--inhibit-rpath LIST ignore RUNPATH and RPATH information in object names
in LIST
--audit LIST use objects named in LIST as auditors
READ THE OUTPUT CAREFULLY! The solution is right there. It is as simple as running the executable with the alternate ld.so like the following:
[root@nid200005 /]# /scratch/host_lib64/ld-linux-x86-64.so.2 \
> --library-path /scratch/host_lib64:/opt/udiImage/modules/mpich:/opt/udiImage/modules/mpich/dep \
> /app/check-mpi
Hello from rank 0, on nid200005. (core affinity = 0-255)dingpf@muller:login02:/mscratch/sd/d/dingpf/mpich-glibc-swap> salloc --nodes 2 --qos interactive --time 04:00:00 --constraint cpu
salloc: Pending job allocation 893410
salloc: job 893410 queued and waiting for resources
salloc: job 893410 has been allocated resources
salloc: Granted job allocation 893410
salloc: Waiting for resource configuration
salloc: Nodes nid[200003-200004] are ready for job
dingpf@nid200003:/mscratch/sd/d/dingpf/mpich-glibc-swap> cat script/srun-glibc-swap-fedora.sh
srun --ntasks-per-node 4 -N 2 podman-hpc run --rm --mpi \
-v $SCRATCH:/scratch \
ghcr.io/dingp/fedora:26-mpich \
/scratch/host_lib64/ld-linux-x86-64.so.2 \
--library-path /scratch/host_lib64:/opt/udiImage/modules/mpich:/opt/udiImage/modules/mpich/dep \
/app/check-mpi
dingpf@nid200003:/mscratch/sd/d/dingpf/mpich-glibc-swap> . script/srun-glibc-swap.sh
Hello from rank 3, on nid200003. (core affinity = 0-255)
Hello from rank 2, on nid200003. (core affinity = 0-255)
Hello from rank 1, on nid200003. (core affinity = 0-255)
Hello from rank 5, on nid200004. (core affinity = 0-255)
Hello from rank 6, on nid200004. (core affinity = 0-255)
Hello from rank 7, on nid200004. (core affinity = 0-255)
Hello from rank 4, on nid200004. (core affinity = 0-255)
Hello from rank 0, on nid200003. (core affinity = 0-255)
It works!
Exploring this topic was a fun experience like going down a rabbit hole.
The solution here contains two key parts:
- bring the full suite of GLIBC libraries from the host, including the matched
ld.so; - use the alternate
ld.soto run he executable.
The full example is available in this repository, which includes:
container/fedora-26.Dockerfile: A Dockerfile for the container image based onfedora:26.container/ubuntu-16.04.Dockerfile: A Dockerfile for the container image based onubuntu:16.04.app/xthi-mpi.c: Source code for a simple MPI application used for testing (sourced from NERSC documentation).script/create_host_lib64.sh: A script to gather GLIBC library bundles from the host (valid formullerorperlmutteras of 2024-12-07). Note: This script could be improved to eliminate hard-coded library versions.- Three scripts designed for compute nodes:
script/run-fedora-mpi-it.sh: Runs the container interactively with all required libraries volume-mounted, enabling live testing.script/srun-glibc-swap-fedora.sh: Executes the MPI application using the appropriateld.soloader (usingfedora:26base image).script/srun-no-glibc-swap.sh: Demonstrates failure due to mismatchedGLIBCversions required by the MPI libraries or their dependencies.script/run-ubuntu-mpi-it.sh: Similar as above, but for the image based onubuntu:16.04;script/srun-glibc-swap-ubuntu.sh: Similar as above, but for the image based onubuntu:16.04;
While searching for a base image with an older GLIBC version, I initially considered RHEL7 equivalents like ScientificLinux 7 and CentOS 7. Unfortunately, these proved unusable due to the lack of active repository mirrors. Without access to mirrors, a minimal DockerHub image becomes useless, as gcc and other utilities are required. Fortunately, Fedora 26 still provides active mirrors and includes an appropriately aged GLIBC version.
- Test with GPU-aware cray-mpich on GPU nodes;
- A wild test is to do this with a Linux distro using the
muslimplementation of libc. ReplacingGLIBCwithmuslhere for all the swapped MPI libraries and dependencies is unlikely going to work, as that depends a lot on those libraries to not use any features exclusively implemented inGLIBC...