Skip to content

dingp/mpich-glibc-swap

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Bringing Newer glibc into Older Images for MPICH Host Swap

The Issue to Solve

For containerized MPI applications, the host MPI library and its dependencies are swapped into the container at runtime to enable cross-node communication. However, one critical dependency not brought in is glibc. Due to glibc's backward compatibility, the swapped-in MPI library functions correctly in container images with newer glibc versions. But for images with older glibc, applications often fail to run with the swapped-in MPI library. The glibc version in the image must meet or exceed the highest version required by the MPI library or its dependencies.

The error might look like this:

/app/check-mpi: /lib64/libc.so.6: version `GLIBC_2.27' not found (required by /opt/udiImage/modules/mpich/dep/libfabric.so.1)
/app/check-mpi: /lib64/libc.so.6: version `GLIBC_2.26' not found (required by /opt/udiImage/modules/mpich/dep/libfabric.so.1)
/app/check-mpi: /lib64/libm.so.6: version `GLIBC_2.26' not found (required by /opt/udiImage/modules/mpich/dep/libgfortran.so.5)
/app/check-mpi: /lib64/libc.so.6: version `GLIBC_2.27' not found (required by /opt/udiImage/modules/mpich/dep/libcxi.so.1)
/app/check-mpi: /lib64/libc.so.6: version `GLIBC_2.27' not found (required by /opt/udiImage/modules/mpich/dep/libssh.so.4)
/app/check-mpi: /lib64/libc.so.6: version `GLIBC_2.27' not found (required by /opt/udiImage/modules/mpich/dep/libgssapi_krb5.so.2)
/app/check-mpi: /lib64/libselinux.so.1: no version information available (required by /opt/udiImage/modules/mpich/dep/libkrb5support.so.0)

In this example, the application requires glibc version 2.27 or above, but the image contains glibc version 2.25.

[root@nid200005 app]# /lib64/libc-2.25.so
GNU C Library (GNU libc) stable release version 2.25, by Roland McGrath et al.
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 7.2.1 20170915 (Red Hat 7.2.1-2).
Available extensions:
        crypt add-on version 2.1 by Michael Glad and others
        GNU Libidn by Simon Josefsson
        Native POSIX Threads Library by Ulrich Drepper et al
        BIND-8.2.3-T5B
libc ABIs: UNIQUE IFUNC
For bug reporting instructions, please see:
<http://www.gnu.org/software/libc/bugs.html>.

You can verify this using:

objdump -T <shared_lib> | grep GLIBC | sed 's/.*GLIBC_\([.0-9]*\).*/\1/g' | sort -V | tail -n 1

This command will reveal the highest glibc version required by the application’s dependencies. In this case, the maximum requirement is GLIBC_2.27 for libfabric.so.

The following commands lists the glibc version requirements for each swapped-in library:

for i in $(ls /usr/lib/shifter/mpich-2.2/*.so*); do
  echo $i;
  objdump -T $i | grep GLIBC | sed 's/.*GLIBC_\([.0-9]*\).*/\1/g' | sort -V | tail -n 1;
done
for i in $(ls /usr/lib/shifter/mpich-2.2/dep/*.so*); do
  echo $i;
  objdump -T $i | grep GLIBC | sed 's/.*GLIBC_\([.0-9]*\).*/\1/g' | sort -V | tail -n 1;
done

Constraints on Possible Solutions

  • Runtime Modifications: Modifying the image at runtime using tools like patchelf is impractical, especially for large-scale jobs (srun -n <many_num_jobs>).
  • Image Rebuild: Rebuilding the image with pre-applied fixes is possible but not ideal. A better solution avoids rebuilding.

First Attempt: Bring a Newer libc.so

Adding only libc.so.6 from the host fails because glibc involves multiple interdependent libraries. For example:

[root@nid200005 scratch]# LD_LIBRARY_PATH=/scratch/libc_only:$LD_LIBRARY_PATH /app/check-mpi
/app/check-mpi: /lib64/libm.so.6: version `GLIBC_2.26' not found (required by /opt/udiImage/modules/mpich/dep/libgfortran.so.5)

Here, libgfortran.so.5 depends on libm.so with version GLIBC_2.26.

Second Attempt: Bring All Related glibc Libraries

Even with all related libraries, the following error occurs:

[root@nid200005 scratch]# LD_LIBRARY_PATH=/scratch/libc_libm:$LD_LIBRARY_PATH /app/check-mpi
/app/check-mpi: /lib64/libselinux.so.1: no version information available (required by /opt/udiImage/modules/mpich/dep/libkrb5support.so.0)
/app/check-mpi: relocation error: /scratch/libc_libm/libc.so.6: symbol _dl_exception_create, version GLIBC_PRIVATE not defined in file ld-linux-x86-64.so.2 with link time reference

This happens because ld-linux.so.2 and libc.so.6 are mismatched. The executable uses the hardcoded ld-linux.so.2 path from its link time, ignoring alternatives in LD_LIBRARY_PATH.

Third Attempt - Can We Overwrite the System ld-linux.so.2?

Nice try! You cannot.

[root@nid200005 scratch]# cp /scratch/host_lib64/ld-2.31.so /lib64/
[root@nid200005 scratch]# cp /scratch/host_lib64/ld-linux-x86-64.so.2 /lib64/
cp: overwrite '/lib64/ld-linux-x86-64.so.2'? y
cp: cannot create regular file '/lib64/ld-linux-x86-64.so.2': Text file busy

The reason is simple: almost every binary depends on it! Even cp itself relies on it.

[root@nid200005 /]# ldd /bin/cp
        linux-vdso.so.1 (0x00007ffd965b1000)
        libselinux.so.1 => /lib64/libselinux.so.1 (0x00007f0981336000)
        libacl.so.1 => /lib64/libacl.so.1 (0x00007f098112d000)
        libattr.so.1 => /lib64/libattr.so.1 (0x00007f0980f28000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f0980b53000)
        libpcre.so.1 => /lib64/libpcre.so.1 (0x00007f09808e1000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f09806dd000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f0981780000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f09804be000)

If we cannot change the system ld-linux-x86-64.so.2, how can we use an alternate version?


Fourth Attempt - Using an Alternate ld-linux.so.2

The path to the interpreter (ld-linux.so.2) is hardcoded into the binary at link time. The obvious solution is to use patchelf to rewrite it. However, we want to avoid modifying the binaries directly.

For reference, here’s how to use patchelf for this purpose. Note that this example includes the full suite of GLIBC libraries:

# Using release-0.13 branch of https://github.com/NixOS/patchelf.git
# ./bootstrap.sh && ./configure && make -j20 && make install
[root@nid200005 app]# patchelf --set-interpreter /scratch/host_lib64/ld-2.31.so check-mpi
[root@nid200005 app]# LD_LIBRARY_PATH=/scratch/host_lib64 ./check-mpi
Hello from rank 0, on nid200005. (core affinity = 0-255)

This brings us to the final solution. What happens if you simply execute /lib64/ld-linux-x86-64.so.2 on the command line?

[root@nid200005 host_lib64]# /lib64/ld-linux-x86-64.so.2
Usage: ld.so [OPTION]... EXECUTABLE-FILE [ARGS-FOR-PROGRAM...]
You have invoked `ld.so', the helper program for shared library executables.
This program usually lives in the file `/lib/ld.so', and special directives
in executable files using ELF shared libraries tell the system's program
loader to load the helper program from this file.  This helper program loads
the shared libraries needed by the program executable, prepares the program
to run, and runs it.  You may invoke this helper program directly from the
command line to load and run an ELF executable file; this is like executing
that file itself, but always uses this helper program from the file you
specified, instead of the helper program file specified in the executable
file you run.  This is mostly of use for maintainers to test new versions
of this helper program; chances are you did not intend to run this program.

  --list                list all dependencies and how they are resolved
  --verify              verify that given object really is a dynamically linked
                        object we can handle
  --inhibit-cache       Do not use /etc/ld.so.cache
  --library-path PATH   use given PATH instead of content of the environment
                        variable LD_LIBRARY_PATH
  --inhibit-rpath LIST  ignore RUNPATH and RPATH information in object names
                        in LIST
  --audit LIST          use objects named in LIST as auditors

READ THE OUTPUT CAREFULLY! The solution is right there. It is as simple as running the executable with the alternate ld.so like the following:

[root@nid200005 /]# /scratch/host_lib64/ld-linux-x86-64.so.2 \
> --library-path /scratch/host_lib64:/opt/udiImage/modules/mpich:/opt/udiImage/modules/mpich/dep \
> /app/check-mpi
Hello from rank 0, on nid200005. (core affinity = 0-255)

Verifying the Solution with Multi-Node MPI

dingpf@muller:login02:/mscratch/sd/d/dingpf/mpich-glibc-swap> salloc --nodes 2 --qos interactive --time 04:00:00 --constraint cpu
salloc: Pending job allocation 893410
salloc: job 893410 queued and waiting for resources
salloc: job 893410 has been allocated resources
salloc: Granted job allocation 893410
salloc: Waiting for resource configuration
salloc: Nodes nid[200003-200004] are ready for job
dingpf@nid200003:/mscratch/sd/d/dingpf/mpich-glibc-swap> cat script/srun-glibc-swap-fedora.sh
srun --ntasks-per-node 4 -N 2  podman-hpc run --rm  --mpi \
-v $SCRATCH:/scratch \
ghcr.io/dingp/fedora:26-mpich \
/scratch/host_lib64/ld-linux-x86-64.so.2 \
--library-path /scratch/host_lib64:/opt/udiImage/modules/mpich:/opt/udiImage/modules/mpich/dep \
/app/check-mpi

dingpf@nid200003:/mscratch/sd/d/dingpf/mpich-glibc-swap> . script/srun-glibc-swap.sh
Hello from rank 3, on nid200003. (core affinity = 0-255)
Hello from rank 2, on nid200003. (core affinity = 0-255)
Hello from rank 1, on nid200003. (core affinity = 0-255)
Hello from rank 5, on nid200004. (core affinity = 0-255)
Hello from rank 6, on nid200004. (core affinity = 0-255)
Hello from rank 7, on nid200004. (core affinity = 0-255)
Hello from rank 4, on nid200004. (core affinity = 0-255)
Hello from rank 0, on nid200003. (core affinity = 0-255)

It works!

Wrap-up

Exploring this topic was a fun experience like going down a rabbit hole.

The solution here contains two key parts:

  • bring the full suite of GLIBC libraries from the host, including the matched ld.so;
  • use the alternate ld.so to run he executable.

The full example is available in this repository, which includes:

Fun Fact

While searching for a base image with an older GLIBC version, I initially considered RHEL7 equivalents like ScientificLinux 7 and CentOS 7. Unfortunately, these proved unusable due to the lack of active repository mirrors. Without access to mirrors, a minimal DockerHub image becomes useless, as gcc and other utilities are required. Fortunately, Fedora 26 still provides active mirrors and includes an appropriately aged GLIBC version.

Next to do

  • Test with GPU-aware cray-mpich on GPU nodes;
  • A wild test is to do this with a Linux distro using the musl implementation of libc. Replacing GLIBC with musl here for all the swapped MPI libraries and dependencies is unlikely going to work, as that depends a lot on those libraries to not use any features exclusively implemented in GLIBC...

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages