Jailer: Incorrect handling of bind mounts within the rootfs #1089

Closed
mcastelino opened this issue May 10, 2019 · 11 comments · Fixed by #1093
Labels
Type: Bug Indicates an unexpected problem or unintended behavior

Comments

@mcastelino
Contributor

In Kata Containers we bind mount device mapper devices into the chroot location.
This is needed because:

  • hard links cannot cross file system boundaries
  • copy is not feasible as this is actually a block device
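
The hard-link constraint can be checked up front. Below is a minimal sketch, assuming GNU `stat` (the `-c %d` format) is available; `same_filesystem` and `jail_strategy` are hypothetical helper names, not part of Kata or the jailer.

```shell
# Hypothetical helper: succeeds when both paths live on the same filesystem,
# i.e. when a hard link between them is possible at all.
same_filesystem() {
    # GNU stat: %d prints the device number of the filesystem holding the path.
    a=$(stat -c %d "$1") || return 1
    b=$(stat -c %d "$2") || return 1
    [ "$a" = "$b" ]
}

# Hypothetical helper: print how source $1 could be exposed inside jail dir $2.
jail_strategy() {
    src=$1
    jail_root=$2
    if same_filesystem "$src" "$jail_root"; then
        echo "hard-link"
    elif [ -f "$src" ]; then
        echo "copy-or-bind-mount"   # regular file on another filesystem
    else
        echo "bind-mount"           # e.g. a block device node under /dev
    fi
}
```

For a device mapper node such as `/dev/dm-0`, this would normally print `bind-mount`, since `/dev` is usually its own filesystem.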

The jailer does not seem to handle this correctly.
The same bind mount is handled properly when running without the jailer (i.e. just firecracker).

Below are the file hierarchies with and without the jailer.

Here the drive_0 files are bind mounted to device mapper device nodes, which are then passed as drives to firecracker.

brw-rw---- 1 root disk 253, 0 May  3 15:22 /dev/dm-0
brw-rw---- 1 root disk 253, 1 May  3 15:22 /dev/dm-1
brw-rw---- 1 root disk 253, 2 May 10 15:48 /dev/dm-2
brw-rw---- 1 root disk 253, 3 May 10 15:49 /dev/dm-3
/var/lib/firecracker/
├── 34cbb0f3993d35148fb1c5ee424ae97c4f0fd956b8d93079a6f136b1cd38d9ad
│   └── root
│       ├── api.socket
│       ├── drive_0
│       ├── drive_1
│       ├── drive_2
│       ├── drive_3
│       ├── drive_4
│       ├── drive_5
│       ├── drive_6
│       ├── drive_7
│       ├── kata-containers-image_clearlinux_1.7.0-alpha1_agent_e3967e783b9.img
│       └── vmlinux-4.19.28-37
└── 57ab234c96ffab1dddf141d7400234ab310f09a736cee2d0de66f4117ce33e9e
    └── root
        ├── api.socket
        ├── dev
        │   ├── kvm
        │   ├── net
        │   │   └── tun
        │   └── vhost-vsock
        ├── drive_0
        ├── drive_1
        ├── drive_2
        ├── drive_3
        ├── drive_4
        ├── drive_5
        ├── drive_6
        ├── drive_7
        ├── firecracker
        ├── kata-containers-image_clearlinux_1.7.0-alpha1_agent_e3967e783b9.img
        └── vmlinux-4.19.28-37
@mcastelino
Contributor Author

/cc @andreeaflorescu @alexandruag

@mcastelino
Contributor Author

Note: Disabling seccomp ("--seccomp-level", "0") does not help.

@alxiord

alxiord commented May 13, 2019

Indeed, seccomp has nothing to do with this. I think it's because the jailer bind mounts the jail over itself non-recursively. The behavior is easy to reproduce without firecracker:

# Create a jail and a mountpoint
mkdir jail
touch jail/device
file jail/device 
jail/device: empty
# Mount a device in the jail
sudo mount --bind /dev/random jail/device 
file jail/device 
jail/device: character special (1/8)
# Mount the jail over itself
sudo mount --bind jail jail
file jail/device 
jail/device: empty

The device is no longer properly mounted, which looks like your use case.
If we bind mount recursively instead:

sudo mount --rbind jail jail
file jail/device 
jail/device: character special (1/8)
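
To see what each step of the reproduction above does to the inner mount, one option is to list the mounts whose parent is the mount sitting at the jail directory. This is a minimal sketch, assuming the Linux `/proc/self/mountinfo` layout (field 1 is the mount ID, field 2 the parent ID, field 5 the mount point); `visible_submounts` is a hypothetical helper name.

```shell
# List mount points whose parent mount is located at directory $1.
# $1 must be an absolute path, e.g. "$(realpath jail)".
# $2 is the mountinfo file to read (defaults to the current process's view).
visible_submounts() {
    awk -v dir="$1" '
        { parent[NR] = $2; point[NR] = $5 }
        $5 == dir { at_dir[$1] = 1 }          # remember IDs of mounts at dir
        END {
            for (i = 1; i <= NR; i++)
                if ((parent[i] in at_dir) && point[i] != dir)
                    print point[i]
        }' "${2:-/proc/self/mountinfo}"
}
```

After a plain `--bind` of the jail over itself, the inner mount should still be listed in mountinfo, but with the original filesystem as its parent, so this helper prints nothing. After `--rbind`, a copy of the inner mount is attached below the new jail mount and is printed.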

@alxiord alxiord added Feature: Jailing Type: Bug Indicates an unexpected problem or unintended behavior labels May 13, 2019
@alxiord alxiord self-assigned this May 13, 2019
@alxiord

alxiord commented May 13, 2019

Nope, I'm wrong, the jailer keeps the mount intact. @mcastelino can you help with the sequence of operations that create the mounts inside the jail so I can reproduce exactly what you're seeing?

mkdir /srv/jailer/demo_jail/
touch /srv/jailer/demo_jail/mountpoint
file /srv/jailer/demo_jail/mountpoint
/srv/jailer/demo_jail/mountpoint: empty
sudo mount --bind /dev/dm-0  /srv/jailer/demo_jail/mountpoint
file /srv/jailer/demo_jail/mountpoint
/srv/jailer/demo_jail/mountpoint: block special (253/0)
sudo target/x86_64-unknown-linux-musl/debug/jailer --id demo --exec-file $PWD/target/x86_64-unknown-linux-musl/debug/firecracker --node 0 --uid 1234 --gid 1234 --chroot-base-dir /srv/jailer/demo_jail 
file /srv/jailer/demo_jail/mountpoint 
/srv/jailer/demo_jail/mountpoint: block special (253/0)

@mcastelino
Contributor Author

@aghecenco here is a simple sequence of operations.

Run jailer two ways

jailer.sh

curl -fsSL -o hello-vmlinux.bin https://s3.amazonaws.com/spec.ccfc.min/img/hello/kernel/hello-vmlinux.bin
curl -fsSL -o hello-rootfs.ext4 https://s3.amazonaws.com/spec.ccfc.min/img/hello/fsfiles/hello-rootfs.ext4

sudo rm -rf /var/lib/jailer
#Hold original content
sudo mkdir -p /var/lib/jailer/testjail

#Holds the jailed content
sudo mkdir -p /var/lib/jailer/firecracker/testhardlink/root
sudo mkdir -p /var/lib/jailer/firecracker/testbindmount/root


sudo -E cp ./hello-vmlinux.bin /var/lib/jailer/testjail
sudo -E cp ./hello-rootfs.ext4 /var/lib/jailer/testjail

#Using hardlinks
sudo -E ln /var/lib/jailer/testjail/hello-vmlinux.bin /var/lib/jailer/firecracker/testhardlink/root/hello-vmlinux.bin
sudo -E ln /var/lib/jailer/testjail/hello-rootfs.ext4 /var/lib/jailer/firecracker/testhardlink/root/hello-rootfs.ext4

#Using bindmount
sudo -E touch /var/lib/jailer/firecracker/testbindmount/root/hello-vmlinux.bin
sudo -E touch /var/lib/jailer/firecracker/testbindmount/root/hello-rootfs.ext4
sudo -E mount --bind /var/lib/jailer/testjail/hello-vmlinux.bin /var/lib/jailer/firecracker/testbindmount/root/hello-vmlinux.bin
sudo -E mount --bind /var/lib/jailer/testjail/hello-rootfs.ext4 /var/lib/jailer/firecracker/testbindmount/root/hello-rootfs.ext4

sudo -E $HOME/firecracker/build/debug/jailer \
       --id testhardlink \
       --node 0 \
       --exec-file $HOME/firecracker/build/debug/firecracker \
       --uid 0 \
       --gid 0 \
       --chroot-base-dir /var/lib/jailer \
       --seccomp-level 0 \
       --daemonize

sudo -E $HOME/firecracker/build/debug/jailer \
       --id testbindmount \
       --node 0 \
       --exec-file $HOME/firecracker/build/debug/firecracker \
       --uid 0 \
       --gid 0 \
       --chroot-base-dir /var/lib/jailer \
       --seccomp-level 0 \
       --daemonize

Use the following script to launch the VM

launch.sh

curl --unix-socket /var/lib/jailer/firecracker/$1/root/api.socket -i \
    -X PUT 'http://localhost/boot-source'   \
    -H 'Accept: application/json'           \
    -H 'Content-Type: application/json'     \
    -d '{
        "kernel_image_path": "./hello-vmlinux.bin",
        "boot_args": "console=ttyS0 reboot=k panic=1 pci=off"
    }'


curl --unix-socket /var/lib/jailer/firecracker/$1/root/api.socket -i \
    -X PUT 'http://localhost/drives/rootfs' \
    -H 'Accept: application/json'           \
    -H 'Content-Type: application/json'     \
    -d '{
        "drive_id": "rootfs",
        "path_on_host": "./hello-rootfs.ext4",
        "is_root_device": true,
        "is_read_only": false
    }'

curl --unix-socket /var/lib/jailer/firecracker/$1/root/api.socket -i \
    -X PUT 'http://localhost/actions'       \
    -H  'Accept: application/json'          \
    -H  'Content-Type: application/json'    \
    -d '{
        "action_type": "InstanceStart"
     }'

Launch VM through jailer using hardlinks (it works)

sudo sh -x launch.sh testhardlink
+ curl --unix-socket /var/lib/jailer/firecracker/testhardlink/root/api.socket -i -X PUT http://localhost/boot-source -H Accept: application/json -H Content-Type: application/json -d {
        "kernel_image_path": "./hello-vmlinux.bin",
        "boot_args": "console=ttyS0 reboot=k panic=1 pci=off"
    }
HTTP/1.1 204 No Content
Date: Mon, 13 May 2019 19:21:20 GMT

+ curl --unix-socket /var/lib/jailer/firecracker/testhardlink/root/api.socket -i -X PUT http://localhost/drives/rootfs -H Accept: application/json -H Content-Type: application/json -d {
        "drive_id": "rootfs",
        "path_on_host": "./hello-rootfs.ext4",
        "is_root_device": true,
        "is_read_only": false
    }
HTTP/1.1 204 No Content
Date: Mon, 13 May 2019 19:21:20 GMT

+ curl --unix-socket /var/lib/jailer/firecracker/testhardlink/root/api.socket -i -X PUT http://localhost/actions -H Accept: application/json -H Content-Type: application/json -d {
        "action_type": "InstanceStart"
     }
HTTP/1.1 204 No Content
Date: Mon, 13 May 2019 19:21:20 GMT

Launch VM through jailer using bind mounts (it fails)

#Kill the previous instance and launch a new one
sudo killall firecracker 
sudo sh -x launch.sh testbindmount
+ curl --unix-socket /var/lib/jailer/firecracker/testbindmount/root/api.socket -i -X PUT http://localhost/boot-source -H Accept: application/json -H Content-Type: application/json -d {
        "kernel_image_path": "./hello-vmlinux.bin",
        "boot_args": "console=ttyS0 reboot=k panic=1 pci=off"
    }
HTTP/1.1 204 No Content
Date: Mon, 13 May 2019 19:19:55 GMT

+ curl --unix-socket /var/lib/jailer/firecracker/testbindmount/root/api.socket -i -X PUT http://localhost/drives/rootfs -H Accept: application/json -H Content-Type: application/json -d {
        "drive_id": "rootfs",
        "path_on_host": "./hello-rootfs.ext4",
        "is_root_device": true,
        "is_read_only": false
    }
HTTP/1.1 204 No Content
Date: Mon, 13 May 2019 19:19:55 GMT

+ curl --unix-socket /var/lib/jailer/firecracker/testbindmount/root/api.socket -i -X PUT http://localhost/actions -H Accept: application/json -H Content-Type: application/json -d {
        "action_type": "InstanceStart"
     }
HTTP/1.1 400 Bad Request
Content-Type: application/json
Transfer-Encoding: chunked
Date: Mon, 13 May 2019 19:19:55 GMT

{
  "fault_message": "Cannot load kernel due to invalid memory configuration or invalid kernel image. Failed to read ELF header"
}

@mcastelino
Contributor Author

See the last error; it comes from firecracker being unable to read the backing file:

"fault_message": "Cannot load kernel due to invalid memory configuration or invalid kernel image. Failed to read ELF header"

Further, the jailed locations are set up correctly, as seen below:

root@mrcastel-DESK:/var/lib# tree jailer/
jailer/
├── firecracker
│   ├── testbindmount
│   │   └── root
│   │       ├── api.socket
│   │       ├── dev
│   │       │   ├── kvm
│   │       │   ├── net
│   │       │   │   └── tun
│   │       │   └── vhost-vsock
│   │       ├── firecracker
│   │       ├── hello-rootfs.ext4
│   │       └── hello-vmlinux.bin
│   └── testhardlink
│       └── root
│           ├── api.socket
│           ├── dev
│           │   ├── kvm
│           │   ├── net
│           │   │   └── tun
│           │   └── vhost-vsock
│           ├── firecracker
│           ├── hello-rootfs.ext4
│           └── hello-vmlinux.bin
└── testjail
    ├── hello-rootfs.ext4
    └── hello-vmlinux.bin

10 directories, 16 files
root@mrcastel-DESK:/var/lib# ls -alp jailer/*/*/root
jailer/firecracker/testbindmount/root:
total 103684
drwxr-xr-x 3 root root     4096 May 13 12:21 ./
drwxr-xr-x 3 root root     4096 May 13 12:21 ../
srwxr-xr-x 1 root root        0 May 13 12:21 api.socket
drwxr-xr-x 3 root root     4096 May 13 12:21 dev/
-rwxr-xr-x 1 root root 53435664 May 13 12:21 firecracker
-rw-r--r-- 2 root root 31457280 May 13 12:21 hello-rootfs.ext4
-rw-r--r-- 2 root root 21266136 May 13 12:21 hello-vmlinux.bin

jailer/firecracker/testhardlink/root:
total 103684
drwxr-xr-x 3 root root     4096 May 13 12:21 ./
drwxr-xr-x 3 root root     4096 May 13 12:21 ../
srwxr-xr-x 1 root root        0 May 13 12:21 api.socket
drwxr-xr-x 3 root root     4096 May 13 12:21 dev/
-rwxr-xr-x 1 root root 53435664 May 13 12:21 firecracker
-rw-r--r-- 2 root root 31457280 May 13 12:21 hello-rootfs.ext4
-rw-r--r-- 2 root root 21266136 May 13 12:21 hello-vmlinux.bin
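
The "Failed to read ELF header" error is consistent with firecracker, inside the jail, seeing the empty placeholder created by `touch` rather than the bind-mounted kernel, while the listing above reflects the host's mount namespace. That is an interpretation, not something the logs state directly. A minimal sketch for checking what a given path actually contains, assuming `head -c` and `od` are available; `has_elf_magic` is a hypothetical helper name:

```shell
# Succeeds when file $1 starts with the 4-byte ELF magic (0x7f 'E' 'L' 'F').
has_elf_magic() {
    # od prints the bytes as hex pairs; tr strips the spacing and newline.
    [ "$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')" = "7f454c46" ]
}

# Example (not executed here):
#   has_elf_magic ./hello-vmlinux.bin && echo "looks like an ELF image"
```

An empty placeholder file fails this check, so running it against the path as the jailed process sees it would distinguish the two cases.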

@mcastelino
Contributor Author

mcastelino commented May 13, 2019

Another thing to note is that if you change the bind mount to shared for the pivot root, this test will pass if the file already exists in the root. However, if the files are dropped into the root after firecracker has been launched, we still see the same issue. In Kata we drop in the files after the firecracker launch (but before instance start). We also patch drives after instance start. So in both cases the bind mount happens after jail creation.

    // Bind mount the jail root directory over itself, so we can go around a restriction
    // imposed by pivot_root, which states that the new root and the old root should not
    // be on the same filesystem. Safe because we provide valid parameters.
    SyscallReturnCode(unsafe {
        libc::mount(
            chroot_dir.as_ptr(),
            chroot_dir.as_ptr(),
            null(),
            libc::MS_SHARED | libc::MS_REC | libc::MS_BIND,
            null(),
        )
    })
    .into_empty_result()
    .map_err(Error::MountBind)?;
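
A side note on the flag combination in the snippet above: according to the mount(2) manual page, the kernel selects one operation per call by testing flag bits in a fixed order (remount, bind, propagation change, move, new mount). The sketch below mirrors that order in shell arithmetic. `mount_operation` is a hypothetical helper, and the numeric values are the `MS_*` constants from the Linux headers.

```shell
# MS_* flag values as defined in the Linux headers.
MS_REMOUNT=32 MS_BIND=4096 MS_MOVE=8192 MS_REC=16384
MS_UNBINDABLE=131072 MS_PRIVATE=262144 MS_SLAVE=524288 MS_SHARED=1048576

# Print which kind of operation a mount(2) call with flags $1 would select,
# following the order of tests described in the manual page.
mount_operation() {
    prop_mask=$((MS_SHARED | MS_PRIVATE | MS_SLAVE | MS_UNBINDABLE))
    if [ $(($1 & MS_REMOUNT)) -ne 0 ]; then echo "remount"
    elif [ $(($1 & MS_BIND)) -ne 0 ]; then echo "bind"
    elif [ $(($1 & prop_mask)) -ne 0 ]; then echo "propagation-change"
    elif [ $(($1 & MS_MOVE)) -ne 0 ]; then echo "move"
    else echo "new-mount"
    fi
}

mount_operation $((MS_SHARED | MS_REC | MS_BIND))
mount_operation $((MS_SLAVE | MS_REC))
```

If that reading is right, a call combining `MS_BIND` with `MS_SHARED` performs only the (recursive) bind, and changing the propagation type needs a separate call. This is worth verifying against mount(2) for the kernel in use.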

mcastelino added a commit to mcastelino/firecracker that referenced this issue May 13, 2019
User can bind mount into the chroot location.
This is needed as hard links cannot cross file system boundaries.
Copy is not always feasible (e.g. block devices).

Change the bind mount to be slave, such that host to jail bind
mounts are properly propagated. However, we do not want jail
to host events to propagate back.

Fixes: firecracker-microvm#1089

Signed-off-by: Manohar Castelino <[email protected]>
@mcastelino
Contributor Author

@aghecenco @andreeaflorescu making the jailed mount a slave solves this issue; see #1093.

Please let me know if this still meets the security criterion. This protects the host from the container while allowing the host the freedom to update the jail from the outside.
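
To confirm which propagation type a given mount ended up with, the optional fields in `/proc/self/mountinfo` can be inspected: `shared:N` marks a shared mount and `master:N` marks a slave. A minimal sketch, assuming that layout; `propagation_of` is a hypothetical helper name, and it ignores the `unbindable` tag.

```shell
# Print the propagation type of the mount at mount point $1:
# "shared", "slave", "shared+slave", or "private" (no tags present).
# $2 is the mountinfo file to read (defaults to the current process's view).
propagation_of() {
    awk -v mp="$1" '
        $5 == mp {
            shared = 0; slave = 0
            # Optional fields start at field 7 and end at the "-" separator.
            for (i = 7; i <= NF && $i != "-"; i++) {
                if ($i ~ /^shared:/) shared = 1
                if ($i ~ /^master:/) slave = 1
            }
            kind = "private"
            if (shared && slave) kind = "shared+slave"
            else if (shared) kind = "shared"
            else if (slave) kind = "slave"
        }
        END { if (kind != "") print kind }
    ' "${2:-/proc/self/mountinfo}"
}
```

Run from inside the jail's mount namespace, a result of `slave` for the jail root would match the behavior described here: events from the host arrive, but nothing propagates back.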

@alxiord

alxiord commented May 14, 2019

this test will pass if the file already exists in the root

@mcastelino that's what I missed while testing. Thank you for the detailed investigation and the PR! 👍

@alxiord alxiord assigned mcastelino and alxiord and unassigned alxiord May 14, 2019
alxiord pushed a commit to alxiord/firecracker that referenced this issue May 15, 2019
alxiord pushed a commit to alxiord/firecracker that referenced this issue May 15, 2019
mcastelino added a commit to mcastelino/firecracker that referenced this issue May 15, 2019
@alxiord alxiord assigned alexandruag and unassigned alxiord May 16, 2019
mcastelino added a commit to mcastelino/firecracker that referenced this issue May 16, 2019
@alexandruag
Contributor

alexandruag commented May 17, 2019

Hi @mcastelino,

We stared a bit longer at this, and it turns out there are two things which have to be fixed:

  • Relative to your "Run jailer two ways" example, where you bind mount things before running the jailer (because you know and control the jail root path that will be used on the host), the problem arises when we bind mount the jail root folder on top of itself in:
// Bind mount the jail root directory over itself, so we can go around a restriction
// imposed by pivot_root, which states that the new root and the old root should not
// be on the same filesystem. Safe because we provide valid parameters.
SyscallReturnCode(unsafe {
    libc::mount(
        chroot_dir.as_ptr(),
        chroot_dir.as_ptr(),
        null(),
        libc::MS_BIND,
        null(),
    )
})
.into_empty_result()
.map_err(Error::MountBind)?;

we don't also supply the MS_REC flag, so any inner bind mounts (like the ones you just created) are effectively discarded.

  • The second issue appears when you bind mount things in the parent namespace after the jailer is started, because we change the propagation to MS_PRIVATE, so the two mount namespaces become disconnected, and that's what your PR fixes in its current form.

Can you please update your PR to also add the MS_REC flag to the bind operation? (also seems like you have to run cargo fmt to fix the formatting which is checked automatically by our CI).
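
The two changes described above can be approximated with the `mount` command line. This is only a shell analogue under stated assumptions, not the jailer's Rust code: `run` and `prepare_jail_mounts` are hypothetical helpers, the real commands need root, and they are meant for a new mount namespace whose propagation has not already been reset to private (for example one created with `unshare --mount --propagation unchanged`).

```shell
# Print the command instead of running it when DRY_RUN is set (non-empty).
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# Hypothetical analogue of the two fixes, for jail root directory $1.
prepare_jail_mounts() {
    jail_root=$1
    # Slave instead of private propagation: mount events from the parent
    # namespace are received, but nothing propagates back to the host.
    run mount --make-rslave / &&
    # Recursive self-bind (the MS_BIND | MS_REC case): inner bind mounts
    # created before the jailer starts are kept.
    run mount --rbind "$jail_root" "$jail_root"
}

# Dry run: only prints the two commands.
DRY_RUN=1 prepare_jail_mounts /srv/jailer/demo_jail
```

The dry-run mode is there so the sequence can be reviewed without touching the mount table.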

mcastelino added a commit to mcastelino/firecracker that referenced this issue May 17, 2019
@mcastelino
Contributor Author

Can you please update your PR to also add the MS_REC flag to the bind operation? (also seems like you have to run cargo fmt to fix the formatting which is checked automatically by our CI).

@alexandruag yes, I missed the third case. I fixed it as you suggested and also did the formatting.

alxiord pushed a commit that referenced this issue May 20, 2019
User can bind mount into the chroot location.
This is needed as hard links cannot cross file system boundaries.
Copy is not always feasible (e.g. block devices).

Change the bind mount to be slave, such that host to jail bind
mounts are properly propagated. However, we do not want jail
to host events to propagate back.

Fixes: #1089

Signed-off-by: Manohar Castelino <[email protected]>
anthonycorletti added a commit to anthonycorletti/firecracker that referenced this issue Feb 11, 2025
…ating mounts for the guest kernel and rootfs and mounting them to the jailer root as mentioned in firecracker-microvm#1089 Signed-off-by: Anthony Corletti <[email protected]>
3 participants