|
| 1 | +# Container-Bootfs ([casync](https://github.com/systemd/casync) and [desync](https://github.com/folbricht/desync) inside.) |
| 2 | + |
| 3 | +## About bootfs |
| 4 | +Container Bootfs (hereafter "bootfs") aims to achieve block-level image chunking, de-duplication, and lazy-pull execution without any modification to the container runtime or registry. |
| 5 | +The status of this project is __rough PoC__, and bootfs is incomplete, as described later. |
| 6 | +Currently, we leverage [casync](https://github.com/systemd/casync) and [desync](https://github.com/folbricht/desync) for provisioning the rootfs and lazy-pulling image data. |
| 7 | + |
| 8 | +## Context |
| 9 | +The next generation of the OCI image spec is now under active discussion! |
| 10 | +- https://groups.google.com/a/opencontainers.org/forum/#!topic/dev/icXssT3zQxE |
| 11 | +- https://github.com/openSUSE/umoci/issues/256 |
| 12 | + |
| 13 | +Roughly speaking, the following points about the current OCI image format have been discussed frequently. |
| 14 | +- Inefficiency of layer-level de-duplication. |
| 15 | +- Slow container startup time caused by lack of lazy-pull functionality. |
| 16 | +- Lack of seek functionality in the tar archive format. |
| 17 | + |
| 18 | +So far, several effective concepts have been proposed (in alphabetical order). |
| 19 | +- CernVM-FS Graph Driver Plugin for Docker : [https://cvmfs.readthedocs.io/en/stable/cpt-graphdriver.html](https://cvmfs.readthedocs.io/en/stable/cpt-graphdriver.html) |
| 20 | +- FILEgrain : [https://github.com/AkihiroSuda/filegrain](https://github.com/AkihiroSuda/filegrain) |
| 21 | +- Slacker : [https://www.usenix.org/node/194431](https://www.usenix.org/node/194431) |
| 22 | +- squashfs and stacker : [https://fosdem.org/2019/schedule/event/containers_atomfs/](https://fosdem.org/2019/schedule/event/containers_atomfs/) |
| 23 | +If there are other projects that should be added, please let me know. |
| 24 | + |
| 25 | +## Bootfs is aiming... |
| 26 | +We aim to achieve *block-level de-duplication* of container images *on store*, *on transfer* and *on execution node*, as well as *lazy-pull execution*, *without any modification to the container runtime or registry*. |
| 27 | +To achieve this, we developed an image converter which generates the following data. |
| 28 | +- Boot image: The generated Docker image. This image includes the __boot__ program, which sets up the execution environment using casync and desync (both of which are also included in the image) and then runs the original ENTRYPOINT in the container on boot. Casync provisions the original image's rootfs using FUSE with the included metadata (aka [caibx](https://github.com/systemd/casync#file-suffixes)). Most of the image data is *pulled lazily* from the __Remote store__ on access and cached locally by the desync process. We use desync's [cache functionality](https://github.com/folbricht/desync#caching), so if some blobs are already on the node, desync uses them without pulling them remotely, which leads to *block-level de-duplication on transfer*. If you use a container volume as the local cache, the __local cache__ can be shared among several containers on the node, which achieves *block-level inter-container de-duplication on the node*. This image follows the [Docker image spec](https://github.com/moby/moby/blob/master/image/spec/v1.2.md), so you can pull it from a container registry and run it in the normal way *without modification to the container runtime or registry*. |
| 29 | +- Rootfs blobs: The image's rootfs, chunked at block level with content-defined chunking (CDC). We use casync for chunking. Put these blobs on some cluster-global storage (we call it the __Remote store__). If you store the sets of blobs generated from several images on the same store, you can achieve *block-level de-duplication on the store*. |
| 30 | + |
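The de-duplication effect of a shared blob store can be sketched in shell. This is an illustration under stated assumptions, not how casync works internally: it cuts files into fixed-size chunks with `split` and names them by SHA-256 (GNU coreutils `sha256sum` is assumed), whereas casync uses content-defined chunking and its own store layout. The helper name `store_chunks` and the chunk size are hypothetical.

```shell
#!/bin/sh
# Illustration only: fixed-size chunks named by SHA-256, NOT casync's
# content-defined chunking or its on-disk store layout.
# store_chunks FILE STORE SIZE  (hypothetical helper)
store_chunks() {
    file=$1
    store=$2
    size=$3
    tmp=$(mktemp -d)
    # Cut the file into SIZE-byte pieces.
    split -b "$size" "$file" "$tmp/chunk."
    for c in "$tmp"/chunk.*; do
        [ -e "$c" ] || continue   # empty input: nothing to store
        h=$(sha256sum "$c" | cut -d' ' -f1)
        # A chunk that is already in the store is not written again.
        [ -e "$store/$h" ] || cp "$c" "$store/$h"
    done
    rm -rf "$tmp"
}
```

For example, storing two 12-byte files that share their first 8 bytes with a chunk size of 4 leaves four chunks in the store rather than six, because the two shared chunks are stored only once.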
| 31 | + |
| 32 | +At runtime, the boot program sets up the execution environment, casync provisions the original rootfs using FUSE, and desync pulls rootfs data lazily from the remote store, as mentioned above. |
| 33 | +The remote store can be anything desync supports. |
| 34 | +In our example, we use an SSH server that includes casync. |
| 35 | +See the sample SSH server container's [Dockerfile in this repo](/sample/ssh/Dockerfile), which is quite simple. |
| 36 | +``` |
| 37 | +FROM rastasheep/ubuntu-sshd:latest |
| 38 | +RUN apt update -y && apt install -y casync |
| 39 | +CMD ["/usr/sbin/sshd", "-D"] |
| 40 | +``` |
| 41 | + |
| 42 | + |
| 43 | +When you use a volume as the local cache, you can share it among containers on the node. |
| 44 | +The volume needs to be mounted on a specific path (`/rootfs.castr`). |
| 45 | +It is useful to keep the volume alive with [a simple volume-keeper container](/sample/cache/Dockerfile) like the one below, and share it among containers. |
| 46 | +``` |
| 47 | +FROM busybox:latest |
| 48 | +RUN mkdir /rootfs.castr |
| 49 | +VOLUME /rootfs.castr |
| 50 | +CMD tail -f /dev/null |
| 51 | +``` |
| 52 | + |
| 53 | + |
| 54 | +## TODO list |
| 55 | +Currently, bootfs is a __rough PoC__. |
| 56 | +It is therefore far from complete. |
| 57 | +Some of the TODOs are listed below. |
| 58 | +- [ ] We need to evaluate bootfs quantitatively (__critical!!!__). |
| 59 | +- [ ] We cannot use the container *volume* functionality unless mountpoint placeholders (dummy files or directories) were created on the original rootfs in advance, because the provisioned rootfs is read-only and we cannot create the placeholders at runtime. |
| 60 | +- [ ] The SSH client implementation is very ad hoc. Desync relies on the system's ssh client, and we use [Dropbear](http://matt.ucc.asn.au/dropbear/dropbear.html), which may be fine. However, we inherit the original rootfs's user information configured in `/etc` (including the `/etc/passwd` file, etc.) without creating bootfs-specific information. We also skip known_hosts checking to achieve a fast boot time. |
| 61 | +- [ ] We cannot pull individual blobs from a container registry, which means we cannot merge the boot image and the blobs into one container image and put it on a container registry. This is because we rely on desync to pull blobs, and it doesn't speak the registry API by default. First we need to extend desync to support the registry API, and then merge the boot image and blobs in the way the [FILEgrain project](https://github.com/AkihiroSuda/filegrain) proposes. |
| 62 | +- [ ] The boot image is heavy. However, it would be shared among all containers on the node, thanks to the container runtime's native layer-level de-duplication. We may need to create a lighter binary that combines the boot program's setup, casync's rootfs provisioning, and desync's lazy pulling. |
| 63 | + |
| 64 | +## Play with sample. |
| 65 | + |
| 66 | +### Preparation. |
| 67 | +We use a volume of the [cache container](/sample/cache/Dockerfile) as the node-local blob cache, and an [SSH server container](/sample/ssh/Dockerfile) named ${SSH_SERVER_NAME} as the remote store, which uses the volume ${SSH_SERVER_STORE}. |
| 68 | +We use the ${CONVERTER_OUTPUT_DIR} directory to receive the image converter's output. |
| 69 | +``` |
| 70 | +LOCAL_CACHE_NAME=node-local-cache |
| 71 | +LOCAL_CACHE_STORE=/tmp/node-local-cache |
| 72 | +SSH_SERVER_NAME=ssh-casync-server |
| 73 | +SSH_SERVER_STORE=/tmp/ssh-casync-server-store |
| 74 | +CONVERTER_OUTPUT_DIR=/tmp/converter-output |
| 75 | +mkdir -p ${LOCAL_CACHE_STORE} ${SSH_SERVER_STORE} ${CONVERTER_OUTPUT_DIR} |
| 76 | +``` |
| 77 | +Build the remote store container at `/sample/ssh` and run it. |
| 78 | +``` |
| 79 | +sudo docker build -t ${SSH_SERVER_NAME}:v1 . |
| 80 | +sudo docker run --rm -d --network="bridge" \ |
| 81 | + --name ${SSH_SERVER_NAME} \ |
| 82 | + -v ${SSH_SERVER_STORE}:/store \ |
| 83 | + ${SSH_SERVER_NAME}:v1 |
| 84 | +SSH_SERVER_IP=$(sudo docker inspect ${SSH_SERVER_NAME} --format '{{.NetworkSettings.IPAddress}}') |
| 85 | +``` |
| 86 | +Then build the local cache container at `/sample/cache` and run it. |
| 87 | +``` |
| 88 | +sudo docker build -t ${LOCAL_CACHE_NAME}:v1 . |
| 89 | +sudo docker run --rm -d --name ${LOCAL_CACHE_NAME} \ |
| 90 | + -v ${LOCAL_CACHE_STORE}:/rootfs.castr \ |
| 91 | + ${LOCAL_CACHE_NAME}:v1 |
| 92 | +``` |
| 93 | + |
| 94 | +### Build the image converter as a container at `/` of this repo. |
| 95 | +``` |
| 96 | +sudo docker build -t mkimage:latest . |
| 97 | +``` |
| 98 | + |
| 99 | +### Convert images. |
| 100 | +``` |
| 101 | +sudo docker run -i -v /var/run/docker.sock:/var/run/docker.sock \ |
| 102 | + -v ${CONVERTER_OUTPUT_DIR}:/output \ |
| 103 | + mkimage:latest ubuntu:latest ubuntu-converted:latest |
| 104 | +``` |
| 105 | +Then, store the blobs in the remote store container's volume. |
| 106 | +``` |
| 107 | +sudo mv ${CONVERTER_OUTPUT_DIR}/rootfs.castr/* ${SSH_SERVER_STORE}/ |
| 108 | +``` |
| 109 | + |
| 110 | +### Run it. |
| 111 | +``` |
| 112 | +sudo docker run -it --privileged --device /dev/fuse \ |
| 113 | + --volumes-from ${LOCAL_CACHE_NAME} \ |
| 114 | + -e BLOB_STORE=ssh://root@${SSH_SERVER_IP}/store \ |
| 115 | + -e DROPBEAR_PASSWORD=root \ |
| 116 | + ubuntu-converted:latest |
| 117 | +``` |
| 118 | +You can share the local cache among containers by specifying the `--volumes-from ${LOCAL_CACHE_NAME}` runtime option. |
| 119 | + |
| 120 | +### Compare how many block-level blobs are actually pulled lazily. |
| 121 | +On boot, the number of cached blobs looks like the following. |
| 122 | +``` |
| 123 | +find ${LOCAL_CACHE_STORE} -name '*.cacnk' | wc -l |
| 124 | +96 |
| 125 | +find ${SSH_SERVER_STORE} -name '*.cacnk' | wc -l |
| 126 | +951 |
| 127 | +``` |
| 128 | +The number of blobs in ${LOCAL_CACHE_STORE} increases as the rootfs is accessed. |
| 129 | +After executing the `top` command inside the container, the number of cached blobs increases as shown below. |
| 130 | +``` |
| 131 | +find ${LOCAL_CACHE_STORE} -name '*.cacnk' | wc -l |
| 132 | +139 |
| 133 | +find ${SSH_SERVER_STORE} -name '*.cacnk' | wc -l |
| 134 | +951 |
| 135 | +``` |
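The comparison above can be wrapped in two small shell helpers. This is a minimal sketch: the function names `count_chunks` and `pulled_percent` are hypothetical and not part of this repo, and the percentage is truncated to an integer.

```shell
#!/bin/sh
# count_chunks DIR  (hypothetical helper): number of *.cacnk files under DIR.
count_chunks() {
    # Quote the pattern so the shell passes it to find unexpanded.
    find "$1" -type f -name '*.cacnk' | wc -l | tr -d ' '
}

# pulled_percent CACHED TOTAL  (hypothetical helper):
# integer percentage of chunks that are already in the local cache.
pulled_percent() {
    # Avoid dividing by zero when the remote store is empty.
    [ "$2" -gt 0 ] || { echo 0; return; }
    echo $(( $1 * 100 / $2 ))
}
```

It could be used as `pulled_percent "$(count_chunks "${LOCAL_CACHE_STORE}")" "$(count_chunks "${SSH_SERVER_STORE}")"`. With the counts shown above, 96 of 951 chunks gives 10 and 139 of 951 gives 14.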