[Devices] virtio vsock #650
Updating this issue with the current state of affairs, as discussed internally. The main focus is on maintaining the Firecracker security barrier, i.e. Firecracker must control all data exchanged between the guest and the host. Using the stock vsock-via-vhost mechanism would bypass this barrier, allowing a malicious guest to pass data directly to the host kernel. Currently I see two possible approaches to have both vsock and our security barrier:
1. Emulate the virtio-vsock device entirely inside the Firecracker VMM, and expose the host end of each connection as something other than AF_VSOCK (e.g. a unix socket);
2. Keep vhost as the host-side backend, but have the VMM itself process the guest vrings and sanitize every frame before passing it on.
I'm inclined towards the latter and I'll need to test it, since it's just a hypothesis at the moment. Will update here in a couple of weeks. |
An example of option 1 can be found in https://github.com/moby/hyperkit/blob/master/src/lib/pci_virtio_sock.c |
Update: internal discussion has settled on approach no. 1. The guest end would be an emulated AF_VSOCK socket, while the host end would be provided via an AF_LOCAL / AF_UNIX socket. This will require establishing some convention to support vsock ports, such as appending the port number to the host-end unix socket file path (see the sketch below). Also, from the host end, this will require at least one listener (one socket) per guest, since each Firecracker VMM is designed to be isolated and independent. If anybody tracking this issue has any input, please feel free to chime in. There are details not decided upon yet, so we'd appreciate any input into making this feature easier to integrate for Firecracker users. |
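For illustration only, here is a minimal sketch of what a host-side service could look like under the path-plus-port convention floated above. The socket path and port number are made up for the example, since the exact convention was still under discussion; the sketch uses only the Rust standard library.

```rust
use std::io::{Read, Write};
use std::os::unix::net::UnixListener;

fn main() -> std::io::Result<()> {
    // Hypothetical convention: the VMM connects to "<uds_path>_<port>"
    // whenever the guest connects to that vsock port (here, port 52).
    let listener = UnixListener::bind("/tmp/firecracker-v.sock_52")?;
    for stream in listener.incoming() {
        let mut stream = stream?;
        let mut buf = [0u8; 4096];
        // Echo bytes back to the guest; a real agent would speak its own protocol.
        loop {
            let n = stream.read(&mut buf)?;
            if n == 0 {
                break; // guest closed the connection
            }
            stream.write_all(&buf[..n])?;
        }
    }
    Ok(())
}
```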
@dhrgit I don't have a good grasp on potential performance issues related to this approach, but it'd be nice to include @stefanha in this discussion, since he's the vsock maintainer and I'm sure he could give some valuable input here. |
Approach #1 is comparable to port forwarding. It's fine if you control all applications and the protocols they speak. If you wish to support existing applications then you may be forced to make invasive changes to those applications: When communication involves exchanging network address information over an existing connection, the host application receiving AF_VSOCK address information would be confused because that address isn't directly usable with AF_LOCAL. Also, managing port forwarding complicates things if you wish to allow users to add their own applications but is easy with a fixed set of applications that you control. |
@sboeuf On the host side, though, the only way to support AF_VSOCK would be via vhost. Since the dominant opinion is to avoid adding vhost as a new dependency (and attack surface), we need to find another solution. The proposed approach (to use an AF_UNIX socket) is what hyperkit does and is linked to above by @rn. This does impose some limitations, since there's no perfect translation between AF_VSOCK and AF_UNIX. Both the host apps and the guest apps will need to be aware of these limitations, in order to establish a communication channel. The implementation details are not yet set, so my proposal is to use this issue here to discuss them and come up with a solution that makes sense to everyone. |
Is it one AF_UNIX socket per guest, or one AF_UNIX socket per AF_VSOCK port? I think either would work for firecracker-containerd, but we expect to use multiple ports and would need to handle multiplexing if there's only a single AF_UNIX socket per guest. |
@mcastelino and I talked about the approach you're about to take. From a Kata Containers perspective, it should be fine, as we can tweak the application. Now, from an implementation perspective, we were discussing two distinct use cases that you need to handle, depending on whether the application running inside the VM runs as:
- a server, listening on a vsock port for connections initiated from the host;
- a client, initiating connections to a server on the host.
Globally, you will proxy everything through the host-end AF_UNIX socket. Let me know if I'm missing something or if I actually misunderstood parts of the design. |
@mcastelino and I thought about a slightly different approach that would not require the wrapper translation between AF_VSOCK and AF_UNIX, relying instead on a new vhost_user kernel module exposing a tap-like file descriptor. This approach is very similar to the one you proposed, since it prevents the kernel from interacting directly with the virtqueues. It is also similar in the way that the kernel would still be the one passing data from the application in userspace to the VMM. The main benefit of this approach is that you could run unmodified applications, since they would still talk to an AF_VSOCK socket. The main drawback of this approach is that we would need to introduce a new kernel module, but long term, this could be reused by a bunch of applications that would simply encapsulate the data that could be shared across a cluster of VMs, for instance. Basically, this would make vsock more portable. Anyway, let me know what you think about this approach! /cc @stefanha |
@sboeuf A vhost_user transport is technically not necessary, since vhost_vsock already provides the same functionality to any userspace process. While it might not be obvious at first glance, vhost_vsock is not tied to virtualization. You don't need to expose vhost_vsock to a guest. vhost_vsock is just a way of claiming a CID and transferring vsock messages to/from userspace. The main difference between reusing vhost_vsock and implementing vhost_user is the userspace API: vhost ioctls + eventfd vs a tap-like file descriptor. Both of these drivers still have to parse vsock messages in order to hand them to net/vmw_vsock/af_vsock.c (there are control messages, not just payloads). The security argument becomes whether you believe parsing an equivalent header from a tap-like file descriptor is more secure than parsing the header from the vring. In other words, the vhost_user code could also have bugs that lead to a host kernel compromise. If you are willing to accept that risk, then I think it makes more sense to reuse vhost_vsock.ko without exposing it to the guest. This will save you a lot of time now and maintenance in the future. |
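For context on what that message parsing involves, below is the virtio-vsock packet header from the virtio spec (struct virtio_vsock_hdr), transcribed into Rust. Every message carries this header; control operations (REQUEST, RESPONSE, RST, SHUTDOWN, CREDIT_UPDATE, ...) and data payloads (RW) are all multiplexed through the op field.

```rust
/// The 44-byte header preceding every virtio-vsock message; all fields are
/// little-endian on the wire.
#[allow(dead_code)]
#[repr(C, packed)]
#[derive(Clone, Copy)]
struct VirtioVsockHdr {
    src_cid: u64,
    dst_cid: u64,
    src_port: u32,
    dst_port: u32,
    len: u32,       // length of the payload that follows this header
    type_: u16,     // socket type (stream)
    op: u16,        // operation: control messages as well as data (RW)
    flags: u32,     // operation-specific flags (e.g. shutdown direction)
    buf_alloc: u32, // receiver buffer size, for credit-based flow control
    fwd_cnt: u32,   // bytes consumed so far, for credit-based flow control
}

// Sanity-check the layout at compile time.
const _: () = assert!(std::mem::size_of::<VirtioVsockHdr>() == 44);
```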
I can see that, since the memory regions given to vhost_vsock are registered from userspace and don't have to be guest memory.
How would I register a CID with the driver from the VMM?
Well, I thought the point was that because you receive a vsock frame through the vring, and because the data has to be retrieved based on what is provided by the vring, it'd be better to make this happen from userspace. The virtio code in the VMM itself would reach into the virtqueue buffer pointed to by the descriptor table, and in case of a malicious guest (putting wrong addresses in the descriptor table), it would not be able to access some random memory on the host. |
The tun/tap driver uses an ioctl to register the interface. Something similar could be done for the vhost_user fd. But we should first discuss whether vhost_user is necessary.
Here is the vhost_vsock solution that is not exposed to the guest:
1. The VMM opens /dev/vhost-vsock and issues ioctls to set the guest's CID and a userspace memory region where messages will be placed.
2. The VMM emulates the virtio-vsock device. Virtqueue processing is done by the VMM. It takes messages from the virtqueue and sanity-checks them. It may also copy data buffers to/from guest memory if the goal is never to expose guest RAM to the host kernel.
3. The virtio-vsock device constructs new virtio-vsock messages in the userspace memory region registered with vhost-vsock and signals the kick eventfd.
4. The host kernel vhost_vsock module then processes those messages and communicates with AF_VSOCK sockets on the host.
5. Replies from host AF_VSOCK sockets come back in the reverse direction: vhost_vsock signals the notify eventfd which the VMM is monitoring, and the virtio-vsock device emulation code takes reply messages from the userspace memory region, places them into the vring in guest RAM, and notifies the guest.

This way the guest never directly interacts with vhost_vsock. vhost_vsock serves only as the API for communicating with the host kernel network stack. In this solution the guest vrings are processed by the VMM, not by vhost_vsock. vhost_vsock plays the same role as vhost_user: it never touches the guest's vrings.
We need input from folks who originally said they cannot use vhost_vsock for security reasons. There is a spectrum here from "I don't want the host kernel network stack involved at all, it's too risky and I only trust AF_UNIX" to "I just don't want the host kernel to process guest vrings". I'm not sure what the consensus on this is in the Firecracker community. |
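As a rough illustration of step 1 in the flow described above, here is a minimal sketch of claiming a guest CID through /dev/vhost-vsock. The ioctl request values are hand-computed from linux/vhost.h and the CID is an arbitrary example, so treat the details as assumptions rather than settled Firecracker code; the sketch requires the libc crate.

```rust
use std::fs::OpenOptions;
use std::os::unix::io::AsRawFd;

// Hand-computed from linux/vhost.h (assumptions, double-check before use):
const VHOST_SET_OWNER: libc::c_ulong = 0x0000_AF01; // _IO(0xAF, 0x01)
const VHOST_VSOCK_SET_GUEST_CID: libc::c_ulong = 0x4008_AF60; // _IOW(0xAF, 0x60, __u64)

fn main() -> std::io::Result<()> {
    // Step 1: open the vhost-vsock device and claim a CID for this guest.
    let vhost = OpenOptions::new()
        .read(true)
        .write(true)
        .open("/dev/vhost-vsock")?;
    let fd = vhost.as_raw_fd();
    let guest_cid: u64 = 3; // example CID; 0, 1 and 2 are reserved

    // SAFETY: plain ioctls on an fd we own; the kernel only reads guest_cid.
    unsafe {
        if libc::ioctl(fd, VHOST_SET_OWNER, 0) < 0
            || libc::ioctl(fd, VHOST_VSOCK_SET_GUEST_CID, &guest_cid) < 0
        {
            return Err(std::io::Error::last_os_error());
        }
    }

    // Next steps (not shown): register a userspace memory region with
    // VHOST_SET_MEM_TABLE, wire up the kick/notify eventfds, and start
    // shuttling sanitized virtio-vsock messages through that region.
    Ok(())
}
```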
This is quite an accurate description of the solution I've been advocating. I.e. reconstruct (and possibly sanitize) the virtio-vsock frames in the VMM userspace, and only have the VMM interact with vhost. The main arguments against it, as I understood them, were a) vhost would be an extra dependency on the host, and b) sanitization code would be too complex. Perhaps @rn would be better suited to go into more details on those arguments. |
Of course, agreed here.
Now I get it :)
Why does that matter since it's on the host?
Even with the AF_UNIX to AF_VSOCK solution, wouldn't you be sanitizing frames received from the vrings before injecting them into the kernel through the socket? Looking forward to @rn feedback :) |
@rn |
Apologies for the long delay in replying. I had a hectic schedule.

Our main concern with using vhost is that there would be direct interaction of the guest with a complex, in-kernel component (beyond MMIO). The guest could provide arbitrary inputs and try to exploit flaws in the in-kernel implementation. Hence the proposal of implementing the virtio vsock "backend" in Firecracker, a process which can be jailed. In my opinion, this is a very important property.

Now, once we do that, the next question is: how do we expose the vsocks to other processes? We could have the Firecracker vsock backend basically proxy vsock into the kernel. For that there seem to be two options: either feed stuff into the rings directly (after sanitizing), or proxy at a higher level, like byte streams. I think the sanitizing is tricky to get right, and you still risk the guest being able to more or less directly control the inputs into the kernel.

If you proxy at a higher level (like byte streams) you basically terminate and re-originate vsock connections, and this is not much different from proxying to an AF_UNIX socket. I would think that AF_UNIX is better understood and tested/hardened, and does not require the host kernel to be configured to include vsock support (less code). Hence the suggestion to go with AF_UNIX. |
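To make the "terminate and re-originate" point concrete, here is a toy sketch of byte-stream proxying: once the VMM terminates the guest's vsock connection, the host side reduces to copying bytes between two file descriptors. Both ends are AF_UNIX streams here purely for demonstration, and the proxy helper is hypothetical.

```rust
use std::io::copy;
use std::os::unix::net::UnixStream;
use std::thread;

/// Copy bytes in both directions until either peer closes. A production
/// proxy would also propagate half-closes via UnixStream::shutdown().
fn proxy(a: UnixStream, b: UnixStream) -> std::io::Result<()> {
    let (mut a_rd, mut b_wr) = (a.try_clone()?, b.try_clone()?);
    let fwd = thread::spawn(move || copy(&mut a_rd, &mut b_wr));
    let (mut b_rd, mut a_wr) = (b, a);
    copy(&mut b_rd, &mut a_wr)?; // b -> a on this thread
    fwd.join().expect("forwarding thread panicked")?; // a -> b
    Ok(())
}
```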
Thanks for the feedback @rn
I agree this looks a bit more secure, but the price to pay is the introduction of some kind of hybrid protocol that will need to be handled by the application on the host. This means the application will not be compatible with other hypervisors without specific modifications. Also, by bypassing the vsock kernel code like this, we are definitely not contributing to making it better long term.
What about the complexity of translating AF_VSOCK semantics to AF_UNIX? |
As mentioned in a previous comment, one way to consider this trade-off is whether you have a small, fixed number of vsock services you wish to run (then the AF_UNIX approach is fine) or whether vsock should be generally usable for user-defined purposes (then the AF_UNIX approach is impractical because it requires extensive modifications to Sockets API applications and some protocols may be unportable). Which use case do you have in mind? |
I believe we are looking at the first use case. I.e., we are adding vsock in order to enable container orchestrators to deploy and communicate with their agents inside the virtualized container / microvm. We haven't explored, in depth, the idea of enabling generic applications over vsock. Is there demand for that? |
@stefanha your question is spot on and probably something we should have made more clear initially. Indeed, we just want to support whatever minimum set of functionality works for container orchestrators (we've arrived here by taking in requirements from projects like Kata Containers and firecracker-containerd). I think a good next step will be to detail the actual proposal and get feedback from container folks that want to use Firecracker as a microVM runtime. |
By the way, if there's demand for the generic vsock use case later on, that functionality could live in shared code that other VMMs reuse as well. |
That's definitely what the end goal should be IMO.
Agreed, as part of this work will be reused/shared with rust-vmm. |
And I think this is the path we're on now: Firecracker keeps developing as a very narrow, focused, and optimized building block (as per our charter statement, "Our mission is to enable secure, multi-tenant, minimal-overhead execution of container and function workloads." ... and nothing more), while rust-vmm grows into a community project where the various crates provide really top-quality VMM functionality (and there's no problem if there is more than one way to do things). So, if Firecracker users end up wanting the full vsock functionality, rust-vmm would be the natural place for it. I guess a discussion for the near future, once @dhrgit comes up with his proposal. |
As long as the scope is clear for potential users, I can see how other VMMs could benefit from it, hence no reason not to have this in rust-vmm. |
Hey everyone - I've posted #1044 as an RFC on the proposed vsock design. |
We'll be happy to hear everyone's feedback & comments! |
The chosen solution has a great property: it has no dependency on the host kernel version. |
Update for visibility:
Current code available in my WiP branch: https://github.com/dhrgit/firecracker/tree/vsock-wip |
Added firecracker-experimental.yaml in api_server/swagger. This file is a copy of firecracker.yaml with an additional definition for the vsock API request. The point of this file is to be used by third-party projects (like Kata Containers) to automatically generate an API client that knows how to send requests to the Firecracker API. The definition currently lies in a different file and not in firecracker.yaml because vsock is considered an experimental feature. Once the production-ready vsock is merged, we will get rid of this experimental yaml. For tracking purposes, this is the issue regarding switching to vsock with UDS: firecracker-microvm#650 In the new proposed vsock implementation, the API request for configuring the vsock will also change. Fixes firecracker-microvm#1085 Signed-off-by: Andreea Florescu <[email protected]>
The PR is up: #1106 |
Closed by #1176. |
We want to have vsock support to enable container integration, but we don't want to use vhost, since that would be another attack surface that directly exposes the host kernel. Instead, we'll write another back-end for vsock. See #194.
Currently, virtio vsock exists in the codebase as an experimental, development-only Rust feature, with the vhost implementation.