Conversation


@kpouget commented Nov 7, 2025

Hello, I would like to discuss whether this work could be integrated into the llama.cpp codebase.

The API Remoting frontend and backend allow escaping the VM isolation with the help of virt-gpu paravirtualization (and the virglrenderer library on the host side).

  • ggml-remotingfrontend is a GGML API implementation that intercepts the GGML API calls and forwards them to the virt-gpu virtual device.
  • ggml-remotingbackend is a library loaded by virglrenderer (a PR will be opened soon for discussion), which opens a GGML library and forwards the calls received from virglrenderer.

The code is currently a PoC; I will refine it after the first round of feedback.

  • Some serialization functions have been borrowed from ggml-rpc. The overall idea is the same, but the transport layer is virtualization-aware, which helps limit buffer copies.
  • The supports_op method is implemented in a hacky way: I copied the ggml-metal definition into the frontend library, and I expose the few properties required to compute it from the ggml-metal backend. IIRC, this was only needed for the micro-benchmark to work correctly (ggml-rpc simply returns true to avoid this bottleneck).

Here is the context behind this PR:

[image]

@github-actions bot added the build, ggml, and Apple Metal labels on Nov 7, 2025
@rgerganov (Collaborator) commented

Very interesting work, thanks for sharing it!

Is it possible to get your PoC running on a Linux host with libkrun and KVM?

@kpouget (Author) commented Nov 10, 2025

> Is it possible to get your PoC running on a Linux host with libkrun and KVM?

Not yet: macOS has been the main target so far, but I'm now setting up a Linux environment where I can test this setup.
In theory, it should work out of the box. In practice... time will tell :)

The host side relies on virglrenderer, which had to be modified to work in-process for libkrun/macOS, whereas on Linux virglrenderer runs as a separate process. So I need to check that my code works well when triggered inside that separate process. Once confirmed, I'll open a PR on virglrenderer upstream and share the instructions to test the full stack on Linux.

For macOS, user-friendly instructions are detailed in the blog post, and I can share the steps to build from source on demand.
