Conversation


@kpouget commented Nov 7, 2025

Hello, I would like to discuss whether this work could be integrated into the llama.cpp codebase.

The API Remoting frontend and backend allow escaping the VM isolation with the help of virt-gpu paravirtualization (and the virglrenderer library on the host side).

  • ggml-remotingfrontend is a GGML API implementation that intercepts the GGML API calls and forwards them to the virt-gpu virtual device.
  • ggml-remotingbackend is a library loaded by virglrenderer (a PR will be opened soon for discussion), which opens a GGML library and forwards the calls received from virglrenderer.

The code is currently a PoC; I will refine it after the first round of feedback.

  • Some serialization functions have been borrowed from ggml-rpc. The overall idea is the same, but the transport layer is virtualization-aware, which helps limit buffer copies.
  • The supports_op method is implemented in a hacky way: I copied the ggml-metal definition into the frontend library, and I expose the few properties required to compute it from the ggml-metal backend. IIRC, this was only needed for the micro-benchmark to work correctly (ggml-rpc simply returns true to avoid this bottleneck).

Here is the context behind this PR:

[image]

@github-actions bot added the build, ggml, and Apple Metal labels on Nov 7, 2025
@rgerganov (Collaborator) commented

Very interesting work, thanks for sharing it!

Is it possible to get your PoC running on a Linux host with libkrun and KVM?

@kpouget (Author) commented Nov 10, 2025

> Is it possible to get your PoC running on a Linux host with libkrun and KVM?

Not yet: macOS has been the main target so far, but I'm now setting up a Linux environment where I can test this setup.
In theory, it should work out of the box. In practice... time will tell :)

The host side relies on virglrenderer, which had to be modified to work in-process for libkrun/macOS, whereas on Linux virglrenderer runs as a separate process. So I need to check that my code works well when triggered inside that separate process. Once confirmed, I'll open a PR on virglrenderer upstream and share the instructions to test the full stack on Linux.

For macOS, user-friendly instructions are detailed in the blog post, and I can share the steps to build from source on demand.
