
(Discussion) Improve usability of llama-server #13367

ngxson opened this issue May 7, 2025 · 20 comments

Comments

@ngxson
Collaborator

ngxson commented May 7, 2025

While working on #13365, I'm thinking about the use case where people can control llama-server completely via the web UI, including loading/unloading models and shutting down the server.

The reason I'm thinking about this is that I recently found myself going back to LM Studio quite often 😂. The llama.cpp server is good, but having to go back and forth between the web UI and the CLI is not always a pleasant experience.

Basically I'm thinking about 3 low-hanging fruits that could improve the situation:

Idea 1: allow loading/unloading models via API: in server.cpp, we can add a kind of "super" main() function that wraps around the current main(). The new main will spawn an "interim" HTTP server that exposes an API to load a model. Of course, this functionality will be restricted to local deployments to avoid any security issues.
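
For illustration, a minimal sketch of what that interim admin server could look like, assuming cpp-httplib (which llama-server already bundles); the /admin/load route and run_llama_server() are made up here, not existing llama.cpp symbols:

```cpp
// Hypothetical sketch only - not the actual llama-server code.
#include "httplib.h"   // cpp-httplib, already vendored by llama.cpp
#include <cstdio>
#include <string>

// stand-in for handing off to the existing server main(); not a real symbol
static int run_llama_server(const std::string & model_path) {
    printf("would now run the existing server main() with -m %s\n", model_path.c_str());
    return 0;
}

int main() {
    std::string model_path;
    httplib::Server admin;

    // bound to loopback only, so the load API is not reachable remotely
    admin.Post("/admin/load", [&](const httplib::Request & req, httplib::Response & res) {
        model_path = req.get_param_value("model"); // e.g. POST /admin/load?model=/models/qwen3.gguf
        res.set_content("loading " + model_path + "\n", "text/plain");
        admin.stop(); // leave the interim server and proceed to load the model
    });

    admin.listen("127.0.0.1", 8080);

    return model_path.empty() ? 0 : run_llama_server(model_path);
}
```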

Idea 2: add a -d, --detach flag to make the CLI go "headless", so the user can close the terminal and the server keeps running in the background. It should be trivial to do on Mac and Linux, but may require some effort on Windows. We can add an API to terminate the server process.
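
As a rough idea of the Mac/Linux half, here is a classic POSIX daemonization sketch (Windows would need a different mechanism, e.g. spawning a detached child process):

```cpp
// POSIX-only sketch of what -d/--detach could do; purely illustrative.
#include <cstdio>
#include <cstdlib>
#include <fcntl.h>
#include <unistd.h>

static void detach_from_terminal() {
    pid_t pid = fork();
    if (pid < 0) { perror("fork"); exit(1); }
    if (pid > 0) { exit(0); }          // parent exits, returning control to the shell

    setsid();                          // child becomes session leader, drops the controlling TTY

    // redirect stdio so closing the terminal cannot block or kill the server
    int fd = open("/dev/null", O_RDWR);
    dup2(fd, STDIN_FILENO);
    dup2(fd, STDOUT_FILENO);           // a real implementation would log to a file instead
    dup2(fd, STDERR_FILENO);
    if (fd > STDERR_FILENO) { close(fd); }
}

int main() {
    detach_from_terminal();
    // ... the server loop would run here, now detached from the terminal ...
    pause(); // keep the demo process alive until it receives a signal
    return 0;
}
```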

Idea 3: we can make a desktop shortcut that opens the web browser to the llama.cpp localhost page. Basically this will make llama.cpp feel like an "app" on the desktop, without spending too much effort on our side. This is a nice-to-have, but I'm just noting it here to see if anyone else has a better idea.

WDYT @ggerganov @slaren ?

Also tagging @cebtenzzre if you have any suggestions for the -d, --detach mode

@slaren
Member

slaren commented May 7, 2025

I think that would be great, it would help bring llama.cpp to more people. It is also good to exercise this kind of full-application functionality in our own code, because then things like handling errors gracefully become mandatory, and I think there are many cases where llama.cpp still crashes when opening a model or choosing the wrong parameters. I suspect that other applications like LM Studio handle this by spawning the llama.cpp process separately and restarting it if it crashes, but we should do better than that.

@pwilkin
Contributor

pwilkin commented May 7, 2025

I'm literally just writing a wrapper to swap llama.cpp configurations and make it emulate both LM Studio and Ollama (since IntelliJ Assistant only works with LM Studio and Copilot only works with Ollama :>).

@JohannesGaessler
Collaborator

One feature that I would have found useful for Elo HeLLM is if the server could more efficiently parallelize requests across multiple GPUs. What I ended up doing was just spawning multiple server processes that each received an equal share of the prompts. It would be nice if I could instead spawn a single server process which then automatically balances the load across the GPUs. This is something that maybe makes more sense to implement at the ggml backend level though.

@JohannesGaessler
Collaborator

More generally, loading/unloading models via an API would also be nice (but for my use case this presupposes that I can get by with only a single server process). Maybe something like --model-dir to determine the models that are available?

@kth8

kth8 commented May 8, 2025

I think dynamically loading models via API is already accomplished by llama-swap.

@ngxson
Collaborator Author

ngxson commented May 8, 2025

More generally, loading/unloading models via an API would also be nice (but for my use case this presupposes that I can get by with only a single server process). Maybe something like --model-dir to determine the models that are available?

The --model-dir approach can be useful if the model is contained in a single file, but in the case of vision models it may not be very practical.

But anyway, I already thought of this use case in #13202; basically, we will now have a registry-based model manager like docker/ollama, which means we can display a list of available models ready to be loaded.

And further in the future, we could also allow users to search for online models via the HF API (which can be done 100% in the web UI), much like how it's currently done in LM Studio.

@ggerganov
Member

A use case for llama.vscode is to be able to start multiple llama-server instances at the same time on different ports. For example, with the latest version we would need 3 separate servers:

  • FIM (e.g. Qwen 2.5 Coder)
  • Chat (e.g. Qwen 3)
  • Embeddings (e.g. BERT)

It might be complicated to support this use case, so it's probably low priority. But it's something to think about too.

Anyway, great ideas (also #13385) - let's improve the usability!

@ggerganov
Member

Idea 2: add a -d, --detach flag to make the CLI go "headless"

This seems like something that should be completely solved by the operating system (e.g. systemd on Ubuntu, LaunchAgents on macOS, Windows Services, etc.).

@ngxson
Collaborator Author

ngxson commented May 9, 2025

This seems like something that should be completely solved by the operating system (e.g. systemd on Ubuntu, LaunchAgents on macOS, Windows Services, etc.).

Hmm, yeah, this would require a way to install the binary as a system service though (which won't be a low-hanging fruit). I'm not entirely sure what the best way to do this on Windows and Mac is, so if someone is more experienced with this, please feel free to leave suggestions.

@rgerganov
Collaborator

A use case for llama.vscode is to be able to start multiple llama-server instances at the same time on different ports.

I think the admin API for loading models proposed here should also run on a dedicated network interface (localhost by default) and port. Having a clear separation of admin vs. user APIs will make deployment easier in some cases. Later, the admin API could be extended to start multiple instances of llama-server.
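
A toy sketch of that separation, again assuming cpp-httplib; the routes, ports, and bind addresses here are arbitrary examples, not a proposed final API:

```cpp
// Illustrative only: user API on the public interface, admin API on loopback.
#include "httplib.h"
#include <thread>

int main() {
    httplib::Server user_api;
    httplib::Server admin_api;

    user_api.Post("/v1/chat/completions", [](const httplib::Request &, httplib::Response & res) {
        res.set_content("{}", "application/json"); // placeholder for the existing chat handler
    });
    admin_api.Post("/admin/load", [](const httplib::Request &, httplib::Response & res) {
        res.set_content("ok\n", "text/plain");     // placeholder: load/unload/start instances
    });

    // admin API only reachable from the local machine, on its own port
    std::thread admin_thread([&] { admin_api.listen("127.0.0.1", 8081); });

    user_api.listen("0.0.0.0", 8080); // blocks; user-facing API on all interfaces
    admin_thread.join();
    return 0;
}
```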

@Dampfinchen

I think dynamically loading models via API is already accomplished by llama-swap.

Yeah, but making that accessible in the main project in a user-friendly way would be a huge deal, especially for Windows users who do not have Docker and the like installed.

@kth8

kth8 commented May 10, 2025

I think dynamically loading models via API is already accomplished by llama-swap.

Yeah, but making that accessible in the main project in a user-friendly way would be a huge deal, especially for Windows users who do not have Docker and the like installed.

You don't need Docker to use llama-swap; in fact, the author provides precompiled Windows binaries of it.

@ericcurtin
Collaborator

This seems like something that should be completely solved by the operating system (e.g. systemd on Ubuntu, LaunchAgents on macOS, Windows Services, etc.).

Hmm, yeah, this would require a way to install the binary as a system service though (which won't be a low-hanging fruit). I'm not entirely sure what the best way to do this on Windows and Mac is, so if someone is more experienced with this, please feel free to leave suggestions.

podman does this kind of thing; it uses the -d flag. Technically you don't really have to implement this; the user can always do things like nohup llama-server &.

What I will say is: try to be somewhat like Ollama if we can, then lots of tools like Open WebUI will "just work".

@ericcurtin
Collaborator

ericcurtin commented May 11, 2025

I think that would be great, it would help bring llama.cpp to more people. It is also good to exercise this kind of full-application functionality in our own code, because then things like handling errors gracefully become mandatory, and I think there are many cases where llama.cpp still crashes when opening a model or choosing the wrong parameters. I suspect that other applications like LM Studio handle this by spawning the llama.cpp process separately and restarting it if it crashes, but we should do better than that.

By the way, ramalama in the latest release uses vanilla llama-server for everything now; we stopped using llama-run. So if you have the llama.cpp binaries installed, you can just do:

ramalama --nocontainer run gemma3
ramalama --nocontainer serve gemma3
etc.

for basic testing.

@bandoti
Collaborator

bandoti commented May 11, 2025

I like Idea 2, having a headless/daemon mode. Apps have done this successfully for years (or, like Emacs, decades). Typically, solutions that force Docker impose a huge dependency on users, whereas a general solution that just works in the traditional sense is easier to adapt to containers. Emacs basically forks a new daemon process to run in the background, and when editing new files the daemon handles them. Granted, even though it sounds simple in theory, it could be time-consuming to get right.

Actually, Idea 1, having the admin API, would be cool too. There would be other hot-loading configuration possibilities as well! And in the future it need not be localhost-only if we use something like token-based authentication. 😊

@pwilkin
Contributor

pwilkin commented May 11, 2025

So, I recently made myself a runner + swapper + backend proxy (https://github.com/pwilkin/llama-runner), so I'll just go with a wishlist of what would be great for the server to have:

  • Model configuration presets; this is something Ollama lacks, e.g. with its default 2048 context size. Generally, the idea would be to either add presets in a separate config file or read them from the model dir (a file with a .config extension instead of .gguf?). Swapping itself doesn't give much if the user can't configure the optimal parameters for their setup.
  • Dream version: autoconfig with presets. So, you specify your available system specs (something like: use max 8.5 GB VRAM + 10 GB RAM, long contexts) and the loader calculates everything and loads the model based on the model data + your preferences. Say the model + compute buffers take 6 GB and the context takes 4 GB per 10k tokens; the loader then says "okay, the user wants long context, so 20k minimum, so I offload the entire model to VRAM and use no-KV-offload for the KV cache". (A rough sketch of this kind of calculation follows after this list.)
  • Emulate existing formats - this is somewhat of a long shot, because those formats could change at any moment, but it would be cool for the server to be able to say "expose Ollama-compatible endpoints" or "expose LMStudio-compatible endpoints". Of course, there would be limits to the compatibility, but I guess a lot of people use models in applications that do not provide full configurability for LLM providers. For example, I use GitHub Copilot in VS Code and AI Assistant in IntelliJ; Copilot only accepts Ollama as a local model backend, and AI Assistant only accepts either Ollama or LMStudio (this was my main rationale for writing the proxy runner).
  • Constrained multi-runner: needs some of the same logic as the "dream version" presets, but not as demanding. Basically the idea is: the user configures their system constraints, the runner checks how many resources are in use and how many are about to be used, and determines whether it can load the next model or whether it should replace an already running one. As in: "I have 24 GB VRAM and 20 GB of RAM and am already running one model that takes up 16 GB of VRAM; can I run a 5B version of this model with 20k of extra context?"
  • Add detailed memory usage to /health: tensors, compute buffers, KV cache
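
Here is a back-of-the-envelope version of that autoconfig calculation, using the numbers from the "dream version" item above; the estimates and the decision rule are invented for illustration, and only the -ngl / --no-kv-offload flag names are, to my knowledge, existing llama.cpp options:

```cpp
// Toy autoconfig estimate: all numbers and thresholds are illustrative.
#include <cstdio>

struct ModelEstimate {
    double weights_gib;          // model weights + compute buffers
    double kv_gib_per_10k_ctx;   // KV cache cost per 10k tokens of context
};

struct UserPrefs {
    double max_vram_gib;
    double max_ram_gib;
    int    min_ctx;              // e.g. 20000 for "long contexts"
};

int main() {
    const ModelEstimate m { 6.0, 4.0 };           // example figures from the comment above
    const UserPrefs     u { 8.5, 10.0, 20000 };

    const double kv_gib = m.kv_gib_per_10k_ctx * u.min_ctx / 10000.0;

    if (m.weights_gib + kv_gib <= u.max_vram_gib) {
        printf("everything fits in VRAM: use -ngl 99\n");
    } else if (m.weights_gib <= u.max_vram_gib && kv_gib <= u.max_ram_gib) {
        printf("offload weights, keep KV cache in RAM: use -ngl 99 --no-kv-offload\n");
    } else {
        printf("partial offload or a smaller context is needed\n");
    }
    return 0;
}
```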

I'm not competent enough to help with LLM engine coding stuff, but I would be willing to contribute on some of the above if accepted.

@ericcurtin
Collaborator

@pwilkin we need to implement something like https://github.com/pwilkin/llama-runner in python3 in ramalama, you could contribute similar code there

@pwilkin
Contributor

pwilkin commented May 11, 2025

@pwilkin we need to implement something like https://github.com/pwilkin/llama-runner in python3 in ramalama, you could contribute similar code there

I mean, my code is MIT-licensed and in Python; I would probably need to review it and clean it up from any vibe-coding niceties before submitting it anywhere serious, but feel free to poke around and tell me if you want help with a specific feature - I'll be happy to contribute.

@jukofyork
Collaborator

Dream version: autoconfig with presets. So, you specify your available system specs (something like: use max 8.5 GB VRAM + 10 GB RAM, long contexts) and the loader calculates everything and loads the model based on the model data + your preferences. Say the model + compute buffers take 6 GB and the context takes 4 GB per 10k tokens; the loader then says "okay, the user wants long context, so 20k minimum, so I offload the entire model to VRAM and use no-KV-offload for the KV cache".

I think instead of trying to calculate this, it would be better to have some kind of "calibration" mode that repeatedly loads and OOMs until it finds the best settings. Ollama tried to do this sort of calculation in the past and it never worked well/properly in practice (i.e. it was either far too conservative and left several GB of VRAM unused, or just OOMed when using oddball mixes of multiple GPUs).
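
A naive sketch of that calibration idea: keep lowering the layer count until the server starts successfully instead of predicting memory use. The --check-startup-only flag below is made up; a real version would need some way to exit right after allocation succeeds:

```cpp
// Illustrative calibration loop; --check-startup-only is a hypothetical flag.
#include <cstdio>
#include <cstdlib>
#include <string>

int main() {
    const std::string model = "/models/example.gguf"; // hypothetical path

    for (int ngl = 99; ngl >= 0; ngl -= 4) {
        const std::string cmd = "llama-server -m " + model +
                                " -ngl " + std::to_string(ngl) +
                                " --check-startup-only"; // hypothetical: allocate, then exit 0
        // a non-zero exit status is treated as an allocation failure (OOM)
        if (std::system(cmd.c_str()) == 0) {
            printf("calibrated: -ngl %d fits\n", ngl);
            return 0;
        }
    }
    printf("no layer count fits; the model is too large for this setup\n");
    return 1;
}
```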

@JohannesGaessler
Collaborator

What I would suggest is adding something like a "dry_run" parameter to the allocation code which, when set, doesn't actually allocate any memory but only sums up how much memory would be used. The amount of memory needed can then be compared to the amount that is actually available, and the number of GPU layers can be optimized via a simple binary search. The optimization of e.g. --tensor-split would be more complicated, but it is doable via standard numerical optimization techniques.
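
For concreteness, a sketch of that binary search, with estimate_vram_bytes() standing in for the allocation code run with a hypothetical dry_run flag (the toy cost model below is obviously not llama.cpp's real accounting):

```cpp
// Sketch: binary search for the largest -ngl whose estimated VRAM use fits.
#include <cstdint>
#include <cstdio>

// stand-in for running the allocation code with dry_run = true;
// toy cost model: 1 GiB base + 300 MiB per offloaded layer
static int64_t estimate_vram_bytes(int n_gpu_layers) {
    return (int64_t) 1024 * 1024 * 1024 + (int64_t) n_gpu_layers * 300 * 1024 * 1024;
}

// assumes estimated memory use grows monotonically with the number of layers
static int max_layers_that_fit(int n_layers_total, int64_t vram_free_bytes) {
    int lo = 0, hi = n_layers_total, best = 0;
    while (lo <= hi) {
        const int mid = lo + (hi - lo) / 2;
        if (estimate_vram_bytes(mid) <= vram_free_bytes) {
            best = mid;      // fits: try offloading more layers
            lo   = mid + 1;
        } else {
            hi   = mid - 1;  // does not fit: try fewer layers
        }
    }
    return best;
}

int main() {
    const int64_t vram_free = (int64_t) 8 * 1024 * 1024 * 1024; // pretend 8 GiB are free
    printf("-ngl %d\n", max_layers_that_fit(48, vram_free));    // prints -ngl 23 for this toy model
    return 0;
}
```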
