(Discussion) Improve usability of llama-server #13367
I think that would be great, it would help bring llama.cpp to more people. It is also good to exercise this kind of full-application functionality in our own code, because then things like handling errors gracefully become mandatory, and I think there are many cases where llama.cpp still crashes when opening a model or choosing the wrong parameters. I suspect that other applications like LM Studio handle this by spawning the llama.cpp process separately and restarting it if it crashes, but we should do better than that.
I'm literally just writing a wrapper to swap llama.cpp configurations and make it emulate both LM Studio and Ollama (since IntelliJ Assistant only works with LM Studio and Copilot only works with Ollama) :>
One feature that I would have found useful for Elo HeLLM is if the server could more efficiently parallelize requests across multiple GPUs. What I ended up doing was just spawning multiple server processes that each received an equal share of the prompts. It would be nice if I could instead spawn a single server process which then automatically balances the load across the GPUs. This is something that maybe makes more sense to implement at the ggml backend level though.
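A rough sketch of the workaround described above (one llama-server process per GPU, prompts split evenly between them), assuming CUDA devices; the model path, port range, and GPU count are placeholders:

```python
# Sketch of the current workaround: one llama-server process per GPU, each
# pinned to its device via CUDA_VISIBLE_DEVICES, with the prompt set split
# evenly between them. Model path, port range and GPU count are placeholders.
import os
import subprocess

MODEL = "/models/some-model.gguf"   # placeholder
N_GPUS = 4
BASE_PORT = 8080

procs = []
for gpu in range(N_GPUS):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    procs.append(subprocess.Popen(
        ["llama-server", "-m", MODEL, "--port", str(BASE_PORT + gpu)],
        env=env,
    ))

# A driver script then sends 1/N_GPUS of the prompts to each
# http://127.0.0.1:<port> instance and collects the results.
```

The ask in the comment is for a single server process to do this balancing internally instead.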
More generally, loading/unloading models via an API would also be nice (but for my use case this presupposes that I can get by with only a single server process). Maybe something like the sketch below.
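Purely as an illustration of the kind of endpoint being asked for - this is not an existing llama-server API - the request shape could look like this, with the port, paths, and parameter names all being assumptions:

```python
# Hypothetical load/unload endpoints -- not part of llama-server today, just an
# illustration of the request shape such an admin API could take.
import requests

ADMIN = "http://127.0.0.1:8081"  # assumed separate admin port

# ask the server to load a model (path is a placeholder)
requests.post(f"{ADMIN}/models/load",
              json={"model": "/models/some-model.gguf", "n_gpu_layers": 99})

# unload it again when done
requests.post(f"{ADMIN}/models/unload", json={"model": "some-model"})
```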
I think dynamically loading models via API is already accomplished by llama-swap.
But anyway, I already thought of this use case in #13202; basically we will now have a registry-based model manager like docker/ollama, which means we can display a list of available models ready to be loaded. And further in the future, we could also allow users to search for online models via the HF API (which can be done 100% in the web UI), much like how it's currently done in LM Studio.
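For the "search online models via the HF API" part, a rough idea of the lookup (the web UI would do the equivalent fetch client-side; the exact query parameters here are an assumption about the Hub API, not a verified contract):

```python
# Rough illustration of searching for GGUF models through the Hugging Face Hub
# API; the query parameters are an assumption, not a verified contract.
import requests

resp = requests.get(
    "https://huggingface.co/api/models",
    params={"search": "qwen", "filter": "gguf", "limit": 10},
    timeout=10,
)
for model in resp.json():
    print(model["id"])
```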
A use case to keep in mind: it might be complicated to support, so it's probably low priority, but it's something to think about too. Anyway, great ideas (also #13385) - let's improve the usability!
This seems like something that should be completely solved by the operating system.
Hmm yeah, this will require a way to install the binary as a system service though (which won't be low-hanging fruit). I'm not entirely sure what the best way to do this is on Windows and Mac, so if someone is more experienced with this, please feel free to leave suggestions.
I think the admin API for loading models which is proposed here should also run on a dedicated network interface.
Yeah, but making that accessible in the main project in a user-friendly way would be a huge deal, especially for Windows users who do not have Docker and the like installed.
You don't need Docker to use llama-swap; in fact, the author provides precompiled Windows binaries of it.
podman does this kind of thing. What I will say is: try to be somewhat like Ollama if we can, then lots of tools like Open WebUI will "just work".
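To make the "be somewhat like Ollama" point concrete: Ollama-aware clients such as Open WebUI typically discover models by calling Ollama's REST API on its default port, so a compatible server would need to answer those routes. The response shape below is an approximation, not a verified schema:

```python
# What Ollama-aware clients roughly do on startup: list models via /api/tags on
# Ollama's default port (11434). A "just works" llama-server would need to
# answer this (and the chat routes) in a compatible shape -- approximated here.
import requests

tags = requests.get("http://127.0.0.1:11434/api/tags", timeout=5).json()
print([m["name"] for m in tags.get("models", [])])
```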
By the way, ramalama in the latest release uses vanilla llama-server for everything now (we stopped using llama-run), so if you have the llama.cpp binaries installed, you can just run llama-server directly for basic testing.
I am liking Idea 2, to have a headless/daemon mode. Apps have done this successfully for years (or, like Emacs, decades). Typically, the solutions that force Docker impose a huge dependency on users, and a general solution that just works in the traditional sense is easier to adapt to containers. Emacs basically forks a new daemon process to run in the background, and when editing new files the daemon handles them. Granted, even though it sounds simple in theory, it could be time-consuming to get right. Actually, Idea 1, to have the ADMIN API, would also be cool. There would be other hot-loading configuration possibilities as well! It need not be localhost-only in the future either, if we use something like token-based authentication. 😊
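For reference, a minimal sketch of the classic Unix detach sequence that a headless mode would need on Linux/macOS (Windows would need a different mechanism entirely, e.g. running as a service):

```python
# Minimal sketch of the classic Unix daemonization dance a --detach mode would
# need on Linux/macOS: double fork + setsid so the process survives the
# terminal closing.
import os
import sys

def daemonize() -> None:
    if os.fork() > 0:
        sys.exit(0)          # first parent returns control to the shell
    os.setsid()              # become session leader, detach from the TTY
    if os.fork() > 0:
        sys.exit(0)          # second fork: can never re-acquire a controlling TTY
    devnull = os.open(os.devnull, os.O_RDWR)
    for fd in (0, 1, 2):     # stop writing to the (soon closed) terminal
        os.dup2(devnull, fd)

daemonize()
# ... run the actual server loop here ...
```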
So, I recently made myself a runner + swapper + backend proxy (https://github.com/pwilkin/llama-runner), so I'll just go with a wishlist of what would be great for the server to have:
I'm not competent enough to help with the LLM engine coding itself, but I would be willing to contribute to some of the above if accepted.
@pwilkin we need to implement something like https://github.com/pwilkin/llama-runner in python3 in ramalama; you could contribute similar code there.
I mean, my code is MIT-licensed and in Python; I would probably need to review it and clean it up from any vibe-coding niceties before submitting it anywhere serious, but feel free to poke around and tell me if you want help with a specific feature - I'll be happy to contribute.
I think instead of trying to calculate this, it would be better to have some kind of "calibration" mode that repeatedly loads and OOMs until it finds the best settings. Ollama tried to do this sort of calculation in the past and it never worked well/properly in practice (i.e. it was either far too conservative and left several GB of VRAM unused, or just OOMed when using oddball mixes of multiple GPUs).
What I would suggest is adding something like a "dry_run" parameter to the allocation code which, when set, doesn't actually allocate any memory but only sums up how much memory would be used. The amount of memory needed can then be compared to the amount that is actually available, and you can optimize the number of GPU layers via a simple binary search.
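A sketch of that binary search, assuming a hypothetical `dry_run_vram(n_gpu_layers)` helper that returns how many bytes the allocator *would* need without actually allocating (the proposed `dry_run` behaviour):

```python
# Binary search for the largest number of offloaded layers that still fits in
# VRAM. dry_run_vram() is the hypothetical "measure without allocating" hook
# described above; vram_free is the currently available device memory in bytes.
def max_offloadable_layers(n_layers_total, vram_free, dry_run_vram):
    lo, hi = 0, n_layers_total       # invariant: offloading `lo` layers fits
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if dry_run_vram(mid) <= vram_free:
            lo = mid                 # fits: try offloading more layers
        else:
            hi = mid - 1             # does not fit: back off
    return lo
```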
While working on #13365, I'm thinking about the use case where people can control `llama-server` completely via the web UI, including loading/unloading models and turning off the server.

The reason I'm thinking about this is that I recently found myself going back to LM Studio quite often 😂. The llama.cpp server is good, but having to go back and forth between web <> CLI is not always a pleasant experience.

Basically, I'm thinking about 3 low-hanging fruits that could improve the situation:

Idea 1: allow loading/unloading a model via API. In `server.cpp`, we can add a kind of "super" `main()` function that wraps around the current `main()`. The new main will spawn an "interim" HTTP server that exposes the API to load a model. Of course, this functionality will be restricted to local deployments to avoid any security issues.

Idea 2: add a `-d, --detach` flag to make the CLI go "headless", so the user can close the terminal and the server keeps running in the background. It should be trivial to do on Mac and Linux, but may require some effort on Windows. We can also add an API to terminate the server process.

Idea 3: we can make a desktop shortcut that opens the web browser on the llama.cpp `localhost` page. Basically this will make llama.cpp become an "app" on the desktop without spending too much effort on our side. This is nice-to-have, but I'm just noting it here to see if anyone else has a better idea.

WDYT @ggerganov @slaren ?

Also tagging @cebtenzzre in case you have any suggestions for the `-d, --detach` mode.
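A rough Python mock-up of the control flow Idea 1 describes - the real version would live inside `server.cpp` and wrap `main()` in-process - where the port, endpoint name, and flags are assumptions for illustration only:

```python
# Mock-up of Idea 1's control flow: an "interim" HTTP server listens on
# loopback for a load request, then starts (or restarts) the actual
# llama-server with the chosen model. Port, endpoint name and flags are
# assumptions; the real thing would wrap main() inside server.cpp.
import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

current = None  # handle of the running llama-server process, if any

class InterimServer(BaseHTTPRequestHandler):
    def do_POST(self):
        global current
        if self.path != "/load":
            self.send_response(404)
            self.end_headers()
            return
        length = int(self.headers.get("Content-Length", 0))
        model = json.loads(self.rfile.read(length) or b"{}").get("model")
        if current is not None:       # replace any previously loaded model
            current.terminate()
            current.wait()
        current = subprocess.Popen(["llama-server", "-m", model, "--port", "8080"])
        self.send_response(200)
        self.end_headers()

# Bound to loopback only, matching the "local deployment only" restriction.
HTTPServer(("127.0.0.1", 8081), InterimServer).serve_forever()
```

A web UI could then POST `{"model": "..."}` to this admin port to switch models without touching the CLI.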