Separating and Streamlining llama/llava binaries Suggestion #583

Open
zsogitbe opened this issue Mar 8, 2024 · 20 comments
Labels: stale (Stale issue will be autoclosed soon)

Comments

@zsogitbe
Contributor

zsogitbe commented Mar 8, 2024

I do not like the current strategy concerning the DLLs included in the 'runtimes' folders everywhere. This also makes the loading of the libraries very difficult.

Suggestions:

  • Create one solution which is responsible only for building the DLLs.
  • Simplify the library loading logic to load from the folder where the executable is (for example, net7.0).
  • Allow one llama and one llava DLL in the output folder (for example, net7.0), with no subdirectories and no per-platform copies.
  • Create one DLL which works with both CUDA and CPU (the CUDA version can be chosen in the solution that is responsible for the binaries).
@martindevans
Member

I don't think it's possible to have one single binary that works for everything.

For example, you need to compile the DLL with noavx/AVX/AVX2/AVX512 support and pick the right one depending on what your CPU supports. As far as I know, it's not possible to have a single llama.dll that supports all of those.
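
As an illustration of the kind of check involved, here is a minimal sketch (not LLamaSharp's actual selection logic, and with hypothetical file names) that uses .NET's hardware intrinsics API to pick the most capable CPU binary at runtime:

    using System.Runtime.Intrinsics.X86;

    // Sketch: choose the most capable llama binary the current CPU can run.
    // The file names are hypothetical; Avx512F.IsSupported requires .NET 8+.
    static string PickCpuBinary()
    {
        if (Avx512F.IsSupported) return "llama-avx512.dll";
        if (Avx2.IsSupported)    return "llama-avx2.dll";
        if (Avx.IsSupported)     return "llama-avx.dll";
        return "llama-noavx.dll";
    }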

Also, CUDA/not-CUDA are not the only two options! We have people interested in Vulkan, OpenCL, ZLUDA and various other backends. It seems likely llama.cpp itself will grow more backends over time, which we'll want to support as well.

@zsogitbe
Contributor Author

zsogitbe commented Mar 8, 2024

I think that we do not need one binary that works for everything. What we need is a solution that builds the one desired binary. In VS one can simply create a separate configuration for each setup, and the user can decide which configuration he/she needs. In this way we can hugely simplify the library.
For example, I use only one binary, for CUDA+AVX2 (CPU). I love simplicity and I do not like overcomplicating things!
The binary solution should be able to build any desired combination of setups, but LLamaSharp should only use one binary for llama and one for llava.

@martindevans
Member

Ah, I think I see what you mean: we (LLamaSharp) would still distribute multiple binaries, but your final build process would just select one binary to ship? That sounds possible.

I'm still not convinced it's a good idea though. For example, if you only shipped AVX2 binaries with your application, that would limit the machines your application can run on to only those that support AVX2, and you'd be artificially limiting the inference speed on machines running your app that support AVX512.

This is actually kind of how it used to work. LLamaSharp shipped one CPU binary (we always had to choose a middling AVX level, not so high that it excluded lots of machines but not so low that it slowed down new machines) and 2 CUDA binaries. Runtime selection of binaries has been a huge improvement.

@martindevans
Member

(Note that you can override the entire runtime selection process if you want to by using NativeLibraryConfig.WithLibrary(path))
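
For example, a minimal sketch (the exact overloads vary between LLamaSharp versions) that loads a llama.dll sitting directly next to the executable, which is roughly the layout suggested above:

    using System;
    using System.IO;
    using LLama.Native;

    // Sketch: bypass runtime detection and load one specific binary from the
    // folder containing the executable. This must run before any native
    // llama function is called.
    NativeLibraryConfig.Instance.WithLibrary(
        Path.Combine(AppContext.BaseDirectory, "llama.dll"));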

@zsogitbe
Contributor Author

zsogitbe commented Mar 8, 2024

I would not ship binaries at all. I would ship a solution which can build the desired binary simply by choosing a configuration. And, yes, LLamaSharp should only use binaries which are next to the exe (no complex runtime folders and DLLs with possible compatibility issues).

I have actually never used the binaries that you ship now...

@martindevans
Member

I don't really understand what you mean, sorry. Assuming we want to continue shipping something that can be easily installed through NuGet, how does that work with what you're suggesting?

@zsogitbe
Contributor Author

zsogitbe commented Mar 8, 2024

I personally do not use NuGet packages like this, because the binaries are usually outdated... and then I end up just recompiling the binaries myself. I would not include the binaries in the NuGet package; I would provide a solution with the appropriate setups linked to a specific llama.cpp version. The user can then easily compile the binary he/she wants.
Or someone can build the binaries with the above solution and distribute extra NuGet packages with the binaries only, but this would be continuous work to update the binaries all the time (like now).
I know that this is somewhat different thinking from what you are used to, and I do not expect you to immediately agree with my view.

@zsogitbe
Contributor Author

zsogitbe commented Mar 8, 2024

I have just checked: llama.cpp already has zip archives for each setup... I did not open them, but I guess those are the binaries.

@martindevans
Member

There is continuous work for each version, but compiling and distributing the binaries isn't a major part of that work!

llama.cpp is constantly making breaking changes without any semantic versioning that might indicate what's compatible with what. That means a version of LLamaSharp is only guaranteed to work with one specific version of llama.cpp. Here's the set of changes for the March binary update, for example. If you're using self-compiled binaries that aren't exactly the right version, you're just getting lucky that they happen to work. Given that, whatever we do, we're always going to be a bit behind in LLamaSharp.

@zsogitbe
Contributor Author

zsogitbe commented Mar 8, 2024

It is really a great job that you are doing with these updates! Thank you!

Here are the versioned and easy to refer to releases of llama.cpp: https://github.com/ggerganov/llama.cpp/releases

It would be easy to say that, for example, we are at 'b2363' with LLamaSharp, and that our code is compatible with the binaries and the source code in release b2363...
Are you sure you are not overcomplicating this?

@AsakusaRinne
Collaborator

AsakusaRinne commented Mar 8, 2024

@zsogitbe Are you suggesting something like the following?

  1. We remove all the backend packages, or keep just the most basic one.
  2. We provide an API to automatically download binaries, such as NativeLibrary.Download("cuda12", "avx2").
  3. After the download, the user could load the library with NativeLibraryConfig.Instance.WithLibrary.

I'm not sure whether your proposal is even more aggressive in step 2, i.e. to let users just compile the binaries themselves.

@martindevans
Member

It would be easy to say that, for example, we are at 'b2363' with LLamaSharp, and that our code is compatible with the binaries and the source code in release b2363...

That's basically what we do. The table here lists the exact llama.cpp version we're compatible with (as a commit hash instead of a release ID, but it's more or less the same thing).

@zsogitbe
Contributor Author

zsogitbe commented Mar 9, 2024

The current distribution, with endless binaries per platform and even per setup, is not usable for me. There are also regularly DLLs missing, for example for CUDA, because it cannot be easily compiled; I sense some confusion here as well. The loading is also overcomplicated because of the many possible runtime paths. My suggestion is to hugely simplify all of this.

There are many options that would satisfy everybody. One possible option is to distribute the binaries per platform and per setup in different NuGet packages, which can then simply be loaded into the solution in the exe folder (no endless runtime paths and no endless overhead of unneeded DLLs). The building of the binary NuGet packages could be automated based on the llama.cpp binaries, so that there is no extra work involved. But this is just one possible solution, to give you some ideas.

@AsakusaRinne
Collaborator

One possible option is to distribute the binaries per platform and per setup in different NuGet packages, which can then simply be loaded into the solution in the exe folder.

I wouldn't be against this if there weren't too many combinations! There are three dimensions we need to take into account: device (CPU, CUDA), operating system (Windows, Linux, macOS Metal, macOS Intel) and AVX support (none, AVX, AVX2, AVX512). In total there will be 26 backend packages we need to publish. If we introduce other backends such as OpenCL and ZLUDA in the future, this number will increase further.

no endless runtime paths and no endless overhead of unneeded DLLs

I have an idea: provide an API to automatically download the DLL the user wants. We put the compiled DLLs online and provide an API to download them.

    NativeLibrary.Load(cuda: true, avx: "avx2", autoDownload: true);

I haven't decided where to save these files yet, but I think that this way users like you won't be annoyed. If this approach is chosen, the backend NuGet package will not be required at all.
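
A very rough sketch of what such a helper might look like (entirely hypothetical: the URL, the file naming scheme and the cache location are invented placeholders, not an existing LLamaSharp feature):

    using System;
    using System.IO;
    using System.Net.Http;
    using System.Threading.Tasks;

    // Hypothetical auto-download helper: fetch the requested binary once and
    // cache it under the user's local application data folder.
    static async Task<string> DownloadBinaryAsync(string device, string avx)
    {
        string cacheDir = Path.Combine(
            Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData),
            "LLamaSharp", "binaries");
        Directory.CreateDirectory(cacheDir);

        string fileName = $"llama-{device}-{avx}.dll";   // invented naming scheme
        string target = Path.Combine(cacheDir, fileName);
        if (File.Exists(target))
            return target;                               // already cached

        using var http = new HttpClient();
        byte[] bytes = await http.GetByteArrayAsync(
            $"https://example.com/binaries/{fileName}"); // placeholder URL
        await File.WriteAllBytesAsync(target, bytes);
        return target;
    }

The returned path could then be handed to NativeLibraryConfig.Instance.WithLibrary, as in step 3 of the earlier proposal.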

@martindevans
Member

To be honest, I'm not really sure what problem it is you're trying to solve. Could you explain what problem you've actually encountered? For example, you said:

The current distribution, with endless binaries per platform and even per setup, is not usable for me.

What problem have you encountered with the current system? (There may be bugs that need fixing in it, of course, but that's not a reason to tear it all down!)

There are also regularly DLLs missing, for example for CUDA, because it cannot be easily compiled

In this case I'm not really sure what you mean - there haven't been any DLLs missing from any distribution package as far as I'm aware. Difficulty compiling isn't really an issue, since that's handled for you with the packaged binaries.

If you mean people asking about failures to load the distributed CUDA DLLs due to missing dependencies, that's because we don't currently distribute the CUDA runtime binaries. That means end users need to provide them themselves, which is an endless source of questions from people who haven't set that up correctly. We did decide on a fix for this a while ago, but no one has taken on the work to do it. The plan is to distribute NuGet packages with the relevant cudart binaries, which the LLamaSharp CUDA backend packages can then depend on (details of that here).

My Opinion

Coming at this from the opposite direction: in my opinion, the runtime detection and automatic loading of the correct backend has been a huge improvement in the usability of LLamaSharp, with regard to both ease of use and hardware compatibility.

If we switched to a system where you select exactly which backend you want (e.g. CPU with AVX2), the user now has to take extra complexities like SIMD support into account. So the combinatorial explosion AsakusaRinne mentioned becomes their problem instead of ours.

Even worse, it's not even generally possible to pick correctly before runtime, because you never know what your end user's machine is capable of. For example, if you were to build an application and then give it to me, I doubt you would have included AVX512 support (since it's a fairly rare feature for CPUs), and so it would be slower than necessary on my machine. The same applies to GPU support, of course: if you've included CUDA12 support but my hardware requires CUDA11, I'm completely out of luck. Whereas with runtime feature detection and loading, it could just fall back to CUDA11 when it detects that's what the hardware requires.
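
One way this kind of runtime fallback can be implemented in .NET (a sketch only, not LLamaSharp's actual detection code; the DLL names are the Windows cudart ones) is to try loading each candidate CUDA runtime in preference order:

    using System.Runtime.InteropServices;

    // Sketch: probe for the newest usable CUDA runtime by attempting to load
    // each candidate cudart library, newest first.
    static int DetectCudaMajorVersion()
    {
        if (NativeLibrary.TryLoad("cudart64_12.dll", out var h12))
        {
            NativeLibrary.Free(h12);
            return 12;
        }
        if (NativeLibrary.TryLoad("cudart64_110.dll", out var h11))
        {
            NativeLibrary.Free(h11);
            return 11;
        }
        return 0; // no CUDA runtime found - fall back to a CPU backend
    }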

Another thing to note is that the config system gives you complete control over what the runtime loader does. You can even not depend on a backend at all, supply your own binary, and tell the loader to load exactly that DLL.

A compromise?

I have come up with an idea that might work for you (I'm not sure, since I'm not exactly sure what it is you want, but maybe).

At the moment we distribute one single CPU backend that contains all of the different binaries. What if we split it up into many smaller packages, arranged in a hierarchy, so you can be as detailed as you like? e.g.

  • LLamaSharp.Backend.All
    • LLamaSharp.Backend.GPU
      • LLamaSharp.Backend.CUDA
        • LLamaSharp.Backend.CUDA11
          • LLamaSharp.Backend.CUDART11
        • LLamaSharp.Backend.CUDA12
          • LLamaSharp.Backend.CUDART12
      • LLamaSharp.Backend.Vulkan
    • LLamaSharp.Backend.CPU
      • LLamaSharp.Backend.CPU.noavx
      • LLamaSharp.Backend.CPU.AVX
      • LLamaSharp.Backend.CPU.AVX2
      • LLamaSharp.Backend.CPU.AVX512

(We would even split the leaf nodes up into Linux, Windows and macOS.)

In this setup, most users could just depend on the top-level node, and the runtime loading and detection will pick and choose however it likes. On the other hand, if you want more control, you could install exactly the leaf node packages you want (or none, and supply your own binaries).

@zsogitbe
Contributor Author

zsogitbe commented Mar 9, 2024 via email

@martindevans
Member

Definitely not a waste of time! It's great to get feedback on how things work; if we all agreed on everything we'd never identify shortcomings in the current implementation ;)

I never use precompiled dlls for security reasons

This has been a very big concern of mine as well while working on LLamaSharp. For example, it's why I often ask people to remove committed DLLs from PRs - I try to make sure every DLL we distribute can be traced back to a specific build run on GitHub (which means you can see the script that was run and be confident that's what actually happened). Not perfect, but better than random DLLs being committed with no "paper trail" at all!

Any ideas how that could be further improved from a security perspective?

@zsogitbe
Contributor Author

I think the core team, who have worked most on this project, should decide, with you leading. I can, however, tell you what I would do:

  • I would hugely simplify the way of working with the C++ binaries.
  • I would provide a CMake template with some kind of automation so that users can easily generate a solution file (for example, for VS on Windows) based on their needs. They would input some basic settings (CPU, GPU, operating system, ...) and the automation (a script?) would generate the solution file, or even compile the DLLs automatically (I prefer being able to click compile in the solution, because then I can change some things if needed - both possibilities could be provided).
  • The aim is that the user just opens the solution file, clicks compile, and the DLL is ready.
  • The LLamaSharp library should use all binaries from the target path (where the exe is), without any directory structure of unneeded and possibly dangerous DLLs.
  • This would probably need some simple tweaking on the C++ side, but no major modification is expected (just CMake).
  • In this way there are no precompiled DLLs and no security issues.

@zsogitbe
Contributor Author

Important read: microsoft/kernel-memory#266


github-actions bot commented May 6, 2025

This issue has been automatically marked as stale due to inactivity. If no further activity occurs, it will be closed in 7 days.

github-actions bot added the stale label May 6, 2025