[Request] adding auto mode to n-gpu-layers #3719
I know it might get ignored as an issue, but I wanted to make a feature request since I hate the balancing.
If you have problems with a particular value of ngl, try smaller values. Unless the model easily fits in VRAM, finding the optimal value for ngl is a non-trivial task and is usually done by trial and error. A formula that works well in most cases is probably not known, and I don't expect it to be found any time soon. One strategy that may work is to fill only 1/2 or 2/3 of VRAM with model parameters, leaving the rest for intermediate data and overhead. I personally use the CLBlast back-end, because its VRAM requirements are more predictable than those of hipBLAS (another back-end that works with my GPU).
I see. The sweet spot with an RTX 2060 mobile (6 GB of VRAM) and a 13 GB model is 18 layers out of 40. Still, after a long conversation (at 4k tokens) it gets a CUDA error and I have to reduce it to 16 layers for it to even generate, and I find that frustrating since I have to load the model all over again because I offload the model into RAM. But since you said …
I noticed that too with ROCm/hipBLAS back-end, which shares code with CUDA/cuBLAS back-end. It leaks or wastes VRAM. That's why I went back to OpenCL/CLBlast back-end for the time being.
I wrote a script to give me a table of the max model size my system can load for each model I have.
It's not error-free and it will require changes to suit your circumstances.
Any reason why a python equivalent of the above shell script
I'm interested in using a specific model (the 13B Q4_K_M Llama 2 chat) with GPU. This is not a complete solution, just a record of some experiments I did. I don't know llama.cpp and C++ very well.

The max memory requirement for the model is taken from https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF and used to calculate the memory per layer. There are probably better ways to get it with the llama.cpp classes instead, but I'm not interested in looking into that right now. Running with llama-cli displays some info, including that the model has 41 layers, so that's roughly 250 MB per layer after the max RAM required (10370 MB) is divided by 41. After some tests I felt this could be lower than the actual value, so I increased it to 270 to be on the safe side.

I'm playing with the Vulkan backend, modifying the main.cpp/llama-cli example, using commit 7ea8d80 on a Windows 10 system with 8 cores and a dedicated graphics card with 8 GB of VRAM.

First we include the ggml-vulkan header in main.cpp because it has the method ggml_backend_vk_get_device_memory that we need to get the total amount of VRAM:
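A minimal sketch (the include path is an assumption about where the backend header sits relative to main.cpp):

```cpp
// main.cpp: the Vulkan backend header declares ggml_backend_vk_get_device_memory
#include "ggml-vulkan.h"
```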
Then at the start of the main function we can optionally set some parameters: use_mmap set to false seems to help display the actual RAM usage in Task Manager, and LLAMA_SPLIT_MODE_NONE limits usage to a single GPU if there are multiple, I think. My system has a single GPU and it's outside the scope of this experiment to try to guess the behavior on configurations I don't own:
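Roughly like this, assuming the usual gpt_params fields in main.cpp at that commit:

```cpp
// after gpt_params_parse(...) has filled `params` in main()
params.use_mmap   = false;                  // show actual RAM usage in Task Manager
params.split_mode = LLAMA_SPLIT_MODE_NONE;  // keep everything on a single GPU
```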
Now let's call ggml_backend_vk_get_device_memory. We place the calls before llama_init_from_gpt_params is called.
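A sketch of the call (variable names are mine; both values come back in bytes):

```cpp
// query device 0 for free and total VRAM
size_t gpu_free_mem  = 0;
size_t gpu_total_mem = 0;
ggml_backend_vk_get_device_memory(0, &gpu_free_mem, &gpu_total_mem);
```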
It also returns a free value, but both are the same at the time of writing, so now we guess the number of layers to offload to the GPU:
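A rough sketch of that guess; the 270 MB per layer and the 1.2 GB reserve come from the explanation that follows:

```cpp
const double mem_per_layer_mb = 270.0;  // ~10370 MB / 41 layers, padded to be safe
double total_mb  = (double) gpu_total_mem / (1024.0 * 1024.0);
double usable_mb = total_mb - 1200.0;   // leave room for other processes using VRAM

if (total_mb >= 4096.0) {
    params.n_gpu_layers = (int) (usable_mb / mem_per_layer_mb);
}
```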
First gpu_total_mem is converted to MB and reduced by 1.2 GB because there may be other running processes that require VRAM. Then, if the GPU has at least 4 GB of VRAM, we divide the amount of memory we plan to use by the memory-per-layer value to determine how many layers we can offload. If not, it gets trickier, both in terms of calculating the number of layers and determining whether it would improve performance. I think a more precise solution may be needed in that case, and maybe CUDA is worth trying instead, since on my system I'm not sure I see any benefit from using Vulkan for that specific model with 4 GB of VRAM, except to reduce RAM usage by offloading layers to VRAM.

Bonus round for anybody who might be interested in getting the type of the GPU. We could insert a method that returns the GPU type as an int, with values listed in the VkPhysicalDeviceType enum (online reference at https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VkPhysicalDeviceType.html). Add this to ggml-vulkan.h:
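For example (the name ggml_backend_vk_get_device_type is made up for this sketch, it's not part of the existing API):

```cpp
// returns a value from the VkPhysicalDeviceType enum as a plain int
GGML_API int ggml_backend_vk_get_device_type(int device);
```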
And then ggml-vulkan.cpp:
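A possible implementation, modelled on how the backend's other device queries enumerate physical devices; the internals differ between commits, so treat this as a sketch:

```cpp
int ggml_backend_vk_get_device_type(int device) {
    // assumes the Vulkan instance has already been initialized by the backend
    std::vector<vk::PhysicalDevice> devices = vk_instance.instance.enumeratePhysicalDevices();
    vk::PhysicalDeviceProperties props;
    devices[device].getProperties(&props);
    return (int) props.deviceType; // see the VkPhysicalDeviceType enum
}
```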
Then we could call it in main.cpp like:
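For example, to apply the offload guess only on discrete GPUs (2 is the enum value of VK_PHYSICAL_DEVICE_TYPE_DISCRETE_GPU):

```cpp
int device_type = ggml_backend_vk_get_device_type(0);
if (device_type == 2 /* VK_PHYSICAL_DEVICE_TYPE_DISCRETE_GPU */) {
    // apply the n_gpu_layers guess from above
}
```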
This could then be used to create logic that limits the implementation only to configurations we're able to test, and leaves the rest to llama.cpp defaults.
@svetlyo81 Does your method use additional data like from the page that you linked?
@shibe2 Yes, please find the data below in case the link can't be opened. The physical device types which may be returned in VkPhysicalDeviceProperties::deviceType are:
- VK_PHYSICAL_DEVICE_TYPE_OTHER - the device does not match any other available types.
- VK_PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU - the device is typically one embedded in or tightly coupled with the host.
- VK_PHYSICAL_DEVICE_TYPE_DISCRETE_GPU - the device is typically a separate processor connected to the host via an interlink.
- VK_PHYSICAL_DEVICE_TYPE_VIRTUAL_GPU - the device is typically a virtual node in a virtualization environment.
- VK_PHYSICAL_DEVICE_TYPE_CPU - the device is typically running on the same processors as the host.

The physical device type is advertised for informational purposes only, and does not directly affect the operation of the system. However, the device type may correlate with other advertised properties or capabilities of the system, such as how many memory heaps there are.
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
Please provide a detailed written description of what you were trying to do, and what you expected `llama.cpp` to do.

Adding `--auto-devices` causes it to auto-detect n-gpu-layers and balance it, or at least adding `--gpu` does it.

Current Behavior
Please provide a detailed written description of what `llama.cpp` did, instead.

Instead, it just doesn't work by default, and you have to specify `--n-gpu-layers x`. After a while of chatting it gives a CUDA error with SillyTavern + webui, or you add too many layers and it becomes resource hungry and freezes the system 🤷

Environment and Context
Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.
$ lscpu