Feature Request: tensor split needs control over where CPU layers go #13314
Comments
I like it :) Also, it would be useful if we could control the number of layers passed to each device, because right now, AFAIK, the split is even across devices. For example, we could use something like a per-device layer count. Probably it would not be too bad to extend the existing option for that. The exact syntax is under discussion.
Can't you use -ot?
Yes, it could be done, but it's very messy compared to letting the tensor split logic figure out where to store the data. It's trivial to add a --tensor-split-cpu-last flag to control that section of code I illustrated in my note. It's a little more work to add a layer sort prior to the tensor split, but I don't think it's a lot of work, and it could also be quite beneficial.
That can be done right now with --tensor-split. I am forced to use it with the hybrid quants to get a better-balanced waterfill on my 3 GPUs; the automatic split logic is not up to the task of handling the nonuniform layer sizes well. I think a preliminary sort of the layers by size might help with this, since smaller layers have lower granularity by definition, but I have not done a POC on that, just throwing the concept out there for discussion. --tensor-split-cpu-last is a trivial implementation, though, and I know it will work without any POC needed.
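For illustration, a manual split of this kind looks roughly like the following; the split ratios and -ngl value are made-up placeholders, and the real values depend on the quant's per-layer sizes:

llama-server -m Llama-4-Scout-17B-16E-Instruct.Q2_K_H.gguf -ngl 34 --tensor-split 0.30,0.35,0.35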
Great! Didn't know. But is there a way to describe the layer storage order? What if we want, for example, to split 48 layers as:
There should be a way to tell llama exactly how it should spread the layers, including such exotic cases:
Maybe a YAML config, for simplicity, as an addition to the CLI params?
The -ot option can handle that. For instance, a single override (an illustrative pattern is sketched after this comment) sends layers 26 to 49 to CPU. You can specify multiple -ot arguments to manually override where every layer goes, but this is extremely user-unfriendly. User friendly would be to not have to worry about -ot or --tensor-split at all and just have the automatic splitter do an optimal waterfill. It might make sense to be able to specify the order of devices in the auto split, which is now arbitrary (starting with CPU, going to RPC, and ending at the local GPU), something like --tensor-split-order RPC0,RPC1,CUDA0,CPU, and just let it waterfill according to the device memory capacities in the user-specified order. In my case, for the hybrid quants, all I really needed was the ability to load to CPU last instead of first, so --tensor-split-cpu-last is both simple and sufficient, and trivial to implement.
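A minimal sketch of such an -ot override, assuming the usual blk.N.* tensor naming (the exact regex may need adjusting per model, and the model file is a placeholder); it is intended to route layers 26 through 49 to the CPU backend:

llama-server -m model.gguf -ngl 99 -ot "blk\.(2[6-9]|3[0-9]|4[0-9])\.=CPU"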
Wow. Lucky for us, nowadays we can ask an LLM to invent such regexps 😬
Nothing to worry about, they are completely LLM-friendly.
Haha, good one :)
It would be nice if we could limit the RAM usage and/or the number of layers for each device from the main server's side, for cases where filling up the RAM completely is not good.
Yes, I don't think there is currently a way to override the maximum VRAM used, and that could be useful; --tensor-split just lets you set allocation ratios. I have been doing some tricky 3-GPU RPC setups with a speculator on the main GPU, and it turns into a nightmare three-body problem to solve every time I regenerate a hybrid quant with a different weight distribution per layer. Equivalently, it would be useful to have the ability to reserve some amount of memory on the GPU devices. This can currently be controlled coarsely by specifying a smaller NGL, but not on a fine-grained basis. For instance, if I know I want to offload a speculator onto the main GPU, I could ask to reserve space for the known size of that speculator's weights plus a budget for its KV, and the waterfill would then know it can't go above the max device size minus the reserve I gave it.
There's a q&d workaround: some ssh command, along the lines of the sketch below. BTW, the local GPU[s] is/are also usable in such a way. Not extremely beautiful, but as long as you keep it secret from anyone whose respect is important to you... who would dare to forbid it? 😁
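A hypothetical sketch of such an ssh workaround, assuming a build of rpc-server that supports the -m/--mem option for capping the advertised memory (host name, path, port, and size are placeholders):

# start a remote RPC worker that advertises only 8 GB, so the scheduler
# will not fill up that box's RAM completely
ssh user@rpcbox './llama.cpp/build/bin/rpc-server -H 0.0.0.0 -p 50052 -m 8192'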
Sounds like a great corollary to Murphy's law: if any problem has at least one solution, it also has an ugly solution.
Prerequisites
Feature Description
Provide a command argument switch that allows the user to select whether CPU layers are loaded first or last when assigning tensor splits.
Motivation
Background: I am testing out hybrid layer quants for GGUFs #13040. The basic idea is to use smaller quants at the early layers and bigger quants at the cortex layers. The goal is to reduce GGUF size to enable either offloading more layers to the GPU(s) or opening up space for bigger KV in the GPU(s), while maintaining high performance at the same time.
Problem: when using CPU + GPU, the tensor split is currently hardcoded to assign the CPU to the first layers in the GGUF. This is exactly the opposite of optimal for a GGUF where the first layers are smaller due to using smaller quants (Murphy's law wins again). By loading the early, smaller layers into the GPU instead, more layers can be offloaded. As an example, the Q2_K_H Llama Scout hybrid quant:
https://huggingface.co/steampunque/Llama-4-Scout-17B-16E-Instruct-GGUF/resolve/main/Llama-4-Scout-17B-16E-Instruct.Q2_K_H.gguf
will only offload 34 layers onto 3x 4070 in a local LAN RPC setup. The bigger cortex layers are being allocated to the GPUs, so there is less room for more layers or KV:
Proposal #1: a control flag to select whether to offload to CPU first or last. In llama-model.cpp, changing this one line will result in the last layers going to the CPU and the early layers going to the GPU devices:
// const int i_gpu_start = std::max((int) hparams.n_layer - n_gpu_layers, (int) 0);
const int i_gpu_start = 0;
This change allows offloading 38 layers instead of 34, or opens up more room in the GPUs for a bigger KV:
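A minimal sketch of how the control flag could gate that line, assuming a new boolean parameter (tensor_split_cpu_last below is hypothetical and does not exist in llama.cpp today):

// hypothetical flag: when set, GPU assignment starts at layer 0, so the early
// (smaller) layers land on the GPUs and the last layers fall back to the CPU
const int i_gpu_start = params.tensor_split_cpu_last
    ? 0
    : std::max((int) hparams.n_layer - n_gpu_layers, (int) 0);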
Proposal #2: first sort the layers by size, then apply the Proposal #1 control switch so that smaller layers go into the GPU first. With hybrid quants the layer size differences can be significant (2x or more). This could result in non-sequential layers going into a device, though, and I am not sure about the implications of that. A rough sketch of the sort idea follows below.
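This sketch assumes a hypothetical helper layer_size_bytes() that returns the total tensor size of a layer in bytes, and needs <vector>, <numeric>, and <algorithm>:

// hypothetical: build a layer order sorted by size so the smallest layers
// are considered first when assigning layers to GPU devices
std::vector<int> order(hparams.n_layer);
std::iota(order.begin(), order.end(), 0);
std::sort(order.begin(), order.end(), [&](int a, int b) {
    return layer_size_bytes(a) < layer_size_bytes(b); // smallest first
});
// the first n_gpu_layers entries of 'order' would then be placed on the GPUs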
Possible Implementation
--tensor-split-cpu-last: flag controlling whether the CPU layers are offloaded last instead of first.
--tensor-split-layer-sort: flag controlling whether the layers are first sorted by size before doing the tensor split.
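A hypothetical invocation combining the two proposed flags (neither flag exists yet; the model file and -ngl value are placeholders):

llama-server -m Llama-4-Scout-17B-16E-Instruct.Q2_K_H.gguf -ngl 38 --tensor-split-cpu-last --tensor-split-layer-sort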