
Feature Request: tensor split needs control over where CPU layers go #13314

steampunque opened this issue May 5, 2025 · 12 comments

@steampunque

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Provide a command argument switch that allows the user to select whether CPU layers are loaded first or last when assigning tensor splits.

Motivation

Background: I am testing out hybrid layer quants for GGUFs (#13040). The basic idea is to use smaller quants for the early layers and bigger quants for the cortex layers. The goal is to reduce GGUF size, enabling either more layers to be offloaded to the GPU(s) or more room for a bigger KV cache in the GPU(s), while maintaining high performance.

Problem: when using CPU + GPU, the tensor split is currently hardcoded to assign the CPU the first layers in the GGUF. This is exactly the opposite of optimal for a GGUF whose first layers are smaller due to using smaller quants (Murphy's law wins again). By loading the early, smaller layers into the GPU instead, more layers can be offloaded. As an example, the Q2_K_H Llama 4 Scout hybrid quant:
https://huggingface.co/steampunque/Llama-4-Scout-17B-16E-Instruct-GGUF/resolve/main/Llama-4-Scout-17B-16E-Instruct.Q2_K_H.gguf
will only offload 34 layers onto 3x 4070s on a local LAN RPC setup. The bigger cortex layers are being allocated to the GPUs, so there is less room for more layers or KV:

load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device CPU, is_swa = 1
load_tensors: layer   1 assigned to device CPU, is_swa = 1
load_tensors: layer   2 assigned to device CPU, is_swa = 1
load_tensors: layer   3 assigned to device CPU, is_swa = 0
load_tensors: layer   4 assigned to device CPU, is_swa = 1
load_tensors: layer   5 assigned to device CPU, is_swa = 1
load_tensors: layer   6 assigned to device CPU, is_swa = 1
load_tensors: layer   7 assigned to device CPU, is_swa = 0
load_tensors: layer   8 assigned to device CPU, is_swa = 1
load_tensors: layer   9 assigned to device CPU, is_swa = 1
load_tensors: layer  10 assigned to device CPU, is_swa = 1
load_tensors: layer  11 assigned to device CPU, is_swa = 0
load_tensors: layer  12 assigned to device CPU, is_swa = 1
load_tensors: layer  13 assigned to device CPU, is_swa = 1
load_tensors: layer  14 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer  15 assigned to device RPC[192.9.200.5:50052], is_swa = 0
load_tensors: layer  16 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer  17 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer  18 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer  19 assigned to device RPC[192.9.200.5:50052], is_swa = 0
load_tensors: layer  20 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer  21 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer  22 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer  23 assigned to device RPC[192.9.200.5:50052], is_swa = 0
load_tensors: layer  24 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer  25 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer  26 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer  27 assigned to device RPC[192.9.200.4:50052], is_swa = 0
load_tensors: layer  28 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer  29 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer  30 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer  31 assigned to device RPC[192.9.200.4:50052], is_swa = 0
load_tensors: layer  32 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer  33 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer  34 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer  35 assigned to device RPC[192.9.200.4:50052], is_swa = 0
load_tensors: layer  36 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer  37 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer  38 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer  39 assigned to device CUDA0, is_swa = 0
load_tensors: layer  40 assigned to device CUDA0, is_swa = 1
load_tensors: layer  41 assigned to device CUDA0, is_swa = 1
load_tensors: layer  42 assigned to device CUDA0, is_swa = 1
load_tensors: layer  43 assigned to device CUDA0, is_swa = 0
load_tensors: layer  44 assigned to device CUDA0, is_swa = 1
load_tensors: layer  45 assigned to device CUDA0, is_swa = 1
load_tensors: layer  46 assigned to device CUDA0, is_swa = 1
load_tensors: layer  47 assigned to device CUDA0, is_swa = 0
load_tensors: layer  48 assigned to device CPU, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q3_K) (and 198 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
load_tensors: offloading 34 repeating layers to GPU
load_tensors: offloaded 34/49 layers to GPU

Proposal #1: a control flag to select whether to offload to the CPU first or last. In llama-model.cpp, changing this one line results in the last layers going to the CPU and the early layers going to the GPU devices:

// const int i_gpu_start = std::max((int) hparams.n_layer - n_gpu_layers, (int) 0);
const int i_gpu_start = 0;

This change allows 38 layers to be offloaded instead of 34, or opens up more room in the GPUs for a bigger KV cache:

load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer   1 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer   2 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer   3 assigned to device RPC[192.9.200.5:50052], is_swa = 0
load_tensors: layer   4 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer   5 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer   6 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer   7 assigned to device RPC[192.9.200.5:50052], is_swa = 0
load_tensors: layer   8 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer   9 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer  10 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer  11 assigned to device RPC[192.9.200.5:50052], is_swa = 0
load_tensors: layer  12 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer  13 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer  14 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer  15 assigned to device RPC[192.9.200.4:50052], is_swa = 0
load_tensors: layer  16 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer  17 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer  18 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer  19 assigned to device RPC[192.9.200.4:50052], is_swa = 0
load_tensors: layer  20 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer  21 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer  22 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer  23 assigned to device RPC[192.9.200.4:50052], is_swa = 0
load_tensors: layer  24 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer  25 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer  26 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer  27 assigned to device CUDA0, is_swa = 0
load_tensors: layer  28 assigned to device CUDA0, is_swa = 1
load_tensors: layer  29 assigned to device CUDA0, is_swa = 1
load_tensors: layer  30 assigned to device CUDA0, is_swa = 1
load_tensors: layer  31 assigned to device CUDA0, is_swa = 0
load_tensors: layer  32 assigned to device CUDA0, is_swa = 1
load_tensors: layer  33 assigned to device CUDA0, is_swa = 1
load_tensors: layer  34 assigned to device CUDA0, is_swa = 1
load_tensors: layer  35 assigned to device CUDA0, is_swa = 0
load_tensors: layer  36 assigned to device CUDA0, is_swa = 1
load_tensors: layer  37 assigned to device CUDA0, is_swa = 1
load_tensors: layer  38 assigned to device CPU, is_swa = 1
load_tensors: layer  39 assigned to device CPU, is_swa = 0
load_tensors: layer  40 assigned to device CPU, is_swa = 1
load_tensors: layer  41 assigned to device CPU, is_swa = 1
load_tensors: layer  42 assigned to device CPU, is_swa = 1
load_tensors: layer  43 assigned to device CPU, is_swa = 0
load_tensors: layer  44 assigned to device CPU, is_swa = 1
load_tensors: layer  45 assigned to device CPU, is_swa = 1
load_tensors: layer  46 assigned to device CPU, is_swa = 1
load_tensors: layer  47 assigned to device CPU, is_swa = 0
load_tensors: layer  48 assigned to device CPU, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q3_K) (and 142 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
load_tensors: offloading 38 repeating layers to GPU
load_tensors: offloaded 38/49 layers to GPU

Proposal #2: first sort the layers by size, then apply the Proposal #1 control switch so that the smaller layers go into the GPU first. With hybrid quants the layer size differences can be significant (2x or more). This could result in non-sequential layers going to a device, though, and I am not sure about the implications of that.
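
A rough, self-contained sketch of the sort-then-assign idea (the layer sizes, the single GPU budget, and the greedy assignment below are illustrative assumptions, not llama.cpp's loader code):

// Illustrative sketch of Proposal #2: estimate each repeating layer's size,
// then offload the smallest layers first so that more of them fit on the GPU,
// leaving the largest (cortex) layers on the CPU.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    // Hypothetical per-layer sizes in MiB (hybrid quant: early layers smaller).
    std::vector<int64_t> layer_size = {300, 320, 340, 600, 620, 640, 650, 660};
    const int64_t gpu_budget = 2000; // MiB of free VRAM across all GPUs

    // Sort layer indices by ascending size.
    std::vector<int> order(layer_size.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return layer_size[a] < layer_size[b]; });

    // Greedily assign the smallest layers to the GPU until the budget is used up.
    int64_t used = 0;
    for (int i : order) {
        const bool on_gpu = used + layer_size[i] <= gpu_budget;
        if (on_gpu) {
            used += layer_size[i];
        }
        std::printf("layer %2d (%4lld MiB) -> %s\n",
                    i, (long long) layer_size[i], on_gpu ? "GPU" : "CPU");
    }
    return 0;
}

Note that in this sketch the GPU ends up holding non-sequential layer indices, which is exactly the open question mentioned above.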

Possible Implementation

--tensor-split-cpu-last: controls whether the CPU layers are assigned last instead of first.
--tensor-split-layer-sort: controls whether the layers are first sorted by size before the tensor split is applied.
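
A minimal sketch of how --tensor-split-cpu-last could gate the i_gpu_start line quoted in Proposal #1. The flag name and the demo loop are hypothetical; only the std::max expression mirrors the existing llama-model.cpp behavior:

#include <algorithm>
#include <cstdio>

int main() {
    const int  n_layer         = 48;   // repeating layers in the model
    const int  n_gpu_layers    = 38;   // layers the user asked to offload
    const bool cpu_layers_last = true; // the proposed --tensor-split-cpu-last flag

    // Current behavior: the CPU takes the first (n_layer - n_gpu_layers) layers.
    // With the flag set, GPU assignment starts at layer 0 and the CPU takes
    // the trailing layers instead.
    const int i_gpu_start = cpu_layers_last ? 0 : std::max(n_layer - n_gpu_layers, 0);
    const int i_gpu_end   = i_gpu_start + n_gpu_layers;

    for (int il = 0; il < n_layer; ++il) {
        const bool on_gpu = il >= i_gpu_start && il < i_gpu_end;
        std::printf("layer %2d -> %s\n", il, on_gpu ? "GPU" : "CPU");
    }
    return 0;
}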

steampunque added the enhancement (New feature or request) label on May 5, 2025
@balaccord

I like it)

Also, it would be useful if we could control the number of layers assigned to each device, because right now, AFAIK, the split is even and equal to -ngl / the number of devices.

For example, we could use something like -nl $device1,$device2,... or -nl $device1 -nl $device2 -nl ...
The new option name is chosen so we can keep the good old -ngl for backward compatibility.

It probably would not be too bad to extend the -nl option to local resources such as CPUs, GPUs, and so on.
This is your idea, just in a more general form.

Something like: -nl cpu1,cpu2,...,gpu1,gpu2,...,rpc1,rpc2
Or: -nl-cpu 0:$ncpu0,1:$ncpu1 -nl-rpc 0:$nrpc0 ...

The exact syntax is under discussion

@Nepherpitou

Can't you use the --override-tensor flag to manually assign layers to a preferred device? More tricky, but it will solve your issue right now.

@steampunque
Author

Can't you use the --override-tensor flag to manually assign layers to a preferred device? More tricky, but it will solve your issue right now.

Yes, it could be done, but it's very messy compared to letting the tensor split logic figure out where to store the data. It's trivial to add a --tensor-split-cpu-last flag to control the section of code I illustrated in my note. It's a little more work to add a layer sort prior to the tensor split, but I don't think it's a lot of work, and it could also be quite beneficial.

@steampunque
Author

I like it)

Also, it would be useful if we could control the number of layers assigned to each device, because right now, AFAIK, the split is even and equal to -ngl / the number of devices.

That can be done right now with --tensor-split. I am forced to use that with the hybrid quants to get a better-balanced waterfill on my 3 GPUs; the automatic split logic is not up to the task of handling the nonuniform layer sizes well. I think a preliminary layer size sort might help here, since smaller layers have lower granularity by definition, but I have not done a POC on that; I am just throwing the concept out there for discussion. --tensor-split-cpu-last is a trivial implementation, though, and I know it will work without any POC needed.
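
For reference, a rough sketch of the proportional idea behind --tensor-split (illustrative only, not llama.cpp's actual assignment code; the ratios and layer count below are made up):

#include <cstdio>
#include <vector>

int main() {
    const int n_gpu_layers = 38;                   // layers to offload
    const std::vector<float> split = {10, 14, 10}; // e.g. --tensor-split 10,14,10

    float total = 0;
    for (float s : split) {
        total += s;
    }

    // Assign each offloaded layer to the first device whose cumulative share covers it.
    float cum = 0;
    int   dev = 0;
    std::vector<int> count(split.size(), 0);
    for (int il = 0; il < n_gpu_layers; ++il) {
        while (dev + 1 < (int) split.size() &&
               (float) (il + 1) / n_gpu_layers > (cum + split[dev]) / total) {
            cum += split[dev];
            ++dev;
        }
        ++count[dev];
    }
    for (size_t d = 0; d < split.size(); ++d) {
        std::printf("device %zu: %d layers\n", d, count[d]);
    }
    return 0;
}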

@balaccord

balaccord commented May 7, 2025

That can be done right now with --tensor-split.

Great! I didn't know that. But is there a way to describe the layer placement order?

What if we want, for example, to split 48 layers as:

  • GPU0: 30, RPC0: 10, CPU0: 999 (the rest)
  • CPU0: 6, RPC0: 12, GPU0: 999
  • ...

There should be a way to tell llama exactly how it should spread the layers, including exotic cases such as:

  • GPU0: 11, RPC0: 6, GPU0: 8, CPU0: 2, RPC0: 4, ...

Maybe a YAML config, for simplicity, as an addition to the CLI params?
Something like

layers:
    - gpu0: 1-6, 12-16
    - rpc0: 7-11, 17-30, 33-35
    - cpu0: 36-41
    - gpu1: *

@steampunque
Author

That can be done right now with --tensor-split.

There should be a way to tell llama exactly how it should spread the layers, including exotic cases such as:

* GPU0: 11, RPC0: 6, GPU0: 8, CPU0: 2, RPC0: 4, ...

The -ot flag can handle that. For instance,

OT='-ot blk\.(2[6-9]|[3-4][0-9]).*=CPU'

sends layers 26 to 49 to the CPU. You can specify multiple -ot arguments to manually override where every layer goes. But this is extremely user-unfriendly. User-friendly would be not having to worry about -ot or --tensor-split at all and just letting the automatic splitter do an optimal waterfill. It might also make sense to be able to specify the order of devices in the auto split, which is currently fixed (starting with the CPU, going through the RPC devices, and ending with the local GPU): something like --tensor-split-order RPC0,RPC1,CUDA0,CPU, and just let it waterfill according to the device memory capacities in the user-specified order. In my case, for the hybrid quants, all I really needed was the ability to load to the CPU last instead of first, so --tensor-split-cpu-last is simple, sufficient, and trivial to implement.
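
A sketch of what such a user-ordered waterfill could look like. --tensor-split-order does not exist today, and the device names, capacities, and layer sizes below are made up:

#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

struct device {
    std::string name;
    int64_t     free_mib; // memory budget left for model weights
};

int main() {
    // Hypothetical uniform layer sizes (MiB) and user-specified device order.
    std::vector<int64_t> layer_size(48, 450);
    std::vector<device> order = {
        {"RPC0", 9000}, {"RPC1", 9000}, {"CUDA0", 7000}, {"CPU", INT64_MAX},
    };

    size_t dev = 0;
    for (size_t il = 0; il < layer_size.size(); ++il) {
        while (dev + 1 < order.size() && order[dev].free_mib < layer_size[il]) {
            ++dev; // current device is full, move on to the next one in the list
        }
        order[dev].free_mib -= layer_size[il];
        std::printf("layer %2zu -> %s\n", il, order[dev].name.c_str());
    }
    return 0;
}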

@balaccord

The -ot flag can handle that.

Wow. Luckily, nowadays we can ask an LLM to invent such regexps 😬

extremely user-unfriendly

Nothing to worry about. They are completely LLM-friendly

@steampunque
Author

The -ot flag can handle that.

Wow. Luckily, nowadays we can ask an LLM to invent such regexps 😬

extremely user-unfriendly

Nothing to worry about. They are completely LLM-friendly

Haha good one :)

@balaccord

balaccord commented May 7, 2025

something like --tensor-split-order RPC0,RPC1,CUDA0,CPU and just let it waterfill accordingly

It would be nice if we could limit the RAM usage and/or the number of layers for each device from the main server's side, for cases where filling up the RAM completely is not good.

P.S.
Aha! --tensor-split-order RPC0:20G,RPC1:12G,CUDA0:22G,CPU

@steampunque
Author

steampunque commented May 7, 2025

something like --tensor-split-order RPC0,RPC1,CUDA0,CPU and just let it waterfill accordingly

It would be nice if we could limit the RAM usage and/or the number of layers for each device from the main server's side, for cases where filling up the RAM completely is not good.

P.S. Aha! --tensor-split-order RPC0:20G,RPC1:12G,CUDA0:22G,CPU

Yes, I don't think there is currently a way to override the max VRAM desired, and that could be useful; --tensor-split just lets you set allocation ratios. I have been doing some tricky 3-GPU RPC stuff with a speculator on the main GPU, and it turns into a nightmare three-body problem to solve every time I regenerate a hybrid quant with a different weight distribution across layers.

Equivalently, it would be useful to be able to reserve some amount of memory on the GPU devices. This can currently be controlled coarsely by specifying a smaller NGL, but not on a fine-grained basis. For instance, if I know I want to offload a speculator onto the main GPU, I could ask to reserve space for the known size of that speculator's weights plus a budget for its KV, and the waterfill would then know it can't go above the max device size minus the reserve I gave it.
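
A tiny sketch of that reserve arithmetic (hypothetical; no such option exists today, and the numbers are made up):

#include <cstdint>
#include <cstdio>

int main() {
    const int64_t cuda0_total_mib  = 12288; // e.g. a 12 GB main GPU
    const int64_t spec_weights_mib = 1200;  // known size of the speculator weights
    const int64_t spec_kv_mib      = 512;   // KV budget for the speculator

    // The waterfill would stop at the device total minus the user-supplied reserve.
    const int64_t main_model_budget = cuda0_total_mib - (spec_weights_mib + spec_kv_mib);
    std::printf("CUDA0 budget for the main model: %lld MiB\n", (long long) main_model_budget);
    return 0;
}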

@balaccord

balaccord commented May 7, 2025

I don't think there is currently a way to override the max VRAM desired

There's a quick-and-dirty workaround: an ssh command like killall rpc-server && sleep 3 && rpc-server -c $THIS_RPC_RAM_LIMIT

BTW, the local GPU(s) are also usable in that way. Not extremely beautiful, but as long as you keep it secret from anyone whose respect is important to you... who would dare to forbid it? 😁

@steampunque
Author

I don't think there is currently a way to override the max VRAM desired

There's a quick-and-dirty workaround: an ssh command like killall rpc-server && sleep 3 && rpc-server -c $THIS_RPC_RAM_LIMIT

BTW, the local GPU(s) are also usable in that way. Not extremely beautiful, but as long as you keep it secret from anyone whose respect is important to you... who would dare to forbid it? 😁

Sounds like a great corollary to Murphy's law: if a problem has at least one solution, it also has an ugly solution.
