Feature Request: Allow disabling offload_op for backends by user #13241
Comments
I am not against adding an option to disable this, that would be good, but I wonder if the issue here is that these operations on models with a high number of experts simply should not be offloaded unless the batch size is much higher. If that's the case, this could be addressed by adding some heuristic to the offload_op check.
Added a common option (-mobs, --min-offload-batch-size) that allows the user to specify a minimum batch size, as well as disabling it completely with mobs=-1: hjc4869@5bc63bf

```
./build/bin/llama-bench -m ~/models/llama4-400b-q4_0.gguf -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 -ot 'exps=CPU' -mmp 0 -mobs -1,0 -n 0
```

If this option is considered appropriate I'll send a PR to upstream the code.
I don't think this implementation is good, it does not fit well in the design of the feature. A simple switch to completely disable offload_op would be better.
Got it, let me replace that with a simple switch instead for now.
Feature Description
llama.cpp currently uses a hardcoded minimum batch size of 32, and there is no option to disable offload_op unless the user manually specifies -ub 16 or less. It would be great if the user could disable offload_op without having to reduce -ub.
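For context, the check in question boils down to a batch-size threshold. The following is a simplified paraphrase, not the actual source: the function name here is illustrative, and the real check lives in the CUDA backend's offload_op callback, which also special-cases a few ops.

```cpp
#include "ggml.h"

// Paraphrase of the batch-size heuristic behind offload_op in the CUDA backend.
// An op is reported as worth offloading to the GPU only when its batch dimension
// reaches a hardcoded threshold; this issue asks to make that user-controllable.
static bool cuda_offload_op_heuristic(const struct ggml_tensor * op) {
    const int min_batch_size = 32;       // hardcoded on master
    return op->ne[1] >= min_batch_size;  // ne[1] ~ the batch/token dimension of the op's result
}
```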
Motivation
With the introduction of --override-tensor, it has become practical to offload experts to host DRAM in large MoE models while keeping the dense tensors on a GPU with relatively little VRAM. However, in the current implementation, prompt processing performance is not ideal in some configurations because offload_op is used.
For example, when running llama4 400B with -ot exps=CPU on code from the master branch, prompt processing performance is extremely poor when -ub is set to 512 (the default). Setting -ub to 16 bypasses offload_op in the CUDA backend, but the performance is still not on par with -ub 512 with offload_op disabled in the source code.

```
llama-bench -m ~/models/llama4-400b-q4_0.gguf -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 -ot 'exps=CPU' -mmp 0 -ub 16,512
```
With offload_op changed to always return false in the CUDA backend, there is a 10x performance boost:

```
./build/bin/llama-bench -m ~/models/llama4-400b-q4_0.gguf -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 -ot 'exps=CPU' -mmp 0
```
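(For reference, the "always return false" change used for that measurement is just a local hack along these lines, continuing the sketch above; it is shown only to make the experiment concrete, not as a proposed patch.)

```cpp
#include "ggml.h"

// Local experiment only: make the offload heuristic always decline, so ops on
// CPU-resident expert weights are not shipped to the GPU during prompt processing.
static bool cuda_offload_op_heuristic(const struct ggml_tensor * op) {
    (void) op;
    return false;  // previously: op->ne[1] >= 32
}
```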
Possible Implementation
In ggml-backend.cpp, add some additional options and checks around the offload_op call in the scheduler, as sketched below.
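A minimal sketch of the idea, assuming a user-settable flag plumbed into the scheduler. The flag name, helper function, and surrounding code are illustrative assumptions rather than the actual ggml scheduler internals; ggml_backend_supports_op and ggml_backend_offload_op are the existing public helpers.

```cpp
#include "ggml-backend.h"

// Hypothetical option carried alongside the scheduler state.
struct sched_offload_options {
    bool disable_offload_op;  // e.g. set from a new CLI switch such as --no-op-offload
};

// Pick a backend for an op: only consult the backends' offload_op heuristic
// when the user has not disabled operation offloading entirely.
static int pick_backend_for_op(const struct sched_offload_options * opts,
                               ggml_backend_t * backends, int n_backends,
                               const struct ggml_tensor * op, int cur_backend_id) {
    if (!opts->disable_offload_op) {
        for (int b = 0; b < n_backends; b++) {
            if (ggml_backend_supports_op(backends[b], op) &&
                ggml_backend_offload_op(backends[b], op)) {
                return b;  // offload to a higher-priority backend (typically the GPU)
            }
        }
    }
    return cur_backend_id;  // keep the op on the backend that holds its weights
}
```

This keeps the existing heuristic as the default while letting MoE setups with experts on the CPU opt out without shrinking -ub.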