Feature Request: Allow disabling offload_op for backends by user #13241
Comments
I am not against adding an option to disable this, that would be good, but I wonder if the issue here is that these operations on models with a high number of experts simply should not be offloaded unless the batch size is much higher. If that's the case, this could be addressed by adding some heuristic to the offload_op check.
Added a common option (-mobs, --min-offload-batch-size) that allows the user to specify a minimum batch size, as well as disabling it completely with mobs=-1: hjc4869@5bc63bf

```
./build/bin/llama-bench -m ~/models/llama4-400b-q4_0.gguf -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 -ot 'exps=CPU' -mmp 0 -mobs -1,0 -n 0
```

If this option is considered appropriate I'll send a PR to upstream the code.
I don't think this implementation is good, it does not fit well in the design of the feature. A simple switch to completely disable offload_op would be better.
Got it, let me replace that with a simple switch instead for now.
Feature Description
llama.cpp currently uses a hardcoded minimum batch size of 32, and there is no option to disable offload_op unless the user manually specifies -ub 16 or less. It would be great if the user could disable offload_op without having to reduce -ub.
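For context, the check in question boils down to a batch-size threshold. The following is a simplified paraphrase, not the actual source: the function name here is illustrative, and the real check lives in the CUDA backend's offload_op callback, which also special-cases a few ops.

```cpp
#include "ggml.h"

// Paraphrase of the batch-size heuristic behind offload_op in the CUDA backend.
// An op is reported as worth offloading to the GPU only when its batch dimension
// reaches a hardcoded threshold; this issue asks to make that user-controllable.
static bool cuda_offload_op_heuristic(const struct ggml_tensor * op) {
    const int min_batch_size = 32;       // hardcoded on master
    return op->ne[1] >= min_batch_size;  // ne[1] ~ the batch/token dimension of the op's result
}
```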
Motivation
With the introduction of --override-tensor, it has become practical to offload experts to host DRAM in large MoE models while keeping the dense tensors on a GPU with relatively little VRAM. However, in the current implementation, prompt processing performance is not ideal in some configurations because offload_op is used.
For example, when running llama4 400B with -ot exps=CPU on code from the master branch, prompt processing performance is extremely poor when -ub is set to 512 (the default). Setting -ub to 16 bypasses offload_op in the CUDA backend, but the performance is still not on par with -ub 512 with offload_op disabled in the source code.

```
llama-bench -m ~/models/llama4-400b-q4_0.gguf -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 -ot 'exps=CPU' -mmp 0 -ub 16,512
```
With offload_op changed to always return false in the CUDA backend, there is a 10x performance boost:

```
./build/bin/llama-bench -m ~/models/llama4-400b-q4_0.gguf -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 -ot 'exps=CPU' -mmp 0
```
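(For reference, the "always return false" change used for that measurement is just a local hack along these lines, continuing the sketch above; it is shown only to make the experiment concrete, not as a proposed patch.)

```cpp
#include "ggml.h"

// Local experiment only: make the offload heuristic always decline, so ops on
// CPU-resident expert weights are not shipped to the GPU during prompt processing.
static bool cuda_offload_op_heuristic(const struct ggml_tensor * op) {
    (void) op;
    return false;  // previously: op->ne[1] >= 32
}
```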
Possible Implementation
In ggml-backend.cpp, add some additional options and checks around the offload_op call in the scheduler, as sketched below.
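A minimal sketch of the idea, assuming a user-settable flag plumbed into the scheduler. The flag name, helper function, and surrounding code are illustrative assumptions rather than the actual ggml scheduler internals; ggml_backend_supports_op and ggml_backend_offload_op are the existing public helpers.

```cpp
#include "ggml-backend.h"

// Hypothetical option carried alongside the scheduler state.
struct sched_offload_options {
    bool disable_offload_op;  // e.g. set from a new CLI switch such as --no-op-offload
};

// Pick a backend for an op: only consult the backends' offload_op heuristic
// when the user has not disabled operation offloading entirely.
static int pick_backend_for_op(const struct sched_offload_options * opts,
                               ggml_backend_t * backends, int n_backends,
                               const struct ggml_tensor * op, int cur_backend_id) {
    if (!opts->disable_offload_op) {
        for (int b = 0; b < n_backends; b++) {
            if (ggml_backend_supports_op(backends[b], op) &&
                ggml_backend_offload_op(backends[b], op)) {
                return b;  // offload to a higher-priority backend (typically the GPU)
            }
        }
    }
    return cur_backend_id;  // keep the op on the backend that holds its weights
}
```

This keeps the existing heuristic as the default while letting MoE setups with experts on the CPU opt out without shrinking -ub.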