Increase CPU & GPU execution overlap of GGML Backends through the new Graph Plan API #14514

mtavenrath · 2025-07-03T09:23:16Z

mtavenrath
Jul 3, 2025

Introduction:

Currently, the computational graph execution in some GGML backends results in high CPU cost, leading to prolonged GPU idle times. Optimizing inference across diverse hardware platforms necessitates addressing this core inefficiency.

We're primarily looking at two distinct paradigms for backend execution: immediate-mode backends like CUDA without graphs, and command-list-based backends such as Vulkan or CUDA with graphs.

Immediate-mode backends typically process the node list and execute work as they go. Kernel launch overheads are incurred for each kernel. Command-list-based backends, however, must build the entire command list before execution. This can introduce a few milliseconds of delay, during which your GPU sits idle, hurting overall performance.

Here's how these timelines typically look:

Immediate Mode Execution Timeline:

GPU: -12345
CPU: 12345-

Immediate Mode Execution Timeline:

GPU: -----12345
CPU: 12345-----

To minimize this GPU idle time, a common strategy is to split the commands into multiple, smaller command lists (or batches) and flush each of these lists as soon as they are ready. This allows for a more continuous stream of work to the GPU, combining the performance benefits of command lists (e.g., fewer kernel launches, better optimization opportunities for each batch) with the low-latency advantages of immediate-mode execution.

Flushing Command-List Mode Execution Timeline

Flushing every 2 commands
GPU: --12345
CPU: 12345--

This technique has already proven successful. For instance, it was implemented in the Vulkan backend via PR 9118, leading to a 10% improvement in token generation performance for StableLM 3B Q8_0 on high-end CPUs (like the Threadripper Pro 7975WX). More recently, PR 11867 for the CUDA backend showed a 2.5% to 7% improvement for Phi3-mini-4k-instrukt 3B on high-end CPUs (Threadripper Pro 5955WX and Intel 14900K). We expect even greater benefits on lower-specced CPUs.

Improving with the Graph Plan API

With a Graph Plan API, outlined in the ggml-backend-impl.h header, one could reduce CPU overhead even more by having an interface which can give more guarantees about graph changes and also allows the developer to deliver more information to the backend about graph changes. In total this will enable smart optimizations on the backend side.
Here’s the current API structure:

// (optional) graph plans (not used currently)

// compute graph with a plan
ggml_backend_graph_plan_t (*graph_plan_create) (ggml_backend_t backend, const struct ggml_cgraph * cgraph);

void (*graph_plan_free) (ggml_backend_t backend, ggml_backend_graph_plan_t plan);

// update the plan with a new graph - this should be faster than creating a new plan when the graph has the same topology
void (*graph_plan_update) (ggml_backend_t backend, ggml_backend_graph_plan_t plan, const struct ggml_cgraph * cgraph);

// compute the graph with the plan
enum ggml_status (*graph_plan_compute)(ggml_backend_t backend, ggml_backend_graph_plan_t plan);

While this API has significant potential, its current design for truly asynchronous update and compute functions would require double buffering of internal graph/command buffer data structures. This raises several concerns:

Tensor Ownership: Who owns the tensors passed to create/update?
Tensor Size Changes: How do we handle changes in tensor sizes? We can't simply resize existing tensors if they're still in use by a previous compute call. Duplicating tensors would consume more memory, and if allocations become necessary during compute execution, everything might become synchronous anyway due to potential stalls.
Backend Internal Data Structure Duplication: Backends with real graph-based interfaces might have huge internal data structures that would need to be duplicated. Resource sharing between these duplicated structures could also become problematic.

Unless a solid solution emerges for these asynchronous issues, my recommendation is to keep the Graph Plan API synchronous.
Even with a synchronous API, having a way to capture a ggml_cgraph can be very beneficial if designed to convey crucial information to the backend, allowing it to skip unnecessary work. We should consider at least two types of graph changes:

Topological changes: When nodes are added or removed.
Tensor size changes: This is a more nuanced category. Some tensor shapes dynamically depend on input sizes (e.g., sequence length, batch size). However, not all tensor shapes in a graph are necessarily static or directly dependent on the primary input dimensions; some might change based on internal model logic, padding, or intermediate computations.

A crucial optimization when the topology is static is that the backend can identify and internally store only the graph nodes it actively uses. This allows it to avoid traversing any nodes irrelevant to its operations, significantly reducing CPU work, cache misses, and branch mispredictions. This also enables the creation of highly optimized, backend-specific data structures for each relevant graph node.
One perspective on a graph plan could be that its topology is always static, and only tensor shape sizes can change. This would open the door to advanced optimizations like operator fusion, further boosting performance.
To make this more flexible, we could add flags to specify the types of changes allowed in the graph:

TOPOLOGY_CHANGES Allows any topological changes (node additions/removals). If this is permitted, a backend would always have to traverse the full graph passed to compute before executing it, much like how CUDA graphs currently operate, as the structure itself is unpredictable. This represents the least optimized scenario.
TENSOR_CHANGES Allows changing the specific tensors attached to operations. In this scenario, we can assume the topology of the graph is static, allowing the backend to pre-create a list of operators in execution order and apply the static-topology optimizations mentioned above. H
TENSOR_SHAPES Guarantees that only the shape sizes of existing tensors change. This specifically means the dimensions of tensors might vary (e.g., from [1, 512] to [1, 1024]) while the number of dimensions and the overall graph structure remain constant. This is a common scenario in models with dynamic sequence lengths. If this guarantee can be made, it would allow us to skip the entire graph tensor traversal, simply fetching new tensor sizes from a list of tensors used by the backend. This provides the highest level of optimization, as both topology and tensor pointers are static.

With these additions to the graph properties, the compute function could also receive additional flags indicating which of the allowed changes actually occurred. This wouldn't just enable the optimizations from above; it would also allow us to skip as much work as possible, saving CPU power.

Here’s how the enhanced API could look:

// Enum to specify allowed graph changes for a backend plan.
// These flags inform the backend about the expected dynamism of the graph,
// enabling it to apply appropriate optimizations.
typedef enum {
    GGML_GRAPH_CHANGES_NONE         = 0,      // Guarantees static topology, static tensor pointers, and static tensor shapes.
                                              // Enables maximum backend optimization (e.g., full graph capture).
    GGML_GRAPH_CHANGES_TENSOR_SHAPES = (1 << 0), // Guarantees static topology and static tensor pointers.
                                              // Only tensor *shape sizes* can change for existing tensors.
                                              // Backend can optimize by only updating shape information.
    GGML_GRAPH_CHANGES_TENSORS      = (1 << 1), // Guarantees static topology.
                                              // Tensor *pointers* attached to operations can change, as can their shapes.
                                              // Backend can optimize execution order, but must fetch tensor info per-execution.
    GGML_GRAPH_CHANGES_TOPOLOGY     = (1 << 2), // Allows arbitrary topology changes (nodes added/removed).
                                              // Represents the least optimized scenario, requiring full graph traversal.
} ggml_graph_changes;

// Creates a backend graph plan for the given compute graph.
// The 'allowed_changes' flag informs the backend about the maximum
// dynamism expected for this plan instance, guiding its internal
// optimization strategy.
ggml_backend_graph_plan_t (*graph_plan_create) (ggml_backend_t backend, const struct ggml_cgraph * cgraph, ggml_graph_changes allowed_changes);

// Frees a previously created backend graph plan.
void (*graph_plan_free) (ggml_backend_t backend, ggml_backend_graph_plan_t plan);

// Computes the graph using the provided plan.
// 'actual_changes' informs the backend about what *actually* changed
// since the last compute call, allowing for minimal work execution.
enum ggml_status (*graph_plan_compute)(ggml_backend_t backend, ggml_backend_graph_plan_t plan, ggml_graph_changes actual_changes);

By passing allowed_changes during plan creation and actual_changes during computation, we can provide the backend with precise information, enabling it to execute only the necessary work and significantly improve overall efficiency.

slaren · 2025-07-03T11:36:36Z

slaren
Jul 3, 2025
Maintainer

We can add a function that does an update and compute at the same time. This would allow the backend to split the CUDA graph in as many parts as it wants, to reduce the time that the GPU is idle while the graph is being captured. This can help reduce power and memory usage, but for maximum performance, I expect that using two graphs and updating one asynchronously while the other one is executing would provide the best results. With @ggerganov work in #14482, I expect that we will be able to avoid most graph updates, so in the typical case we can call graph_plan_compute repeatedly, which would solve most of these concerns.

2 replies

ggerganov Jul 4, 2025
Maintainer

I expect that we will be able to avoid most graph updates, so in the typical case we can call graph_plan_compute repeatedly, which would solve most of these concerns.

Yes, that's my understanding too. With n_kv padding of 256 we would reuse typically during text generation the same graph (both ggml and CUDA) all the time and would have to construct a new one only once every 256th token.

What is not completely clear to me still is when the 256th token occurs and the n_kv jumps, what is our best strategy. I thought that the overhead would be negligible because this happens very rarely, but I think @mtavenrath suggested that it's not the case and we want to do something to mitigate even this overhead.

For the ggml graph specifically in this case, I think we can easily construct it asynchronously, as if we assumed that n_kv has already jumped, before it actually happens, so that we have the graph ready beforehand. But I am not sure what would be the mechanism to apply the tensor shape changes of this new graph to the CUDA graph. I think this is where the graph_plan_update() has to be used, but I am not completely seeing it.

And we also need to call ggml_backend_sched_reset() + ggml_backend_sched_alloc_graph() inevitably when a shape changes, correct?

slaren Jul 4, 2025
Maintainer

But I am not sure what would be the mechanism to apply the tensor shape changes of this new graph to the CUDA graph. I think this is where the graph_plan_update() has to be used, but I am not completely seeing it.

I don't think there is anything to do here other than call graph_plan_update and let the backend figure that only the shapes changed.

And we also need to call ggml_backend_sched_reset() + ggml_backend_sched_alloc_graph() inevitably when a shape changes, correct?

You need to do that every time you create a new graph, otherwise the tensors won't even be allocated. To refresh the usage of these functions:

ggml_backend_sched_reset resets tensor assignments, which lets you use functions such as ggml_backend_sched_set_tensor_backend before allocating a graph. Generally this is something that you need to do before starting to create a new graph.
ggml_backend_sched_alloc_graph splits the graph and allocates the tensors in the graph. Before this, the tensors won't even have an address.

mtavenrath · 2025-07-04T12:13:03Z

mtavenrath
Jul 4, 2025
Author

I am still slightly concerned recreating a ggml graph every 256 tokens would work for APIs like Vulkans new SPIR-V graph API extension, https://registry.khronos.org/vulkan/specs/latest/man/html/VK_ARM_data_graph.html, or TRT-RTX.

While Vulkans API currently supports fixed shapes only TRT-RTX supports dynamic shapes with a min/max configuration. This way allocation is fixed for all iterations potentially avoiding the need to call ggml_backend_sched_reset() ggml_backend_sched_alloc_graph(). Some documentation about torch-trt supports dynamic shapes can be found here: https://docs.pytorch.org/TensorRT/_notebooks/dynamic-shapes.html#3. In the CUDA and Vulkan backend a worst-case would be a valid option as well.

On the kv-cache use case, is there any other change to the graph than tensor sizes? If not, what would prevent us from following the min/max tensor size pattern and ensure the allocation is sufficiently large up to the max tensor size specified so that all that has to be done after a graph update is updating the tensor shapes?

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Increase CPU & GPU execution overlap of GGML Backends through the new Graph Plan API #14514

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 2 comments 2 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Increase CPU & GPU execution overlap of GGML Backends through the new Graph Plan API #14514

Uh oh!

Uh oh!

mtavenrath Jul 3, 2025

Introduction:

Improving with the Graph Plan API

Replies: 2 comments · 2 replies

Uh oh!

slaren Jul 3, 2025 Maintainer

Uh oh!

ggerganov Jul 4, 2025 Maintainer

Uh oh!

Uh oh!

slaren Jul 4, 2025 Maintainer

Uh oh!

mtavenrath Jul 4, 2025 Author

mtavenrath
Jul 3, 2025

Replies: 2 comments 2 replies

slaren
Jul 3, 2025
Maintainer

ggerganov Jul 4, 2025
Maintainer

slaren Jul 4, 2025
Maintainer

mtavenrath
Jul 4, 2025
Author