Granite Four #13550

Draft · gabe-l-hart wants to merge 71 commits into master

Conversation

gabe-l-hart (Contributor) commented May 14, 2025

Description

This PR is the endpoint for architecture support for Granite 4.0 (#13269). It incorporates a number of changes from other in-flight branches that will need to be merged first:

Additionally, this PR replaces some work done on other PRs / branches:

Outstanding Questions

Besides the upstream PRs, there are a few questions to answer before this PR is merge-ready:

  • This PR contains several changes to llama-kv-cache beyond those in feat: Hybrid unified/recurrent cache #13276, but they depend on the addition of hparams.recurrent_layer_arr, which is only populated correctly if there is a valid model architecture to check against. Should I move all of these changes to the hybrid cache PR or keep them here where the model architectures become real?
  • Is there a more efficient way to implement hparams.recurrent_layer_arr? Using a max-layer-size std::array doesn't feel quite right (a sketch of the current approach follows this list).
  • There are still some numerical differences between the attention outputs when running Bamba and granite-4.0-tiny-shared-preview on this branch vs the respective draft branches, so I need to determine whether this is due to changes in the attention implementation (i.e. "working as expected") or a bug somewhere.
  • The use of dynamic_cast to get the right cache type could be expensive (though it's likely negligible relative to the tensor math). Should we do something more clever to handle different cache types in llama-graph?
  • The switch statement for determining the type of KV cache to allocate in llama-model.cpp seems redundant with llama_model_is_recurrent and llama_model_is_hybrid. Should we use those functions instead and eliminate the duplicate logic and additional place to tweak for new recurrent / hybrid models?
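
For reference, a minimal sketch of the fixed-size approach the second question refers to. The field and accessor names follow the PR text; the max-layer constant and the surrounding struct are simplified assumptions, not the actual llama.cpp definitions:

```cpp
#include <array>
#include <cstdint>

// assumed cap: pre-allocated regardless of the model's actual n_layer
constexpr uint32_t LLAMA_MAX_LAYERS_SKETCH = 512;

struct llama_hparams_sketch {
    uint32_t n_layer = 0;

    // one flag per possible layer: the "max-layer-size std::array" that
    // "doesn't feel quite right" in the question above
    std::array<bool, LLAMA_MAX_LAYERS_SKETCH> recurrent_layer_arr = {};

    // true if layer il uses the recurrent (SSM) path, false if it uses attention
    bool recurrent_layer(uint32_t il) const {
        return recurrent_layer_arr[il];
    }
};
```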

Testing

To test out this branch, I've been using the following models:

Details

This PR has a lot of changes in it, some of which are isolated in the prerequisite PRs above. In addition to the general mamba2 and llama_kv_cache_hybrid changes, this PR does the following:

python side

  • Add conversion support for BambaForCausalLM and GraniteMoeHybridForCausalLM
    • This includes one small tweak to gguf_writer.py that allows duplicate key/value pairs through add_key_value if (and only if) they match both value and type with the existing key. This is a convenience for hybrid models so that the converter doesn't need to rewrite the hparam conversion from multiple parents.
    • This also adds the new HybridAttention section under Keys in constants.py to hold attention.layer_indices. OPEN QUESTION: Should this just go under Attention?

c++ side

  • Add a new public API function llama_model_is_hybrid akin to llama_model_is_recurrent
    • I also split up both this function and llama_model_is_recurrent into llm_arch_is_* implemented in llama-arch.* and llama_model_is_* implemented in llama-model.*. This was done so that they could be used during model initialization before the model itself can be passed as the argument, specifically to determine how to populate hparams.recurrent_layer_arr (see below).
  • Add hparams.recurrent_layer_arr and support parsing it
    • The current implementation pre-allocates it as a fixed-length array which doesn't feel quite right.
  • Add an optional layer id to hparams.n_embd_k_s / hparams.n_embd_v_s
    • This is done because for hybrid models, the values may be different by layer.
    • I plumbed through as many usages of these methods as I could find to properly pass the layer index, but there are some places where it's not available which default to layer 0. This should be fine since none of those places interact with the hybrid caching.
  • Add hparams.recurrent_layer(uint32_t) to check whether a given layer is recurrent
  • Model name/param/arch plumbing for bamba and granitemoeshared in llama-arch.* (the boring part!)
  • (possibly breaking) Add hparams as an additional argument to the llama_model.create_memory method
    • This is done so the hparams can be given to the cache construction and used to determine which layers are recurrent for hybrid cache creation
  • In llama-graph, anywhere that a specific cache type needs to be fetched, it is grabbed using new methods get_recurrent_cache / get_unified_cache. These methods use dynamic_cast to handle both non-hybrid caches and hybrid caches (see the sketch after this list).
  • Add support for instantiating the hybrid cache in llama-model.cpp
  • Add model support for bamba and granitemoehybrid in llama-model
    • Most of this is "business as usual," but that breaks down when trying to avoid code duplication for the hybrid architecture
    • To avoid code duplication, I hoisted build_mamba_layer / build_mamba2_layer from llm_build_mamba and build_attention_layer / build_layer_ffn from llm_build_granite into static methods on their respective classes. This makes for some gross function signatures where member data needs to be explicitly passed, but it allows the hybrid model architecture(s) to use these methods without complex inheritance.
    • I tried an alternative route using diamond inheritance, but this would have required some kind of "don't actually initialize the graph" switch in the parent model builders' constructors to avoid trying to build the parent model graphs during initialization of the hybrid class.
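
To make the dynamic_cast-based lookup mentioned above concrete, here is a hedged sketch. The class names mirror ones discussed in this PR, but the hierarchy is a simplified stand-in, and the helper is written as a free function rather than a llm_graph_context method:

```cpp
// Simplified stand-in hierarchy; the real classes live in the llama.cpp memory /
// KV-cache headers and expose much richer interfaces.
struct llama_memory_i {
    virtual ~llama_memory_i() = default;
};

struct llama_kv_cache_unified : llama_memory_i { /* attention KV state */ };
struct llama_memory_recurrent : llama_memory_i { /* SSM / recurrent state */ };

// the hybrid cache owns one child of each kind
struct llama_memory_hybrid : llama_memory_i {
    llama_kv_cache_unified * attn = nullptr;
    llama_memory_recurrent * recr = nullptr;
};

// graph-side helper in the spirit of get_recurrent_cache: works whether the
// model's memory is a plain recurrent cache or a hybrid cache wrapping one
static llama_memory_recurrent * get_recurrent_cache(llama_memory_i * mem) {
    if (auto * hyb = dynamic_cast<llama_memory_hybrid *>(mem)) {
        return hyb->recr;
    }
    return dynamic_cast<llama_memory_recurrent *>(mem); // nullptr if neither applies
}
```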

* ggml : improve ggml_mul speed when masking recurrent states
* ggml : make the ggml_mul fast broadcast path more consistently formatted
The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires
workarounds to work correctly.
The max index is 31, so trimming the arguments is necessary.
Whoops, this is needed for the offset in the concatenated output.
This was initially added because states were masked with ggml_mul,
but this is no longer done and so this "optimisation" is no longer
necessary, or at least not worth the additional code complexity.
This makes the weight buft detection in src/llama.cpp simpler.

* convert : transpose Mamba-2 A, D and reshape SSM_NORM

This breaks existing conversions of Mamba-2 models
to avoid some reshapes.

Not sure if it's a good idea,
but it makes the graph slightly cleaner.

* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks
And also fix multi-user inference for recurrent models
by using cell_id instead of i as the kv cell index
when populating s_copy.
Branch: GraniteFour

Signed-off-by: Gabe Goodhart <[email protected]>
This re-uses the Bamba code paths heavily and simply adds the missing parts
for loading MoE and the shared expert.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <[email protected]>
Branch: GraniteFour

Signed-off-by: Gabe Goodhart <[email protected]>
Branch: GraniteFour

Signed-off-by: Gabe Goodhart <[email protected]>
…_mamba*_layer

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <[email protected]>
… impl to use mixins

The challenge here is to give both the non-hybrid classes (llm_build_mamba
and llm_build_granite) AND the hybrid class (llm_build_hybrid_mamba) access
to the same intermediate "base class" functionality (build_mamba*_layer,
build_granite_attention_layer) without running into trouble with diamond
inheritance of llm_graph_context. Due to the non-trivial initialization
that happens in llm_graph_context, diamond inheritance results in multiple
initializations of the common base which cause problems around the unique
ptrs. I wanted to get away from `self->` everywhere, but this is still a
bit cleaner than making those methods static I think.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <[email protected]>
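
A hedged sketch of how the mixin arrangement described in this commit message might look. The types here are simplified stand-ins (the real builders carry far more state and parameters), and the member names are illustrative only:

```cpp
// Simplified stand-in for llm_graph_context: heavy, non-trivially-initialized
// state that must not be constructed more than once per builder.
struct llm_graph_context_sketch {
    int n_layer = 0;
    // ... tensors, inputs, callbacks, etc. in the real class ...
};

// The mixins hold a pointer back to the context ("self") instead of deriving
// from it, so composing several of them cannot create a diamond over the base.
struct llm_build_mamba_mixin_sketch {
    llm_graph_context_sketch * self;
    explicit llm_build_mamba_mixin_sketch(llm_graph_context_sketch * ctx) : self(ctx) {}
    void build_mamba2_layer(int il) { (void) il; /* uses self->... */ }
};

struct llm_build_granite_mixin_sketch {
    llm_graph_context_sketch * self;
    explicit llm_build_granite_mixin_sketch(llm_graph_context_sketch * ctx) : self(ctx) {}
    void build_attention_layer(int il) { (void) il; /* uses self->... */ }
};

// The hybrid builder inherits the context exactly once and pulls in both
// layer builders as mixins, initializing each with `this`.
struct llm_build_hybrid_mamba_sketch : llm_graph_context_sketch,
                                       llm_build_mamba_mixin_sketch,
                                       llm_build_granite_mixin_sketch {
    llm_build_hybrid_mamba_sketch()
        : llm_build_mamba_mixin_sketch(this),
          llm_build_granite_mixin_sketch(this) {}
};
```
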
…r builders

This follows the pattern where the type of input is pinned to the type of
memory and that is used to dispatch to the correct version of `build_rs` /
`build_attn`. There's a lot of code duplication that can hopefully be
pulled into common functions in the graph later.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <[email protected]>
I've gone back and forth a lot about how/if to try to implement reuse of the
"child model" layer types for hybrid models. At the end of the day, I think
hybrid models are their own beast and even if their layers are inspired by
other models, they should maintain control of their own layer building (in
other words, the copy-paste method). Given that, the name should reflect
that this is not a generic hybrid model builder, but rather a granite-
specific hybrid model builder that can do MoE (granite 4) or dense (bamba).

As part of this, I also cleaned up dangling comments from previous attempts
at using static methods for reusability.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart <[email protected]>
compilade and others added 3 commits June 26, 2025 17:58
Subclasses of llm_graph_context cannot have extra fields,
because the called destructor is not the one from the subclass.
This otherwise would cause problems when running Mamba-(1|2) inference
when compiled with -DGGML_SANITIZE_ADDRESS=ON.
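
A tiny, self-contained illustration (hypothetical types, not llama.cpp code) of why a subclass with extra fields breaks here when the base destructor is not virtual:

```cpp
// Deleting a subclass object through a base pointer whose destructor is not
// virtual invokes the wrong destructor and deallocates with the wrong size,
// which AddressSanitizer reports as a new/delete size mismatch.
struct base_ctx {
    int a = 0;
    ~base_ctx() = default;   // not virtual
};

struct derived_ctx : base_ctx {
    int extra_field = 0;     // extra storage the base destructor knows nothing about
};

int main() {
    base_ctx * ctx = new derived_ctx(); // allocated as derived_ctx
    delete ctx;                         // undefined behaviour: destroyed/freed as base_ctx
}
```
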
* origin/compilade/mamba2: (29 commits)
mamba : fix mismatched new and delete size for llm_build_mamba
cuda : implement ssm scan for Mamba2
ggml-cpu : reorder SVE FMA for consistency with other SIMD arches
ggml : fix mamba2 ssm scan when compiled with SVE
graph : fix recurrent state copies when avoiding copies
kv-cache : allow context shift for recurrent models
convert : avoid AutoConfig for Mamba and Mamba2 hparams
kv-cache : remove const_cast when setting inputs for s_copy
metal : single-user mamba2 inference works
metal : add missing args for nb references in ssm_scan_f32_group
metal : fix confusion between ; and ,
convert : fix flake8 lint
ggml : avoid multiply by D in GGML_OP_SSM_SCAN
ggml : remove unused fast broadcast path in GGML_MUL
metal : fix wrong number of tokens per sequence in SSM_SCAN
metal : fix SSM_SCAN state head offset
metal : add back n_seqs to SSM_SCAN args
metal : remove unused arguments for SSM_SCAN
metal : use log and exp instead of log1pf and expf in SSM_SCAN
metal : fix SSM_SCAN pipeline scope
...
* mamba2-sync: (22 commits)
recurrent : call balloc split_reset() in init_batch() (ggml-org#14414)
ggml : add ggml_set_rows (ggml-org#14274)
convert : fix broken sentencepiece vocab (ggml-org#14416)
mamba : fix mismatched new and delete size for llm_build_mamba
model : gemma3n text-only (ggml-org#14400)
cmake: regen vulkan shaders when shaders-gen sources change (ggml-org#14398)
llama : return mistral-v7-tekken as default template only (ggml-org#14390)
metal : add special-case mat-vec mul for ne00 == 4 (ggml-org#14385)
metal : batch rows copy in a single threadgroup (ggml-org#14384)
docs: update s390x documentation + add faq (ggml-org#14389)
musa: enable fp16 mma (all) and cublas on qy2 (ggml-org#13842)
ggml-cpu: enable IBM NNPA Vector Intrinsics (ggml-org#14317)
ggml : do not output unprintable characters on GGUF load failure (ggml-org#14381)
sycl: GGML_SYCL_DISABLE_OPT on by default for all Intel Devices (ggml-org#13973)
opencl: ref count `ggml_backend_opencl_context` and refactor profiling (ggml-org#14254)
batch : fix check for empty sequences in memory (ggml-org#14364)
cmake : use LLAMA_BUILD_NUMBER when defining LLAMA_INSTALL_VERSION (ggml-org#14362)
server : move no API key doc to /health (ggml-org#14352)
main : honor --verbose-prompt on interactive prompts (ggml-org#14350)
jinja : Add Mistral-Small-3.2-24B-Instruct-2506.jinja (ggml-org#14349)
...
gabe-l-hart (Contributor, Author) commented:

@compilade @ggerganov @AnmolS1 I'm going to move the conversation about hybrid cache seg faults over here to avoid cluttering the mamba2 branch since I'm pretty convinced it's a hybrid-only problem.

gabe-l-hart (Contributor, Author) commented:

I think the issue may be a logic bug that is somehow being triggered by the parallel prefill when running without -pps. The logic I believe is flawed:

  1. llama_memory_hybrid_context's init_update constructor (here) invokes init_update for both child caches, then calls llama_memory_status_combine on the status values for both children
  2. In llama_memory_recurrent, init_update always results in LLAMA_MEMORY_STATUS_NO_UPDATE (here), but in llama_kv_cache_unified, init_update will sometimes result in LLAMA_MEMORY_STATUS_SUCCESS (here)
  3. In llama_memory_status_combine, the logic will return LLAMA_MEMORY_STATUS_SUCCESS if either of the children are in the success state
  4. After combining the children's statuses to set its own, the constructor of llama_memory_hybrid_context does not update the child caches' statuses to match

So, the question in my mind is how the two statuses could end up different and what (if any) problems this would cause for the hybrid cache.
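
For illustration, a simplified approximation (not the exact llama.cpp implementation) of the status-combining behaviour that points 1-4 above refer to:

```cpp
// Simplified status enum and combine logic; the real versions live in
// llama-memory.h / llama-memory.cpp and include more failure states.
enum llama_memory_status {
    LLAMA_MEMORY_STATUS_SUCCESS,
    LLAMA_MEMORY_STATUS_NO_UPDATE,
    LLAMA_MEMORY_STATUS_FAILED_PREPARE,
};

static llama_memory_status llama_memory_status_combine(
        llama_memory_status s0, llama_memory_status s1) {
    // any failure wins ...
    if (s0 == LLAMA_MEMORY_STATUS_FAILED_PREPARE || s1 == LLAMA_MEMORY_STATUS_FAILED_PREPARE) {
        return LLAMA_MEMORY_STATUS_FAILED_PREPARE;
    }
    // ... otherwise a single SUCCESS is enough: this is how the hybrid context
    // can report SUCCESS while its recurrent child is still NO_UPDATE
    if (s0 == LLAMA_MEMORY_STATUS_SUCCESS || s1 == LLAMA_MEMORY_STATUS_SUCCESS) {
        return LLAMA_MEMORY_STATUS_SUCCESS;
    }
    return LLAMA_MEMORY_STATUS_NO_UPDATE;
}
```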

gabe-l-hart (Contributor, Author) commented:

If I put a conditional check on ctx_recr->get_status() here, it seems to fix the issue. Will keep looking further.

compilade (Collaborator) commented Jun 27, 2025

> 2. In llama_memory_recurrent, init_update always results in LLAMA_MEMORY_STATUS_NO_UPDATE (here), but in llama_kv_cache_unified, init_update will sometimes result in LLAMA_MEMORY_STATUS_SUCCESS (here)

The problem is probably caused when llama_context::kv_self_update is called for the KV cache, but there's nothing to apply for the recurrent cache.

> If I put a conditional check on ctx_recr->get_status() here, it seems to fix the issue. Will keep looking further.

It definitely sounds like the recurrent cache's apply should not be called when its state is LLAMA_MEMORY_STATUS_NO_UPDATE.

I wonder if it would be simpler with a separate apply_update method which would unambiguously be a no-op when the sub-cache status is LLAMA_MEMORY_STATUS_NO_UPDATE, so that it could be called from llama_context::kv_self_update. But since the current recurrent cache design never requires separate updates, I guess it could be sufficient to make llama_memory_recurrent_context::apply always be a no-op when status == LLAMA_MEMORY_STATUS_NO_UPDATE.

Not sure if there's a situation where the same problem could happen with iswa caches (i.e. one of the sub-caches needs update while the other doesn't).
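
A minimal sketch of the child-side option described above (making the recurrent context's apply a no-op for NO_UPDATE); the types are simplified and this is not the code that was ultimately merged:

```cpp
// Simplified status enum repeated here so the sketch stands alone.
enum llama_memory_status { LLAMA_MEMORY_STATUS_SUCCESS, LLAMA_MEMORY_STATUS_NO_UPDATE };

struct llama_memory_recurrent_context_sketch {
    llama_memory_status status = LLAMA_MEMORY_STATUS_NO_UPDATE;

    bool apply() {
        // instead of asserting SUCCESS, treat NO_UPDATE as a harmless no-op so a
        // parent (hybrid) context may call apply() on both children unconditionally
        if (status == LLAMA_MEMORY_STATUS_NO_UPDATE) {
            return true;
        }
        // ... normal update path for LLAMA_MEMORY_STATUS_SUCCESS ...
        return true;
    }
};
```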

gabe-l-hart (Contributor, Author) commented:

Ah, yeah, so that makes sense that we could solve this in the recurrent cache as well by simply making it a no-op if status is LLAMA_MEMORY_STATUS_NO_UPDATE rather than asserting success.

gabe-l-hart (Contributor, Author) commented:

I'm wondering if the mutex fix that @AnmolS1 did was circumventing this situation where both caches were somehow in different update states. That and the -pps flag are the only parts of this that still feel mysterious.

gabe-l-hart (Contributor, Author) commented:

Ok, yeah, the logic in kv_self_update would definitely trigger this if the two caches had different statuses. I think it makes sense to contain this within the hybrid cache since the problem really lies with the fact that the two are out of sync and it's reasonable for the recurrent cache to assume that its apply will never be called if its status is NO_UPDATE based on the logic in kv_self_update.

I'll open a standalone PR to fix this on master.
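
To make the containment idea concrete, a hedged sketch where the hybrid context only forwards apply() to a child whose own status calls for it; the types are simplified stand-ins, and the actual fix landed in the PR referenced just below:

```cpp
enum llama_memory_status { LLAMA_MEMORY_STATUS_SUCCESS, LLAMA_MEMORY_STATUS_NO_UPDATE };

// minimal stand-in for a child memory context (attention or recurrent)
struct llama_memory_context_sketch {
    llama_memory_status status = LLAMA_MEMORY_STATUS_NO_UPDATE;
    llama_memory_status get_status() const { return status; }
    bool apply() { return true; }
};

struct llama_memory_hybrid_context_sketch {
    llama_memory_context_sketch * ctx_attn;
    llama_memory_context_sketch * ctx_recr;

    bool apply() {
        bool res = true;
        // only children whose status actually requires an update are applied,
        // so an out-of-sync NO_UPDATE child is never asked to apply
        if (ctx_attn->get_status() == LLAMA_MEMORY_STATUS_SUCCESS) {
            res = ctx_attn->apply() && res;
        }
        if (ctx_recr->get_status() == LLAMA_MEMORY_STATUS_SUCCESS) {
            res = ctx_recr->apply() && res;
        }
        return res;
    }
};
```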

gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Jun 27, 2025
There are conditions where the two child contexts can end up with different status values based on the logic in the init_update constructor for llama_kv_cache_unified_context, which can conditionally set status to either LLAMA_MEMORY_STATUS_SUCCESS or LLAMA_MEMORY_STATUS_NO_UPDATE.

See full discussion:
ggml-org#13550 (comment)

Branch: HybridCacheApplyLogic

Signed-off-by: Gabe Goodhart <[email protected]>
gabe-l-hart (Contributor, Author) commented:

Fix PR: #14428

ggerganov and others added 3 commits June 29, 2025 10:15
* origin/master:
metal : disable fast-math for some cpy kernels (ggml-org#14460)
ggml-cpu: sycl: Re-enable exp f16 (ggml-org#14462)
test-backend-ops : disable llama test (ggml-org#14461)
cmake : Remove redundant include path in CMakeLists.txt (ggml-org#14452)
scripts : make the shell scripts cross-platform (ggml-org#14341)
server : support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client (ggml-org#13196)
server : fix appearance of the chats list context menu for Safari (ggml-org#14322)
SYCL: disable faulty fp16 exp kernel (ggml-org#14395)
ggml : fix unmerged GGML_FPxx_TO_FPxx refactoring (ggml-org#14443)
ggml : implement REGLU/GEGLU/SWIGLU ops (ggml-org#14158)
vulkan: Add fusion support for RMS_NORM+MUL (ggml-org#14366)
CUDA: add bf16 and f32 support to cublas_mul_mat_batched (ggml-org#14361)
vulkan: handle noncontig in the final case of ggml_vk_get_cpy_pipeline (ggml-org#14378)
vulkan: lock accesses of pinned_memory vector (ggml-org#14333)
model : add support for ERNIE 4.5 0.3B model (ggml-org#14408)
fix async_mode bug (ggml-org#14432)
ci : fix windows build and release (ggml-org#14431)
vulkan: Fix GGML_VULKAN_SHADER_DEBUG_INFO (ggml-org#14427)
graph : make llm_graph_context destructor virtual (ggml-org#14410)
* origin/gg/memory-is-fail:
memory : correctly handle failure in apply()
* origin/master:
memory : correctly handle failure in apply() (ggml-org#14438)
* origin/master:
Add Vulkan images to docker.md (ggml-org#14472)
CANN: update aclnnGroupedMatmulV2 to aclnnGroupedMatmulV3 (ggml-org#14411)
vulkan: Split large mul_mat_id to fit in shared memory (ggml-org#14451)
add GELU_ERF (ggml-org#14455)
ggml : remove trailing whitespace (#0)
sync : ggml
ggml-cpu : "align corners" for bilinear upscale/downscale (ggml/1285)
ggml-quants : rename best_mad to best_error (ggml/1283)
opencl : add GEGLU, REGLU, SWIGLU (ggml-org#14456)
Add Conv2d for CPU (ggml-org#14388)

Labels: Apple Metal, examples, ggml, Nvidia GPU, python, server, testing