
llama: add initial support for Falcon-H1 model family #14534


Open
wants to merge 53 commits into base: master

Conversation

@ibrahimkhadraoui commented Jul 4, 2025

Fixes: #13681
Summary
• Adds initial support for the Falcon-H1 model family.
• Implements model loading, basic inference, and tokenizer integration for Falcon-H1.
• Updates the build scripts and relevant documentation for Falcon-H1 compatibility.

Details
• Adapted model architecture and layer mapping to match Falcon-H1.
• Integrated Falcon-H1 tokenizer with automatic fallback if tokenizer files are missing.
• Added new test cases to verify Falcon-H1 model loading and inference.
• Cleaned up redundant code from previous Falcon integration attempts.

Notes
• The Falcon-H1 integration follows the same approach as other model families (see llama and Mamba support).
• This supersedes #14238 with a cleaner and more modular implementation.
• Refer to the Falcon-H1 repo for model weights and tokenizer files.

@github-actions bot added the python (python script changes) label Jul 4, 2025
@jacekpoplawski

Is the old GGUF still valid, or does a new one have to be generated?

@ibrahimkhadraoui
Author

@jacekpoplawski
You need to re-convert with convert_hf_to_gguf.py; we made some changes to the data types.

@ibrahimkhadraoui marked this pull request as ready for review July 7, 2025 13:38
@ibrahimkhadraoui
Author

Hello @ggerganov and @compilade,

My colleague @younesbelkada and I have addressed all the comments you provided. In addition, we’ve introduced a few enhancements to further improve the implementation.

Let us know if you have any further feedback!

Comment on lines +689 to +696
if (
chkhsh == "60476e1243776c4fb1b993dbd7a5f15ac22f83c80afdf425fa5ae01c8d44ef86" or
chkhsh == "3eda48b4c4dc7de733d1a8b3e3b4a85243dbbf704da2ee9d42c6beced8897896" or
chkhsh == "48f8e02c0359c0bbdd82f26909171fac1c18a457bb47573ed1fe3bbb2c1cfd4b" or
chkhsh == "a6b57017d60e6edb4d88ecc2845188e0eb333a70357e45dcc9b53964a73bbae6"
):
# ref: https://huggingface.co/collections/tiiuae/falcon-h1-6819f2795bc406da60fab8df
res = "falcon_h1"
Member

Why do we have multiple hashes here?

This section should be generated by the convert_hf_to_gguf_update.py script and it will be overwritten the next time we run it.

@@ -288,6 +289,7 @@ class MODEL_ARCH(IntEnum):
LLAMA4 = auto()
DECI = auto()
FALCON = auto()
FALCON_H1 = auto()
Member

Suggested change (whitespace alignment only):
FALCON_H1 = auto()
FALCON_H1 = auto()

@@ -660,6 +662,7 @@ class MODEL_TENSOR(IntEnum):
MODEL_ARCH.DOTS1: "dots1",
MODEL_ARCH.ARCEE: "arcee",
MODEL_ARCH.ERNIE4_5: "ernie4_5",
MODEL_ARCH.FALCON_H1: "falcon_h1",
Member

Suggested change (whitespace alignment only):
MODEL_ARCH.FALCON_H1: "falcon_h1",
MODEL_ARCH.FALCON_H1: "falcon_h1",

Comment on lines 1960 to +1963
// TODO: There are currently no hybrid models! Once there are, this will be
// the place to identify them
switch (arch) {
case LLM_ARCH_FALCON_H1:
Member

Remove the "TODO" comment as it is no longer relevant.

LLM_KV_SSM_HEAD_DIM,
LLM_KV_MAMBA_D_SSM,
LLM_KV_N_LAYER,
LLM_KV_FALCON_H1_MAMBA_RMS_NORM,
Member

Can this be simply LLM_KV_MAMBA_RMS_NORM? If there isn't anything very specific for Falcon H1, it's better to keep the names generic.

Comment on lines +126 to +128
bool ssm_rms_norm = false;
bool ssm_conv_bias = false;
bool ssm_proj_bias = false;
Member

Are these used?

uint32_t attn_head_dim = 0;
bool mamba_rms_norm = false;
uint32_t vocab_size = 0;
uint32_t intermediate_size = 0;
Member

Seems unused?


uint32_t attn_head_dim = 0;
bool mamba_rms_norm = false;
uint32_t vocab_size = 0;
Member

Normally, we don't put the vocab_size in the hparams. Instead, we pick it from the llama_vocab, so this is likely not needed.

uint32_t ssm_head_dim = 0;
uint32_t ssm_mamba_d_ssm = 0;

uint32_t attn_head_dim = 0;
Member

This parameter can be avoided.

See the logic here:

if (hparams.n_head() > 0) {
// gpt-neox n_rot = rotary_pct * (n_embd / n_head)
// gpt-j n_rot = rotary_dim
hparams.n_embd_head_k = hparams.n_embd / hparams.n_head();
ml.get_key(LLM_KV_ATTENTION_KEY_LENGTH, hparams.n_embd_head_k, false);
hparams.n_embd_head_v = hparams.n_embd / hparams.n_head();
ml.get_key(LLM_KV_ATTENTION_VALUE_LENGTH, hparams.n_embd_head_v, false);

And as an example how we apply it for the Gemma model which can have a custom attention head size like in your case:

llama.cpp/src/llama-model.cpp, lines 1021 to 1025 in a3403ae:
// ref: https://github.com/google/gemma_pytorch/blob/014acb7ac4563a5f77c76d7ff98f31b568c16508/gemma/config.py#L289
hparams.f_attention_scale = type == LLM_TYPE_27B
? 1.0f / std::sqrt(float(hparams.n_embd / hparams.n_head(0)))
: 1.0f / std::sqrt(float(hparams.n_embd_head_k));

@@ -115,6 +115,17 @@ struct llama_hparams {
uint32_t ssm_d_state = 0;
uint32_t ssm_dt_rank = 0;
uint32_t ssm_n_group = 0;
uint32_t ssm_head_dim = 0;
uint32_t ssm_mamba_d_ssm = 0;
Member

Should this be just ssm_d_ssm?

Generally, I think the mamba prefix is not needed here. For example:

mamba_rms_norm -> ssm_rms_norm
mamba_expand -> ssm_expand

@compilade Do you agree?

Collaborator

I agree, but I think these are not necessary, since ssm_mamba_d_ssm looks like ssm_d_inner to me, unless I'm misunderstanding.

inpSA = ggml_add(ctx0, cur, inpSA);
cb(cur, "layer_out", il);

if (il == n_layer - 1) {
Member

Suggested change
if (il == n_layer - 1) {
if (il == n_layer - 1 && inp_out_ids) {
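
For reference, a minimal sketch of what this branch typically guards in the other build_* graphs (the body shown here is assumed from those graphs, not taken from this PR):

if (il == n_layer - 1 && inp_out_ids) {
    // keep only the rows for which outputs (logits/embeddings) were requested
    cur   = ggml_get_rows(ctx0, cur,   inp_out_ids);
    inpSA = ggml_get_rows(ctx0, inpSA, inp_out_ids);
}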

y = ggml_mul(ctx0, y, ggml_silu(ctx0, ggml_cont(ctx0, z)));

// grouped RMS norm
if (hparams.mamba_rms_norm){
Member

Suggested change
if (hparams.mamba_rms_norm){
if (hparams.mamba_rms_norm) {

Collaborator

This could also simply use the presence of model.layers[il].ssm_norm, as I'm suggesting in #14534 (comment).

Suggested change
if (hparams.mamba_rms_norm){
if (model.layers[il].ssm_norm) {
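
A rough sketch of how the grouped norm could then be gated purely on tensor presence; d_inner, n_group, n_seq_tokens and n_seqs are assumed local variables following the existing Mamba-2 graph, not code quoted from this PR:

// apply the grouped RMS norm only when the optional ssm_norm tensor was loaded
if (model.layers[il].ssm_norm) {
    y = ggml_reshape_4d(ctx0, y, d_inner / n_group, n_group, n_seq_tokens, n_seqs);
    y = build_norm(y, model.layers[il].ssm_norm, nullptr, LLM_NORM_RMS, il);
    y = ggml_reshape_3d(ctx0, y, d_inner, n_seq_tokens, n_seqs);
}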

Comment on lines -77 to +82
return (ssm_d_conv > 0 ? ssm_d_conv - 1 : 0) * (ssm_d_inner + 2*ssm_n_group*ssm_d_state);

// check if the architecture is using d_ssm
if (ssm_mamba_d_ssm > 0) {
return (ssm_d_conv > 0 ? ssm_d_conv - 1 : 0) * (ssm_mamba_d_ssm + 2*ssm_n_group*ssm_d_state);
} else {
return (ssm_d_conv > 0 ? ssm_d_conv - 1 : 0) * (ssm_d_inner + 2*ssm_n_group*ssm_d_state);
Collaborator
@compilade Jul 7, 2025

This could be simplified by avoiding the duplication of the ssm_d_inner hparam into ssm_mamba_d_ssm (since both seem to be the same?)

const int64_t ssm_conv_kernel_size = hparams.ssm_d_conv; // ssm_conv_kernel_size
const int64_t ssm_n_groups = hparams.ssm_n_group; // ssm_n_groups
const int64_t ssm_state_size = hparams.ssm_d_state; // ssm_state_size
const int64_t ssm_intermediate_size = hparams.ssm_mamba_d_ssm > 0 ? hparams.ssm_mamba_d_ssm : int(hparams.mamba_expand * hidden_size); // TODO expand
Collaborator

The expand factor probably doesn't need to be explicitly stored or used if it's always the ratio between n_embd and ssm_d_inner.
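
A minimal sketch of that simplification, under the assumption that ssm_d_inner always equals expand * n_embd for these checkpoints:

// recover the expansion factor on demand instead of storing mamba_expand / mamba_d_ssm
const int64_t ssm_intermediate_size = hparams.ssm_d_inner;
const int64_t ssm_expand            = hparams.ssm_d_inner / hparams.n_embd; // only if ever needed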

Comment on lines +1179 to 1183

MODEL_TENSOR.SSM_NORM: (
"model.layers.{bid}.mamba.norm",
),

Collaborator

This overrides the other entry for that tensor family. Move the tensor name to the existing entry of MODEL_TENSOR.SSM_NORM.

Comment on lines +6676 to +6679
self.gguf_writer.add_uint32("falcon_h1.attention.head_dim", self.hparams["head_dim"])
self.gguf_writer.add_uint32("falcon_h1.ssm.mamba_d_ssm", self.hparams["mamba_d_ssm"])
self.gguf_writer.add_uint32("falcon_h1.num_attention_heads", self.find_hparam(["num_attention_heads"]))
self.gguf_writer.add_uint32("falcon_h1.num_key_value_heads",
Collaborator

Use the existing add_ methods for these key names from gguf_writer.

(Aren't most of those already set above, though?)

Comment on lines +4561 to +4563
if (hparams.mamba_rms_norm == true) {
layer.ssm_norm = create_tensor(tn(LLM_TENSOR_SSM_NORM, "weight", i), {ssm_intermediate_size / ssm_n_groups, ssm_n_groups}, 0);
}
Collaborator

Is this always true for all Falcon-H1 models?

If not, is this tensor absent when it's false? That could be used (e.g. by making the tensor optional, and only applying the norm when it's present) instead of adding a new metadata key.
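
A sketch of the optional-tensor variant, assuming the weight is simply missing from checkpoints that don't use the norm and that the loader's TENSOR_NOT_REQUIRED flag is applicable here:

// make the norm tensor optional; layer.ssm_norm stays nullptr when it is absent,
// and the build code can then branch on model.layers[il].ssm_norm as suggested above
layer.ssm_norm = create_tensor(tn(LLM_TENSOR_SSM_NORM, "weight", i),
        {ssm_intermediate_size / ssm_n_groups, ssm_n_groups}, TENSOR_NOT_REQUIRED);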

Labels: python (python script changes)
Projects: None yet
Development: successfully merging this pull request may close these issues:
Feature Request: Falcon-H1
5 participants