llama: add initial support for Falcon-H1 model family #14534
base: master
Conversation
Is the old GGUF still valid, or must a new one be generated?
@jacekpoplawski
…-public into add-fh1-rebased
injected mup
Hello @ggerganov and @compilade, my colleague @younesbelkada and I have addressed all the comments you provided. In addition, we've introduced a few enhancements to further improve the implementation. Let us know if you have any further feedback!
if (
    chkhsh == "60476e1243776c4fb1b993dbd7a5f15ac22f83c80afdf425fa5ae01c8d44ef86" or
    chkhsh == "3eda48b4c4dc7de733d1a8b3e3b4a85243dbbf704da2ee9d42c6beced8897896" or
    chkhsh == "48f8e02c0359c0bbdd82f26909171fac1c18a457bb47573ed1fe3bbb2c1cfd4b" or
    chkhsh == "a6b57017d60e6edb4d88ecc2845188e0eb333a70357e45dcc9b53964a73bbae6"
):
    # ref: https://huggingface.co/collections/tiiuae/falcon-h1-6819f2795bc406da60fab8df
    res = "falcon_h1"
Why do we have multiple hashes here?
This section should be generated by the convert_hf_to_gguf_update.py script and it will be overwritten the next time we run it.
@@ -288,6 +289,7 @@ class MODEL_ARCH(IntEnum):
     LLAMA4 = auto()
     DECI = auto()
     FALCON = auto()
+    FALCON_H1 = auto()
Suggested change (the two versions differ only in whitespace):
-    FALCON_H1 = auto()
+    FALCON_H1 = auto()
@@ -660,6 +662,7 @@ class MODEL_TENSOR(IntEnum):
     MODEL_ARCH.DOTS1: "dots1",
     MODEL_ARCH.ARCEE: "arcee",
     MODEL_ARCH.ERNIE4_5: "ernie4_5",
+    MODEL_ARCH.FALCON_H1: "falcon_h1",
Suggested change (the two versions differ only in whitespace):
-    MODEL_ARCH.FALCON_H1: "falcon_h1",
+    MODEL_ARCH.FALCON_H1: "falcon_h1",
    // TODO: There are currently no hybrid models! Once there are, this will be
    // the place to identify them
    switch (arch) {
        case LLM_ARCH_FALCON_H1:
Remove the "TODO" comment as it is no longer relevant.
    LLM_KV_SSM_HEAD_DIM,
    LLM_KV_MAMBA_D_SSM,
    LLM_KV_N_LAYER,
    LLM_KV_FALCON_H1_MAMBA_RMS_NORM,
Can this be simply LLM_KV_MAMBA_RMS_NORM? If there isn't anything very specific to Falcon-H1, it's better to keep the names generic.
    bool ssm_rms_norm = false;
    bool ssm_conv_bias = false;
    bool ssm_proj_bias = false;
Are these used?
    uint32_t attn_head_dim = 0;
    bool mamba_rms_norm = false;
    uint32_t vocab_size = 0;
    uint32_t intermediate_size = 0;
Seems unused?
    uint32_t attn_head_dim = 0;
    bool mamba_rms_norm = false;
    uint32_t vocab_size = 0;
Normally, we don't put the vocab_size as an hparam. Instead, we pick it from the llama_vocab. So this is likely not needed.
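For illustration, a minimal sketch of reading the vocabulary size from the vocab rather than from a new hparam; the vocab variable name and the n_tokens() accessor are assumptions based on the current llama_vocab API, not code from this PR:
    // hedged sketch: take the vocab size from the loaded vocabulary
    // (assumes a llama_vocab instance named vocab is in scope)
    const uint32_t n_vocab = vocab.n_tokens();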
    uint32_t ssm_head_dim = 0;
    uint32_t ssm_mamba_d_ssm = 0;

    uint32_t attn_head_dim = 0;
This parameter can be avoided.
See the logic here:
Lines 529 to 538 in a3403ae
    if (hparams.n_head() > 0) {
        // gpt-neox n_rot = rotary_pct * (n_embd / n_head)
        // gpt-j    n_rot = rotary_dim
        hparams.n_embd_head_k = hparams.n_embd / hparams.n_head();
        ml.get_key(LLM_KV_ATTENTION_KEY_LENGTH, hparams.n_embd_head_k, false);
        hparams.n_embd_head_v = hparams.n_embd / hparams.n_head();
        ml.get_key(LLM_KV_ATTENTION_VALUE_LENGTH, hparams.n_embd_head_v, false);
And as an example of how we apply it for the Gemma model, which can have a custom attention head size like in your case:
Lines 1021 to 1025 in a3403ae
    // ref: https://github.com/google/gemma_pytorch/blob/014acb7ac4563a5f77c76d7ff98f31b568c16508/gemma/config.py#L289
    hparams.f_attention_scale = type == LLM_TYPE_27B
        ? 1.0f / std::sqrt(float(hparams.n_embd / hparams.n_head(0)))
        : 1.0f / std::sqrt(float(hparams.n_embd_head_k));
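For illustration, a sketch of how Falcon-H1 could lean on that same generic path instead of carrying its own attn_head_dim; the n_embd_head and kq_scale names are illustrative, not code from this PR:
    // hedged sketch: once {arch}.attention.key_length/value_length are written at conversion
    // time, the generic loader above fills n_embd_head_k/v, so the graph code can use those
    const int64_t n_embd_head = hparams.n_embd_head_k;
    const float   kq_scale    = 1.0f / sqrtf(float(n_embd_head)); // same form as the Gemma example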
@@ -115,6 +115,17 @@ struct llama_hparams {
     uint32_t ssm_d_state = 0;
     uint32_t ssm_dt_rank = 0;
     uint32_t ssm_n_group = 0;
+    uint32_t ssm_head_dim = 0;
+    uint32_t ssm_mamba_d_ssm = 0;
Should this be just ssm_d_ssm?
Generally, I think the mamba prefix is not needed here. For example:
mamba_rms_norm -> ssm_rms_norm
mamba_expand -> ssm_expand
@compilade Do you agree?
I agree, but I think these are not necessary, since ssm_mamba_d_ssm looks like ssm_d_inner to me, unless I'm misunderstanding.
    inpSA = ggml_add(ctx0, cur, inpSA);
    cb(cur, "layer_out", il);

    if (il == n_layer - 1) {
Suggested change:
-    if (il == n_layer - 1) {
+    if (il == n_layer - 1 && inp_out_ids) {
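For context, a sketch of the pattern this suggestion refers to, as used in other build_* graphs; the tensors pruned here are assumed to be the ones in the snippet above:
    // hedged sketch: on the last layer, keep only the rows needed for the output when inp_out_ids is set
    if (il == n_layer - 1 && inp_out_ids) {
        cur   = ggml_get_rows(ctx0, cur,   inp_out_ids);
        inpSA = ggml_get_rows(ctx0, inpSA, inp_out_ids);
    }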
    y = ggml_mul(ctx0, y, ggml_silu(ctx0, ggml_cont(ctx0, z)));

    // grouped RMS norm
    if (hparams.mamba_rms_norm){
Suggested change:
-    if (hparams.mamba_rms_norm){
+    if (hparams.mamba_rms_norm) {
This could also simply use the presence of model.layers[il].ssm_norm, as I'm suggesting in #14534 (comment).
Suggested change:
-    if (hparams.mamba_rms_norm){
+    if (model.layers[il].ssm_norm) {
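For illustration, a sketch of gating the grouped RMS norm on the tensor itself; the reshape shapes and build_norm call mirror the existing Mamba-2 graph code and are assumptions here, not this PR's exact code:
    // hedged sketch: apply the grouped RMS norm only when the tensor was actually loaded
    if (model.layers[il].ssm_norm) {
        y = ggml_reshape_4d(ctx0, y, d_inner / n_group, n_group, n_seq_tokens, n_seqs);
        y = build_norm(y, model.layers[il].ssm_norm, NULL, LLM_NORM_RMS, il);
        y = ggml_reshape_3d(ctx0, y, d_inner, n_seq_tokens, n_seqs);
    }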
-    return (ssm_d_conv > 0 ? ssm_d_conv - 1 : 0) * (ssm_d_inner + 2*ssm_n_group*ssm_d_state);
+    // check if the architecture is using d_ssm
+    if (ssm_mamba_d_ssm > 0) {
+        return (ssm_d_conv > 0 ? ssm_d_conv - 1 : 0) * (ssm_mamba_d_ssm + 2*ssm_n_group*ssm_d_state);
+    } else {
+        return (ssm_d_conv > 0 ? ssm_d_conv - 1 : 0) * (ssm_d_inner + 2*ssm_n_group*ssm_d_state);
This could be simplified by avoiding the duplication of the ssm_d_inner hparam into ssm_mamba_d_ssm (since both seem to be the same?).
    const int64_t ssm_conv_kernel_size = hparams.ssm_d_conv; // ssm_conv_kernel_size
    const int64_t ssm_n_groups = hparams.ssm_n_group; // ssm_n_groups
    const int64_t ssm_state_size = hparams.ssm_d_state; // ssm_state_size
    const int64_t ssm_intermediate_size = hparams.ssm_mamba_d_ssm > 0 ? hparams.ssm_mamba_d_ssm : int(hparams.mamba_expand * hidden_size); // TODO expand
The expand factor probably doesn't need to be explicitly stored or used if it's always the ratio between n_embd and ssm_d_inner.
    MODEL_TENSOR.SSM_NORM: (
        "model.layers.{bid}.mamba.norm",
    ),
This overrides the other entry for that tensor family. Move the tensor name to the existing entry of MODEL_TENSOR.SSM_NORM.
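For illustration, a sketch of the merged entry in gguf-py/gguf/tensor_mapping.py; the non-Falcon name shown is an approximation of the existing entry, not copied from this PR:
    MODEL_TENSOR.SSM_NORM: (
        "model.layers.{bid}.mamba.norm",     # falcon-h1
        "backbone.layers.{bid}.mixer.norm",  # mamba2 (approximate existing name)
    ),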
    self.gguf_writer.add_uint32("falcon_h1.attention.head_dim", self.hparams["head_dim"])
    self.gguf_writer.add_uint32("falcon_h1.ssm.mamba_d_ssm", self.hparams["mamba_d_ssm"])
    self.gguf_writer.add_uint32("falcon_h1.num_attention_heads", self.find_hparam(["num_attention_heads"]))
    self.gguf_writer.add_uint32("falcon_h1.num_key_value_heads",
Use the existing add_ methods for these key names from gguf_writer.
(Aren't most of those already set above, though?)
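For illustration, a hedged sketch using GGUFWriter helpers that already exist in gguf-py; whether each value is still needed, and whether mamba_d_ssm really equals the inner size, is the open question above:
    self.gguf_writer.add_key_length(self.hparams["head_dim"])
    self.gguf_writer.add_value_length(self.hparams["head_dim"])
    self.gguf_writer.add_head_count(self.find_hparam(["num_attention_heads"]))
    self.gguf_writer.add_head_count_kv(self.find_hparam(["num_key_value_heads"]))
    self.gguf_writer.add_ssm_inner_size(self.hparams["mamba_d_ssm"])  # only if this equals d_inner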
    if (hparams.mamba_rms_norm == true) {
        layer.ssm_norm = create_tensor(tn(LLM_TENSOR_SSM_NORM, "weight", i), {ssm_intermediate_size / ssm_n_groups, ssm_n_groups}, 0);
    }
Is this always true for all Falcon-H1 models?
If not, is this tensor absent when it's false? That could be used (e.g. by making the tensor optional, and only applying the norm when it's present) instead of adding a new metadata key.
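For illustration, a sketch of the optional-tensor approach using the loader's existing not-required flag (dimensions copied from the snippet above):
    // hedged sketch: load the norm tensor only if present and let its presence drive the behaviour
    layer.ssm_norm = create_tensor(tn(LLM_TENSOR_SSM_NORM, "weight", i),
            {ssm_intermediate_size / ssm_n_groups, ssm_n_groups},
            llama_model_loader::TENSOR_NOT_REQUIRED);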
Fixes: #13681
Summary
• Adds initial support for the Falcon-H1 model family.
• Implements model loading, basic inference, and tokenizer integration for Falcon-H1.
• Updates the build scripts and relevant documentation for Falcon-H1 compatibility.
Details
• Adapted model architecture and layer mapping to match Falcon-H1.
• Integrated Falcon-H1 tokenizer with automatic fallback if tokenizer files are missing.
• Added new test cases to verify Falcon-H1 model loading and inference.
• Cleaned up redundant code from previous Falcon integration attempts.
Notes
• The Falcon-H1 integration follows the same approach as other model families (see llama and Mamba support).
• This supersedes #14238 with a cleaner and more modular implementation.
• Refer to the Falcon-H1 repo for model weights and tokenizer files.