llama: add initial support for Falcon-H1 model family #14534
base: master
Conversation
Is the old GGUF still valid, or must a new one be generated?
@jacekpoplawski
…-public into add-fh1-rebased
injected mup
Hello @ggerganov and @compilade, my colleague @younesbelkada and I have addressed all the comments you provided. In addition, we've introduced a few enhancements to further improve the implementation. Let us know if you have any further feedback!
if (
    chkhsh == "60476e1243776c4fb1b993dbd7a5f15ac22f83c80afdf425fa5ae01c8d44ef86" or
    chkhsh == "3eda48b4c4dc7de733d1a8b3e3b4a85243dbbf704da2ee9d42c6beced8897896" or
    chkhsh == "48f8e02c0359c0bbdd82f26909171fac1c18a457bb47573ed1fe3bbb2c1cfd4b" or
    chkhsh == "a6b57017d60e6edb4d88ecc2845188e0eb333a70357e45dcc9b53964a73bbae6"
):
    # ref: https://huggingface.co/collections/tiiuae/falcon-h1-6819f2795bc406da60fab8df
    res = "falcon_h1"
Why do we have multiple hashes here?
This section should be generated by the convert_hf_to_gguf_update.py script and it will be overwritten the next time we run it.
@@ -288,6 +289,7 @@ class MODEL_ARCH(IntEnum):
     LLAMA4 = auto()
     DECI = auto()
     FALCON = auto()
+    FALCON_H1 = auto()
Suggested change (the two versions differ only in whitespace):
-    FALCON_H1 = auto()
+    FALCON_H1 = auto()
@@ -660,6 +662,7 @@ class MODEL_TENSOR(IntEnum):
     MODEL_ARCH.DOTS1: "dots1",
     MODEL_ARCH.ARCEE: "arcee",
     MODEL_ARCH.ERNIE4_5: "ernie4_5",
+    MODEL_ARCH.FALCON_H1: "falcon_h1",
Suggested change (the two versions differ only in whitespace):
-    MODEL_ARCH.FALCON_H1: "falcon_h1",
+    MODEL_ARCH.FALCON_H1: "falcon_h1",
    // TODO: There are currently no hybrid models! Once there are, this will be
    // the place to identify them
    switch (arch) {
        case LLM_ARCH_FALCON_H1:
Remove the "TODO" comment as it is no longer relevant.
    LLM_KV_SSM_HEAD_DIM,
    LLM_KV_MAMBA_D_SSM,
    LLM_KV_N_LAYER,
    LLM_KV_FALCON_H1_MAMBA_RMS_NORM,
Can this be simply LLM_KV_MAMBA_RMS_NORM? If there isn't anything very specific to Falcon-H1, it's better to keep the names generic.
    bool ssm_rms_norm = false;
    bool ssm_conv_bias = false;
    bool ssm_proj_bias = false;
Are these used?
    uint32_t attn_head_dim = 0;
    bool mamba_rms_norm = false;
    uint32_t vocab_size = 0;
    uint32_t intermediate_size = 0;
Seems unused?
    uint32_t attn_head_dim = 0;
    bool mamba_rms_norm = false;
    uint32_t vocab_size = 0;
Normally, we don't put the vocab_size as an hparam. Instead, we pick it from the llama_vocab. So this is likely not needed.
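For illustration, a minimal sketch of reading the vocabulary size from the vocab rather than from a new hparam; the vocab variable name and the n_tokens() accessor are assumptions based on the current llama_vocab API, not code from this PR:
    // hedged sketch: take the vocab size from the loaded vocabulary
    // (assumes a llama_vocab instance named vocab is in scope)
    const uint32_t n_vocab = vocab.n_tokens();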
    uint32_t ssm_head_dim = 0;
    uint32_t ssm_mamba_d_ssm = 0;

    uint32_t attn_head_dim = 0;
This parameter can be avoided.
See the logic here:
Lines 529 to 538 in a3403ae
    if (hparams.n_head() > 0) {
        // gpt-neox n_rot = rotary_pct * (n_embd / n_head)
        // gpt-j    n_rot = rotary_dim
        hparams.n_embd_head_k = hparams.n_embd / hparams.n_head();
        ml.get_key(LLM_KV_ATTENTION_KEY_LENGTH, hparams.n_embd_head_k, false);
        hparams.n_embd_head_v = hparams.n_embd / hparams.n_head();
        ml.get_key(LLM_KV_ATTENTION_VALUE_LENGTH, hparams.n_embd_head_v, false);
And as an example of how we apply it for the Gemma model, which can have a custom attention head size like in your case:
Lines 1021 to 1025 in a3403ae
    // ref: https://github.com/google/gemma_pytorch/blob/014acb7ac4563a5f77c76d7ff98f31b568c16508/gemma/config.py#L289
    hparams.f_attention_scale = type == LLM_TYPE_27B
        ? 1.0f / std::sqrt(float(hparams.n_embd / hparams.n_head(0)))
        : 1.0f / std::sqrt(float(hparams.n_embd_head_k));
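For illustration, a sketch of how Falcon-H1 could lean on that same generic path instead of carrying its own attn_head_dim; the n_embd_head and kq_scale names are illustrative, not code from this PR:
    // hedged sketch: once {arch}.attention.key_length/value_length are written at conversion
    // time, the generic loader above fills n_embd_head_k/v, so the graph code can use those
    const int64_t n_embd_head = hparams.n_embd_head_k;
    const float   kq_scale    = 1.0f / sqrtf(float(n_embd_head)); // same form as the Gemma example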
@@ -115,6 +115,17 @@ struct llama_hparams {
     uint32_t ssm_d_state = 0;
     uint32_t ssm_dt_rank = 0;
     uint32_t ssm_n_group = 0;
+    uint32_t ssm_head_dim = 0;
+    uint32_t ssm_mamba_d_ssm = 0;
Should this be just ssm_d_ssm?
Generally, I think the mamba prefix is not needed here. For example:
mamba_rms_norm -> ssm_rms_norm
mamba_expand -> ssm_expand
@compilade Do you agree?
I agree, but I think these are not necessary, since ssm_mamba_d_ssm looks like ssm_d_inner to me, unless I'm misunderstanding.
    inpSA = ggml_add(ctx0, cur, inpSA);
    cb(cur, "layer_out", il);

    if (il == n_layer - 1) {
Suggested change:
-    if (il == n_layer - 1) {
+    if (il == n_layer - 1 && inp_out_ids) {
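For context, a sketch of the pattern this suggestion refers to, as used in other build_* graphs; the tensors pruned here are assumed to be the ones in the snippet above:
    // hedged sketch: on the last layer, keep only the rows needed for the output when inp_out_ids is set
    if (il == n_layer - 1 && inp_out_ids) {
        cur   = ggml_get_rows(ctx0, cur,   inp_out_ids);
        inpSA = ggml_get_rows(ctx0, inpSA, inp_out_ids);
    }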
    y = ggml_mul(ctx0, y, ggml_silu(ctx0, ggml_cont(ctx0, z)));

    // grouped RMS norm
    if (hparams.mamba_rms_norm){
Suggested change:
-    if (hparams.mamba_rms_norm){
+    if (hparams.mamba_rms_norm) {
This could also simply use the presence of model.layers[il].ssm_norm, as I'm suggesting in #14534 (comment).
Suggested change:
-    if (hparams.mamba_rms_norm){
+    if (model.layers[il].ssm_norm) {
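For illustration, a sketch of gating the grouped RMS norm on the tensor itself; the reshape shapes and build_norm call mirror the existing Mamba-2 graph code and are assumptions here, not this PR's exact code:
    // hedged sketch: apply the grouped RMS norm only when the tensor was actually loaded
    if (model.layers[il].ssm_norm) {
        y = ggml_reshape_4d(ctx0, y, d_inner / n_group, n_group, n_seq_tokens, n_seqs);
        y = build_norm(y, model.layers[il].ssm_norm, NULL, LLM_NORM_RMS, il);
        y = ggml_reshape_3d(ctx0, y, d_inner, n_seq_tokens, n_seqs);
    }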
-    return (ssm_d_conv > 0 ? ssm_d_conv - 1 : 0) * (ssm_d_inner + 2*ssm_n_group*ssm_d_state);
+    // check if the architecture is using d_ssm
+    if (ssm_mamba_d_ssm > 0) {
+        return (ssm_d_conv > 0 ? ssm_d_conv - 1 : 0) * (ssm_mamba_d_ssm + 2*ssm_n_group*ssm_d_state);
+    } else {
+        return (ssm_d_conv > 0 ? ssm_d_conv - 1 : 0) * (ssm_d_inner + 2*ssm_n_group*ssm_d_state);
This could be simplified by avoiding the duplication of the ssm_d_inner hparam into ssm_mamba_d_ssm (since both seem to be the same?).
    const int64_t ssm_conv_kernel_size = hparams.ssm_d_conv; // ssm_conv_kernel_size
    const int64_t ssm_n_groups = hparams.ssm_n_group; // ssm_n_groups
    const int64_t ssm_state_size = hparams.ssm_d_state; // ssm_state_size
    const int64_t ssm_intermediate_size = hparams.ssm_mamba_d_ssm > 0 ? hparams.ssm_mamba_d_ssm : int(hparams.mamba_expand * hidden_size); // TODO expand
The expand factor probably doesn't need to be explicitly stored or used if it's always the ratio between n_embd and ssm_d_inner.
    MODEL_TENSOR.SSM_NORM: (
        "model.layers.{bid}.mamba.norm",
    ),
This overrides the other entry for that tensor family. Move the tensor name to the existing entry of MODEL_TENSOR.SSM_NORM.
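For illustration, a sketch of the merged entry in gguf-py/gguf/tensor_mapping.py; the non-Falcon name shown is an approximation of the existing entry, not copied from this PR:
    MODEL_TENSOR.SSM_NORM: (
        "model.layers.{bid}.mamba.norm",     # falcon-h1
        "backbone.layers.{bid}.mixer.norm",  # mamba2 (approximate existing name)
    ),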
    self.gguf_writer.add_uint32("falcon_h1.attention.head_dim", self.hparams["head_dim"])
    self.gguf_writer.add_uint32("falcon_h1.ssm.mamba_d_ssm", self.hparams["mamba_d_ssm"])
    self.gguf_writer.add_uint32("falcon_h1.num_attention_heads", self.find_hparam(["num_attention_heads"]))
    self.gguf_writer.add_uint32("falcon_h1.num_key_value_heads",
Use the existing add_ methods for these key names from gguf_writer.
(Aren't most of those already set above, though?)
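For illustration, a hedged sketch using GGUFWriter helpers that already exist in gguf-py; whether each value is still needed, and whether mamba_d_ssm really equals the inner size, is the open question above:
    self.gguf_writer.add_key_length(self.hparams["head_dim"])
    self.gguf_writer.add_value_length(self.hparams["head_dim"])
    self.gguf_writer.add_head_count(self.find_hparam(["num_attention_heads"]))
    self.gguf_writer.add_head_count_kv(self.find_hparam(["num_key_value_heads"]))
    self.gguf_writer.add_ssm_inner_size(self.hparams["mamba_d_ssm"])  # only if this equals d_inner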
    if (hparams.mamba_rms_norm == true) {
        layer.ssm_norm = create_tensor(tn(LLM_TENSOR_SSM_NORM, "weight", i), {ssm_intermediate_size / ssm_n_groups, ssm_n_groups}, 0);
    }
Is this always true for all Falcon-H1 models?
If not, is this tensor absent when it's false? That could be used (e.g. by making the tensor optional, and only applying the norm when it's present) instead of adding a new metadata key.
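For illustration, a sketch of the optional-tensor approach using the loader's existing not-required flag (dimensions copied from the snippet above):
    // hedged sketch: load the norm tensor only if present and let its presence drive the behaviour
    layer.ssm_norm = create_tensor(tn(LLM_TENSOR_SSM_NORM, "weight", i),
            {ssm_intermediate_size / ssm_n_groups, ssm_n_groups},
            llama_model_loader::TENSOR_NOT_REQUIRED);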
Fixes: #13681
Summary
• Adds initial support for the Falcon-H1 model family.
• Implements model loading, basic inference, and tokenizer integration for Falcon-H1.
• Updates the build scripts and relevant documentation for Falcon-H1 compatibility.
Details
• Adapted model architecture and layer mapping to match Falcon-H1.
• Integrated Falcon-H1 tokenizer with automatic fallback if tokenizer files are missing.
• Added new test cases to verify Falcon-H1 model loading and inference.
• Cleaned up redundant code from previous Falcon integration attempts.
Notes
• The Falcon-H1 integration follows the same approach as other model families (see llama and Mamba support).
• This supersedes #14238 with a cleaner and more modular implementation.
• Refer to the Falcon-H1 repo for model weights and tokenizer files.