Add PLaMo-2 model #14560

Open · wants to merge 55 commits into master

Conversation


@mitmul mitmul commented Jul 7, 2025

Make sure to read the contributing guidelines before submitting a PR

Based on #7531

Here is how to check that plamo-2-translate works with this PR. First, retrieve the model itself:

git clone https://huggingface.co/pfnet/plamo-2-translate

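If the clone comes back with small pointer files instead of the actual weights, Git LFS is probably not set up; enabling it first and re-running the clone should fix that (a general Hugging Face note, not something specific to this PR):

git lfs install
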
Then I needed to modify tokenizer.jsonl, padding it with placeholder vocabulary entries so that the vocabulary size matches the 100032 specified in config.json. I used this script:

#!/usr/bin/env python3
"""Fix PLaMo-2 tokenizer by adding missing padding tokens."""

import json
import shutil

def fix_tokenizer():
    # Backup original file
    shutil.copy("plamo-2-translate/tokenizer.jsonl", "plamo-2-translate/tokenizer.jsonl.backup")
    
    # Read existing tokens
    with open("plamo-2-translate/tokenizer.jsonl", "r", encoding="utf-8") as f:
        lines = f.readlines()
    
    print(f"Current number of tokens: {len(lines)}")
    
    # Add 32 padding tokens (ids 100000-100031),
    # using the same format as the other special tokens in the file
    for i in range(32):
        padding_token = [f"<pad_{i}>", 0.0, "CONTROL", "basic", 8, None, [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]]
        lines.append(json.dumps(padding_token, ensure_ascii=False) + "\n")
    
    # Write back
    with open("plamo-2-translate/tokenizer.jsonl", "w", encoding="utf-8") as f:
        f.writelines(lines)
    
    print(f"New number of tokens: {len(lines)}")
    print("Tokenizer fixed!")

if __name__ == "__main__":
    fix_tokenizer()

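To double-check the padding step, here is a quick sanity-check sketch; it assumes config.json exposes a top-level vocab_size field, as standard Hugging Face configs do:

#!/usr/bin/env python3
"""Sanity check: tokenizer.jsonl should now match config.json."""

import json

# "vocab_size" is assumed to be a top-level key in config.json (standard HF layout)
with open("plamo-2-translate/config.json", "r", encoding="utf-8") as f:
    vocab_size = json.load(f)["vocab_size"]

# Each line of tokenizer.jsonl is one vocabulary entry
with open("plamo-2-translate/tokenizer.jsonl", "r", encoding="utf-8") as f:
    n_tokens = sum(1 for _ in f)

print(f"config.json vocab_size: {vocab_size}, tokenizer.jsonl entries: {n_tokens}")
assert n_tokens == vocab_size, "tokenizer.jsonl still does not match the configured vocab size"
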
Next, convert the model to GGUF with the following command:

python convert_hf_to_gguf.py plamo-2-translate --outfile plamo-2-translate.gguf --outtype f32

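The F32 GGUF is large (about 35.5 GiB, as the log below shows). convert_hf_to_gguf.py accepts other --outtype values as well, so something like the following should give a roughly half-size file; I have not tested this particular output type with PLaMo-2, so treat it as a sketch:

python convert_hf_to_gguf.py plamo-2-translate --outfile plamo-2-translate-f16.gguf --outtype f16
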
Then build binaries as follows:

cmake -B release
cmake --build release --config Release

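On this machine the Metal backend is enabled automatically. On a Linux box with an NVIDIA GPU the equivalent would presumably be the usual CUDA build flags (not tested as part of this PR):

cmake -B release -DGGML_CUDA=ON
cmake --build release --config Release
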
Finally, I successfully ran the plamo-2-translate model as follows:

./release/bin/llama-cli -m plamo-2-translate.gguf -p "<|plamo:op|>dataset\ntranslation\n<|plamo:op|>input lang=English\nHello, how are you?\n<|plamo:op|>output\n" -no-cnv --verbose-prompt --no-warmup -sp
Intermediate outputs:
build: 5876 (272ffdb6) with Apple clang version 17.0.0 (clang-1700.0.13.5) for arm64-apple-darwin24.5.0
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Metal (Apple M1 Max) - 64424 MiB free
llama_model_loader: loaded meta data with 37 key-value pairs and 467 tensors from plamo-2-translate.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = plamo2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Plamo 2 Translate
llama_model_loader: - kv   3:                         general.size_label str              = 10B
llama_model_loader: - kv   4:                            general.license str              = other
llama_model_loader: - kv   5:                       general.license.name str              = plamo-community-license
llama_model_loader: - kv   6:                       general.license.link str              = https://huggingface.co/pfnet/plamo-2-...
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Plamo 2 8b
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Pfnet
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/pfnet/plamo-2-8b
llama_model_loader: - kv  11:                               general.tags arr[str,3]       = ["plamo", "translation", "translation"]
llama_model_loader: - kv  12:                          general.languages arr[str,2]       = ["en", "ja"]
llama_model_loader: - kv  13:             plamo2.attention.head_count_kv arr[i32,32]      = [0, 4, 0, 4, 0, 4, 0, 4, 0, 4, 0, 4, ...
llama_model_loader: - kv  14:                      plamo2.context_length u32              = 10485760
llama_model_loader: - kv  15:                    plamo2.embedding_length u32              = 4096
llama_model_loader: - kv  16:                         plamo2.block_count u32              = 32
llama_model_loader: - kv  17:                plamo2.attention.head_count u32              = 32
llama_model_loader: - kv  18:    plamo2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  19:        plamo2.attention.group_norm_epsilon f32              = 0.000001
llama_model_loader: - kv  20:        plamo2.attention.layer_norm_epsilon f32              = 0.000001
llama_model_loader: - kv  21:                      plamo2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  22:                      plamo2.ssm.state_size u32              = 64
llama_model_loader: - kv  23:                     plamo2.ssm.conv_kernel u32              = 4
llama_model_loader: - kv  24:                  plamo2.ssm.time_step_rank u32              = 64
llama_model_loader: - kv  25:                      plamo2.ssm.inner_size u32              = 8192
llama_model_loader: - kv  26:                     plamo2.ssm.group_count u32              = 0
llama_model_loader: - kv  27:                 plamo2.feed_forward_length u32              = 16384
llama_model_loader: - kv  28:                          general.file_type u32              = 0
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - kv  30:                       tokenizer.ggml.model str              = plamo2
llama_model_loader: - kv  31:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  32:                      tokenizer.ggml.tokens arr[str,100032]  = ["<|plamo:unk|>", "<|plamo:bos|>", "<...
llama_model_loader: - kv  33:                      tokenizer.ggml.scores arr[f32,100032]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  34:                  tokenizer.ggml.token_type arr[i32,100032]  = [2, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  35:                tokenizer.ggml.eos_token_id u32              = 4
llama_model_loader: - kv  36:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - type  f32:  467 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = all F32
print_info: file size   = 35.50 GiB (32.00 BPW) 
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 61
load: token to piece cache size = 0.7989 MB
print_info: arch             = plamo2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 10485760
print_info: n_embd           = 4096
print_info: n_layer          = 32
print_info: n_head           = 32
print_info: n_head_kv        = [0, 4, 0, 4, 0, 4, 0, 4, 0, 4, 0, 4, 0, 4, 0, 4, 0, 4, 0, 4, 0, 4, 0, 4, 0, 4, 0, 4, 0, 4, 0, 4]
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = [0, 8, 0, 8, 0, 8, 0, 8, 0, 8, 0, 8, 0, 8, 0, 8, 0, 8, 0, 8, 0, 8, 0, 8, 0, 8, 0, 8, 0, 8, 0, 8]
print_info: n_embd_k_gqa     = [0, 512, 0, 512, 0, 512, 0, 512, 0, 512, 0, 512, 0, 512, 0, 512, 0, 512, 0, 512, 0, 512, 0, 512, 0, 512, 0, 512, 0, 512, 0, 512]
print_info: n_embd_v_gqa     = [0, 512, 0, 512, 0, 512, 0, 512, 0, 512, 0, 512, 0, 512, 0, 512, 0, 512, 0, 512, 0, 512, 0, 512, 0, 512, 0, 512, 0, 512, 0, 512]
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 16384
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 10485760
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 4
print_info: ssm_d_inner      = 8192
print_info: ssm_d_state      = 64
print_info: ssm_dt_rank      = 64
print_info: ssm_n_group      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 8B
print_info: model params     = 9.53 B
print_info: general.name     = Plamo 2 Translate
print_info: vocab type       = PLaMo2
print_info: n_vocab          = 100032
print_info: n_merges         = 0
print_info: BOS token        = 1 '<|plamo:bos|>'
print_info: EOS token        = 4 '<|plamo:op|>'
print_info: UNK token        = 0 '<|plamo:unk|>'
print_info: PAD token        = 3 '<|plamo:pad|>'
print_info: LF token         = 10 '
'
print_info: EOG token        = 4 '<|plamo:op|>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors: Metal_Mapped model buffer size = 34784.34 MiB
load_tensors:   CPU_Mapped model buffer size =  1563.00 MiB
.............................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (10485760) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_load_library: using embedded metal library
ggml_metal_init: GPU name:   Apple M1 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction   = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets    = true
ggml_metal_init: has bfloat            = true
ggml_metal_init: use bfloat            = false
ggml_metal_init: hasUnifiedMemory      = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 67554.51 MB
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_set_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_c4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f16                (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h192          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk192_hv128   (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk576_hv512   (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h64       (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h96       (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h192      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk576_hv512 (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16                     (not supported)
llama_context:        CPU  output buffer size =     0.38 MiB
llama_kv_cache_unified:      Metal KV buffer size =   128.00 MiB
llama_kv_cache_unified: size =  128.00 MiB (  4096 cells,  16 layers,  1 seqs), K (f16):   64.00 MiB, V (f16):   64.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_memory_recurrent: mem_size = 1, n_seq_max = 1, type_r = 'f32', type_s = 'f32', n_layer = 32
llama_memory_recurrent:      Metal KV buffer size =    33.50 MiB
llama_memory_recurrent: KV self size  =   33.50 MiB, R (f32):    1.50 MiB, S (f32):   32.00 MiB
llama_context:      Metal compute buffer size =   306.10 MiB
llama_context:        CPU compute buffer size =    16.01 MiB
llama_context: graph nodes  = 2038
llama_context: graph splits = 9
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
main: llama threadpool init, n_threads = 8

system_info: n_threads = 8 (n_threads_batch = 8) / 10 | Metal : EMBED_LIBRARY = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | ACCELERATE = 1 | REPACK = 1 | 

main: prompt: '<|plamo:op|>dataset
translation
<|plamo:op|>input lang=English
Hello, how are you?
<|plamo:op|>output
'
main: number of tokens in prompt = 20
     4 -> '<|plamo:op|>'
 45474 -> 'dataset'
    10 -> '
'
 18053 -> 'translation'
    10 -> '
'
     4 -> '<|plamo:op|>'
  1760 -> 'input'
 98700 -> ' lang'
    61 -> '='
 14134 -> 'English'
    10 -> '
'
  6721 -> 'Hello'
    44 -> ','
  1205 -> ' how'
  1089 -> ' are'
  1099 -> ' you'
  1076 -> '?
'
     4 -> '<|plamo:op|>'
  3045 -> 'output'
    10 -> '
'

sampler seed: 64554044
sampler params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0

Output:

<|plamo:op|>dataset
translation
<|plamo:op|>input lang=English
Hello, how are you?
<|plamo:op|>output
こんにちは、ご機嫌いかがですか?
<|plamo:op|> [end of text]


llama_perf_sampler_print:    sampling time =       0.29 ms /    26 runs   (    0.01 ms per token, 89347.08 tokens per second)
llama_perf_context_print:        load time =    6939.57 ms
llama_perf_context_print: prompt eval time =     378.42 ms /    20 tokens (   18.92 ms per token,    52.85 tokens per second)
llama_perf_context_print:        eval time =     625.90 ms /     5 runs   (  125.18 ms per token,     7.99 tokens per second)
llama_perf_context_print:       total time =    7566.19 ms /    25 tokens
ggml_metal_free: deallocating

Seems to be working correctly!
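
For reference, the same prompt should also be usable through llama-server's /completion endpoint; the port, n_predict, and stop string below are my own choices rather than something exercised in this PR, so treat it as a sketch:

./release/bin/llama-server -m plamo-2-translate.gguf --port 8080

curl http://localhost:8080/completion -d '{
  "prompt": "<|plamo:op|>dataset\ntranslation\n<|plamo:op|>input lang=English\nHello, how are you?\n<|plamo:op|>output\n",
  "n_predict": 128,
  "stop": ["<|plamo:op|>"]
}'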

compilade added 30 commits April 3, 2024 20:47
This will be necessary to support Jamba
(and other recurrent models mixed with Attention).

Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.
* llama : begin work on support for variable GQA

This will also be useful for Jamba if we consider the Mamba layers
to have 0 KV heads.

* llama : gracefully fail when not finding hybrid slot
* ggml : simplify SSM-related operators

* llama : make recurrent state slot allocation contiguous

* llama : adapt internal uses of batches to llama_ubatch
This reduces overhead when running hellaswag
on thousands of sequences with very small 100k params Mamba models.
This otherwise was a problem when running the HellaSwag benchmark
with small batch sizes, making it crash.
This removes the need for ggml_ssm_conv!!!
But performance seems slightly worse on my system,
especially for prompt processing.
Maybe ggml_mul_mat isn't optimized for small row sizes?
More performance testing is necessary until GGML_OP_SSM_CONV is removed.

* ggml : make ggml_ssm_scan not modify its source tensors

* llama : fix shared recurrent tail cell count for small ubatch sizes

Otherwise it was impossible to run the 'parallel' example with '-ub 1'
with a Mamba or Jamba model.
* ggml : allow GGML_OP_CONCAT to work on non-contiguous tensors

The implementation already supported it,
and this makes Mamba's conv step slightly faster.
This can be changed back later if the name change is wrong.
I was renaming the functions anyway to generalize kv-cache-related
functions to hybrid and recurrent model architectures.
I think llama_past is a better name than llama_cache for a combined
kv cache and recurrent state cache, because the states it contains
pretty much always come before the newly-added ones for any particular
sequence. Also 'llama_past_clear' sounds more obvious in what it does
than 'llama_kv_cache_clear'. The future is what the models generate.
(For embeddings, the kv cache isn't really used anyway)

Still, I'm open to better suggestions.
compilade and others added 23 commits June 11, 2024 23:27
This also slightly reduces the diff from the master branch
Also begin reverting some implicit state rollback code.
But this time it contains the sub-cache graph inputs.
This *should* make it easier to handle updating the inputs
when caching the graph (eventually).
@mitmul mitmul mentioned this pull request Jul 7, 2025
@github-actions github-actions bot added examples python python script changes labels Jul 7, 2025
@mitmul mitmul changed the title from Mitmul/add plamo2 to Add PLaMo-2 model Jul 7, 2025
@mitmul mitmul marked this pull request as ready for review July 7, 2025 08:36