
Support Seed-Coder chat template #13472

Open
wants to merge 1 commit into master

Conversation

yeahdongcn
Contributor

Make sure to read the contributing guidelines before submitting a PR

https://github.com/ByteDance-Seed/Seed-Coder

Testing Done

  • Build passed
  • llama-cli functions as expected on MUSA (see logs below)

Logs:

root@fc47c123a3e7:/ws# ./build/bin/llama-cli -m /models/ByteDance-Seed/Seed-Coder-8B-Reasoning/Seed-Coder-8B-Reasoning-Q4_K_M.gguf -ngl 999
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 MUSA devices:
  Device 0: MTT S80, compute capability 2.1, VMM: yes
build: 0 (unknown) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device MUSA0 (MTT S80) - 15752 MiB free
llama_model_loader: loaded meta data with 35 key-value pairs and 291 tensors from /models/ByteDance-Seed/Seed-Coder-8B-Reasoning/Seed-Coder-8B-Reasoning-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Seed Coder 8B Reasoning
llama_model_loader: - kv   3:                           general.finetune str              = Reasoning
llama_model_loader: - kv   4:                           general.basename str              = Seed-Coder
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                            general.license str              = mit
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Seed Coder 8B Base
llama_model_loader: - kv   9:          general.base_model.0.organization str              = ByteDance Seed
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/ByteDance-Seed...
llama_model_loader: - kv  11:                          llama.block_count u32              = 32
llama_model_loader: - kv  12:                       llama.context_length u32              = 65536
llama_model_loader: - kv  13:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  14:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  15:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  16:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  17:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  18:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  19:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  20:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  21:                           llama.vocab_size u32              = 155136
llama_model_loader: - kv  22:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = seed-coder
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,155136]  = ["<[begin▁of▁sentence]>", "<[PAD�...
llama_model_loader: - kv  26:                  tokenizer.ggml.token_type arr[i32,155136]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,154737]  = ["Ġ Ġ", "Ġ t", "i n", "Ġ a", "e r...
llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  30:          tokenizer.ggml.seperator_token_id u32              = 6
llama_model_loader: - kv  31:            tokenizer.ggml.padding_token_id u32              = 1
llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv  33:               general.quantization_version u32              = 2
llama_model_loader: - kv  34:                          general.file_type u32              = 15
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 4.72 GiB (4.91 BPW) 
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 128
load: token to piece cache size = 0.9334 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 65536
print_info: n_embd           = 4096
print_info: n_layer          = 32
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 14336
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 65536
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 8B
print_info: model params     = 8.25 B
print_info: general.name     = Seed Coder 8B Reasoning
print_info: vocab type       = BPE
print_info: n_vocab          = 155136
print_info: n_merges         = 154737
print_info: BOS token        = 0 '<[begin▁of▁sentence]>'
print_info: EOS token        = 2 '<[end▁of▁sentence]>'
print_info: SEP token        = 6 '<[SEP▁TOKEN]>'
print_info: PAD token        = 1 '<[PAD▁TOKEN]>'
print_info: LF token         = 326 'Ċ'
print_info: EOG token        = 2 '<[end▁of▁sentence]>'
print_info: max token length = 1024
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors:        MUSA0 model buffer size =  4489.62 MiB
load_tensors:   CPU_Mapped model buffer size =   340.88 MiB
....................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 500000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (65536) -- the full capacity of the model will not be utilized
llama_context:  MUSA_Host  output buffer size =     0.59 MiB
llama_kv_cache_unified: kv_size = 4096, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 32
llama_kv_cache_unified:      MUSA0 KV buffer size =   512.00 MiB
llama_kv_cache_unified: KV self size  =  512.00 MiB, K (f16):  256.00 MiB, V (f16):  256.00 MiB
llama_context:      MUSA0 compute buffer size =   311.00 MiB
llama_context:  MUSA_Host compute buffer size =    16.01 MiB
llama_context: graph nodes  = 1094
llama_context: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 6
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example:
<[begin▁of▁sentence]>user
Hello<[end▁of▁sentence]><[begin▁of▁sentence]>user
How are you?<[end▁of▁sentence]><[begin▁of▁sentence]>assistant


system_info: n_threads = 6 (n_threads_batch = 6) / 12 | MUSA : PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: interactive mode on.
sampler seed: 252879895
sampler params: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 0

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Not using system message. To change it, set a different value via -sys PROMPT


> Write python code to perform GEMM
<think>Okay, I need to write Python code that performs GEMM (General Matrix Multiply). Let's think about how to do that.

GEMM is the standard matrix multiplication where C = alpha * A * B + beta * C. But in some implementations, there might be variations. Wait, the problem statement here says to perform GEMM. So I think the input parameters would be matrices A, B, and possibly alpha and beta, and the output is the result C.

Wait, the problem statement says to write code to perform GEMM. Let me recall the exact definition. Oh right, the general form is C = alpha*op(A)*op(B) + beta*C, where op is optional transpose. But perhaps the problem is assuming that we don't transpose the matrices, and alpha and beta are given. But maybe the code should handle that.

...

@@ -183,6 +184,9 @@ llm_chat_template llm_chat_detect_template(const std::string & tmpl) {
return LLM_CHAT_TEMPLATE_BAILING;
} else if (tmpl_contains("<|header_start|>") && tmpl_contains("<|header_end|>")) {
return LLM_CHAT_TEMPLATE_LLAMA4;
} else if (tmpl_contains("raise_exception") && tmpl_contains("System role not supported") &&
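
For context, detection here is plain substring matching against the raw Jinja template text. The following is a small self-contained sketch of that idea only (assumed for illustration; it is neither llama.cpp's actual helper nor the PR's full condition, which is truncated above):

#include <string>

// Assumed sketch: probe the raw chat template text for distinctive substrings.
// The real tmpl_contains helper in llama-chat.cpp may differ in signature and detail.
static bool tmpl_contains(const std::string & tmpl, const std::string & needle) {
    return tmpl.find(needle) != std::string::npos;
}

// Mirrors the (truncated) condition in the diff above. Both strings also occur in
// other templates that reject a system role, which is the concern raised below.
static bool looks_like_seed_coder_reasoning(const std::string & tmpl) {
    return tmpl_contains(tmpl, "raise_exception") &&
           tmpl_contains(tmpl, "System role not supported");
}
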
Collaborator

This will clash with many other chat templates; many of them contain the same messages as this.

Collaborator

Also, this only applies to the Reasoning model, see Instruct:
https://huggingface.co/ByteDance-Seed/Seed-Coder-8B-Instruct/blob/main/tokenizer_config.json#L1029

In fact, these models don't seem to have proper chat templates at all?

Contributor Author

🤔 I actually compared the templates from llama4 and ling-lite before choosing these conditions. Do you have any recommended strings to choose? Thanks!

{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% set role = message['role'] %}{{ bos_token + role + '\n' + message['content'] | trim + eos_token }}{% endfor %}{% if add_generation_prompt %}{{ bos_token + 'assistant\n'}}{% endif %}
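
To make the format concrete, here is a minimal self-contained sketch (assumed, not part of the PR) of the rendering this template describes: bos_token + role + '\n' + trimmed content + eos_token per message, and bos_token + 'assistant\n' when a generation prompt is requested. The BOS/EOS strings are taken from the print_info lines in the log above; content trimming is omitted for brevity.

#include <iostream>
#include <string>
#include <vector>

struct chat_msg { std::string role; std::string content; };

// Assumed sketch of what the Jinja template above produces.
static std::string render_seed_coder(const std::vector<chat_msg> & msgs, bool add_generation_prompt) {
    const std::string bos = "<[begin▁of▁sentence]>";  // BOS token string from the log
    const std::string eos = "<[end▁of▁sentence]>";    // EOS token string from the log
    std::string out;
    for (const auto & m : msgs) {
        out += bos + m.role + "\n" + m.content + eos;  // {{ bos_token + role + '\n' + content | trim + eos_token }}
    }
    if (add_generation_prompt) {
        out += bos + "assistant\n";                    // {{ bos_token + 'assistant\n' }}
    }
    return out;
}

int main() {
    const std::vector<chat_msg> msgs = {{"user", "Hello"}, {"assistant", "Hi there"}, {"user", "How are you?"}};
    std::cout << render_seed_coder(msgs, /*add_generation_prompt=*/true) << "\n";
}

The shape matches the chat template example printed by llama-cli in the log above: bos plus the role on one line, then the content followed by eos.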

Contributor Author

> Also, this only applies to the Reasoning model, see Instruct: https://huggingface.co/ByteDance-Seed/Seed-Coder-8B-Instruct/blob/main/tokenizer_config.json#L1029
>
> In fact, these models don't seem to have proper chat templates at all?

Ah... I didn’t expect that.

Collaborator

@CISC May 12, 2025


It's a very strange chat format, and as long as the chat templates are this sparse I don't think you can expect to detect them properly. Use --jinja for now.
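
For readers following along, that maps to adding --jinja to the invocation from the log above; the flag makes llama-cli use the Jinja chat template embedded in the GGUF instead of the built-in detected one:

./build/bin/llama-cli -m /models/ByteDance-Seed/Seed-Coder-8B-Reasoning/Seed-Coder-8B-Reasoning-Q4_K_M.gguf -ngl 999 --jinja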

Contributor Author

According to @shenz2-2000’s reply in ByteDance-Seed/Seed-Coder#2, the difference in chat template configurations between the instruct and reasoning models is expected.

The only shared substring I could find is:
'\n' + message['content'] | trim + eos_token }}{% endfor %}{% if add_generation_prompt %}{{ bos_token + 'assistant\n'}}{% endif %}, which also appears quite generic. So currently, I’m unsure how to reliably identify the model type based on the template alone.
