Name and Version
version: 5307 (814f795)
built with cc (Ubuntu 14.2.0-4ubuntu2~24.04) 14.2.0 for aarch64-linux-gnu

Operating systems
Linux

GGML backends
BLAS

Hardware
Cobalt-100 Azure ARM
Models
Qwen3-4B-128K-Q4* (tried all matching quantizations)
Problem description & steps to reproduce
llama-speculative aborts at startup on this ARM build: the command below crashes with GGML_ASSERT(batch.n_tokens > 0) during the draft model's warmup, before any prompt is processed (full log below).

../build/bin/llama-speculative -m Qwen3-4B-128K-IQ4_XS.gguf -md Qwen3-4B-128K-Q4_0.gguf
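For context, speculative decoding needs a target model (-m) and a draft model (-md); here both are the same Qwen3 4B model in different quantizations, so the vocabularies match. Below is a hedged sketch of a fuller invocation (-p, -n, and --draft-max are standard llama.cpp common options around this build, but verify against --help on your binary); the abort reproduces regardless, since it fires during warmup before the prompt is ever decoded:

# Illustrative sketch only; flag availability may vary by build, check --help.
# -p: prompt to extend, -n: tokens to generate, --draft-max: draft tokens per step.
../build/bin/llama-speculative \
  -m Qwen3-4B-128K-IQ4_XS.gguf \
  -md Qwen3-4B-128K-Q4_0.gguf \
  -p "Write a haiku about ARM servers." \
  -n 128 --draft-max 16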
First Bad Commit
No response

Relevant log output
build: 5307 (814f795e) with cc (Ubuntu 14.2.0-4ubuntu2~24.04) 14.2.0 for aarch64-linux-gnu
llama_model_loader: loaded meta data with 44 key-value pairs and 398 tensors from Qwen3-4B-128K-IQ4_XS.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3-4B-128K
llama_model_loader: - kv 3: general.finetune str = 128k
llama_model_loader: - kv 4: general.basename str = Qwen3-4B-128K
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 4B
llama_model_loader: - kv 7: general.license str = apache-2.0
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B/...
llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 10: general.base_model.count u32 = 1
llama_model_loader: - kv 11: general.base_model.0.name str = Qwen3 4B
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-4B
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
llama_model_loader: - kv 15: qwen3.block_count u32 = 36
llama_model_loader: - kv 16: qwen3.context_length u32 = 131072
llama_model_loader: - kv 17: qwen3.embedding_length u32 = 2560
llama_model_loader: - kv 18: qwen3.feed_forward_length u32 = 9728
llama_model_loader: - kv 19: qwen3.attention.head_count u32 = 32
llama_model_loader: - kv 20: qwen3.attention.head_count_kv u32 = 8
llama_model_loader: - kv 21: qwen3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 22: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 23: qwen3.attention.key_length u32 = 128
llama_model_loader: - kv 24: qwen3.attention.value_length u32 = 128
llama_model_loader: - kv 25: qwen3.rope.scaling.type str = yarn
llama_model_loader: - kv 26: qwen3.rope.scaling.factor f32 = 4.000000
llama_model_loader: - kv 27: qwen3.rope.scaling.original_context_length u32 = 32768
llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 29: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 35: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 36: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 38: general.quantization_version u32 = 2
llama_model_loader: - kv 39: general.file_type u32 = 30
llama_model_loader: - kv 40: quantize.imatrix.file str = Qwen3-4B-128K-GGUF/imatrix_unsloth.dat
llama_model_loader: - kv 41: quantize.imatrix.dataset str = unsloth_calibration_Qwen3-4B-128K.txt
llama_model_loader: - kv 42: quantize.imatrix.entries_count i32 = 252
llama_model_loader: - kv 43: quantize.imatrix.chunks_count i32 = 10
llama_model_loader: - type f32: 145 tensors
llama_model_loader: - type q5_K: 36 tensors
llama_model_loader: - type q6_K: 1 tensors
llama_model_loader: - type iq4_xs: 216 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = IQ4_XS - 4.25 bpw
print_info: file size = 2.11 GiB (4.50 BPW)
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 2560
print_info: n_layer = 36
print_info: n_head = 32
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_swa_pattern = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 9728
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = yarn
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 0.25
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 4B
print_info: model params = 4.02 B
print_info: general.name = Qwen3-4B-128K
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: CPU_Mapped model buffer size = 2159.88 MiB
.......................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 0.25
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.58 MiB
llama_kv_cache_unified: kv_size = 4096, type_k = 'f16', type_v = 'f16', n_layer = 36, can_shift = 1, padding = 32
llama_kv_cache_unified: CPU KV buffer size = 576.00 MiB
llama_kv_cache_unified: KV self size = 576.00 MiB, K (f16): 288.00 MiB, V (f16): 288.00 MiB
llama_context: CPU compute buffer size = 301.75 MiB
llama_context: graph nodes = 1374
llama_context: graph splits = 578 (with bs=512), 1 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
llama_model_loader: loaded meta data with 44 key-value pairs and 398 tensors from Qwen3-4B-128K-Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3-4B-128K
llama_model_loader: - kv 3: general.finetune str = 128k
llama_model_loader: - kv 4: general.basename str = Qwen3-4B-128K
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 4B
llama_model_loader: - kv 7: general.license str = apache-2.0
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/Qwen/Qwen3-4B/...
llama_model_loader: - kv 9: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 10: general.base_model.count u32 = 1
llama_model_loader: - kv 11: general.base_model.0.name str = Qwen3 4B
llama_model_loader: - kv 12: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 13: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-4B
llama_model_loader: - kv 14: general.tags arr[str,2] = ["unsloth", "text-generation"]
llama_model_loader: - kv 15: qwen3.block_count u32 = 36
llama_model_loader: - kv 16: qwen3.context_length u32 = 131072
llama_model_loader: - kv 17: qwen3.embedding_length u32 = 2560
llama_model_loader: - kv 18: qwen3.feed_forward_length u32 = 9728
llama_model_loader: - kv 19: qwen3.attention.head_count u32 = 32
llama_model_loader: - kv 20: qwen3.attention.head_count_kv u32 = 8
llama_model_loader: - kv 21: qwen3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 22: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 23: qwen3.attention.key_length u32 = 128
llama_model_loader: - kv 24: qwen3.attention.value_length u32 = 128
llama_model_loader: - kv 25: qwen3.rope.scaling.type str = yarn
llama_model_loader: - kv 26: qwen3.rope.scaling.factor f32 = 4.000000
llama_model_loader: - kv 27: qwen3.rope.scaling.original_context_length u32 = 32768
llama_model_loader: - kv 28: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 29: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 33: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 35: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 36: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 37: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 38: general.quantization_version u32 = 2
llama_model_loader: - kv 39: general.file_type u32 = 2
llama_model_loader: - kv 40: quantize.imatrix.file str = Qwen3-4B-128K-GGUF/imatrix_unsloth.dat
llama_model_loader: - kv 41: quantize.imatrix.dataset str = unsloth_calibration_Qwen3-4B-128K.txt
llama_model_loader: - kv 42: quantize.imatrix.entries_count i32 = 252
llama_model_loader: - kv 43: quantize.imatrix.chunks_count i32 = 10
llama_model_loader: - type f32: 145 tensors
llama_model_loader: - type q4_0: 248 tensors
llama_model_loader: - type q4_1: 4 tensors
llama_model_loader: - type q6_K: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_0
print_info: file size = 2.21 GiB (4.71 BPW)
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 2560
print_info: n_layer = 36
print_info: n_head = 32
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_swa_pattern = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 9728
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = yarn
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 0.25
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 4B
print_info: model params = 4.02 B
print_info: general.name = Qwen3-4B-128K
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: CPU_Mapped model buffer size = 2246.67 MiB
load_tensors: CPU_KLEIDIAI model buffer size = 1895.62 MiB
........................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 0.25
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.58 MiB
llama_kv_cache_unified: kv_size = 4096, type_k = 'f16', type_v = 'f16', n_layer = 36, can_shift = 1, padding = 32
llama_kv_cache_unified: CPU KV buffer size = 576.00 MiB
llama_kv_cache_unified: KV self size = 576.00 MiB, K (f16): 288.00 MiB, V (f16): 288.00 MiB
llama_context: CPU compute buffer size = 301.75 MiB
llama_context: graph nodes = 1374
llama_context: graph splits = 82 (with bs=512), 1 (with bs=1)
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
/home/alerant/llama.cpp/src/llama-batch.cpp:282: GGML_ASSERT(batch.n_tokens > 0) failed
../build/bin/llama-speculative(+0x1bfcd4)[0xc69700a3fcd4]
../build/bin/llama-speculative(+0x1bfe9c)[0xc69700a3fe9c]
../build/bin/llama-speculative(+0xf15bc)[0xc697009715bc]
../build/bin/llama-speculative(+0xfc234)[0xc6970097c234]
../build/bin/llama-speculative(+0x22976c)[0xc69700aa976c]
../build/bin/llama-speculative(+0x224f0)[0xc697008a24f0]
/lib/aarch64-linux-gnu/libc.so.6(+0x284c4)[0xf5652a3d84c4]
/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98)[0xf5652a3d8598]
../build/bin/llama-speculative(+0x2b3f0)[0xc697008ab3f0]
Aborted (core dumped)
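The failed assertion, GGML_ASSERT(batch.n_tokens > 0) in src/llama-batch.cpp, indicates that a zero-token batch reached the batch validation invoked from llama_decode; per the log it fires during the draft model's warmup, immediately after the second "warming up the model with an empty run" message. Also possibly relevant for triage: only the Q4_0 draft model gets a CPU_KLEIDIAI buffer on this ARM build, while the IQ4_XS target does not. Two diagnostic steps, offered as a sketch rather than a confirmed fix: skip the warmup pass with --no-warmup (the flag the log itself mentions), and re-run under gdb to get a symbolized backtrace, since the addresses above are unresolved.

# Sketch: first isolate the warmup path, then capture a readable backtrace.
../build/bin/llama-speculative -m Qwen3-4B-128K-IQ4_XS.gguf -md Qwen3-4B-128K-Q4_0.gguf --no-warmup
gdb -ex run --args ../build/bin/llama-speculative \
  -m Qwen3-4B-128K-IQ4_XS.gguf -md Qwen3-4B-128K-Q4_0.gguf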