Name and Version
Please note that llama-quantize.exe has been failing since version b5298 (perhaps earlier) on Windows 11 systems.
I have also tested b5342, b5361, and b5371.
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
llama-quantize
Command line
./llama-quantize E:/main-du.gguf i:/llm/David_AU/testfiles/MN-Dark-Universe-MOE-4X12B-Reasoning-Q2_K.gguf Q2_K 8
Problem description & steps to reproduce
ISSUE:
Example (run in PowerShell):
./llama-quantize E:/main-du.gguf i:/llm/David_AU/testfiles/MN-Dark-Universe-MOE-4X12B-Reasoning-Q2_K.gguf Q2_K 8
Generates:
main: build = 5371 (e5c834f)
main: built with MSVC 19.29.30159.0 for Windows AMD64
main: quantizing 'E:/main-du.gguf' to 'i:/llm/David_AU/testfiles/MN-Dark-Universe-MOE-4X12B-Reasoning2-Q2_K.gguf' as Q2_K using 8 threads
llama_model_loader: loaded meta data with 34 key-value pairs and 403 tensors from E:/main-du.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = MN Dark Universe MOE 4X12B Reasoning
llama_model_loader: - kv 3: general.finetune str = Reasoning
llama_model_loader: - kv 4: general.basename str = MN-Dark-Universe-MOE
llama_model_loader: - kv 5: general.size_label str = 4X12B
llama_model_loader: - kv 6: llama.block_count u32 = 40
llama_model_loader: - kv 7: llama.context_length u32 = 1024000
llama_model_loader: - kv 8: llama.embedding_length u32 = 5120
llama_model_loader: - kv 9: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 10: llama.attention.head_count u32 = 32
llama_model_loader: - kv 11: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 12: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 13: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 14: llama.expert_count u32 = 4
llama_model_loader: - kv 15: llama.expert_used_count u32 = 4
llama_model_loader: - kv 16: llama.attention.key_length u32 = 128
llama_model_loader: - kv 17: llama.attention.value_length u32 = 128
llama_model_loader: - kv 18: general.file_type u32 = 32
llama_model_loader: - kv 19: llama.vocab_size u32 = 131072
llama_model_loader: - kv 20: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = tekken
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,131072] = ["", "", "", "<|im_start|...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,131072] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,269443] = ["─á ─á", "─á t", "e r", "i n", "─á ─...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 4
llama_model_loader: - kv 29: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 31: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 32: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 33: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - type f32: 121 tensors
llama_model_loader: - type bf16: 282 tensors
-> DIES here, Windows error:
An unhandled win32 exception occurred in [5900] llama-quantize.exe
Additional details:
I am downloading the pre-made "exes" from releases.
I tried CUDA 12.4, CUDA 11.7, and x64 CPU builds; all give the same result.
The issue seems to be with loading the source files themselves.
Versions b5215 and earlier have no issues.
Also: ./llama-quantize --help works with no issue.
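Since the crash happens while the model is being loaded, one way to help narrow it down is to sanity-check the source file's GGUF header independently of llama.cpp. Below is a minimal sketch based on the GGUF v3 file layout (little-endian: 4-byte magic "GGUF", uint32 version, uint64 tensor count, uint64 key-value count); the temp-file demo at the bottom is purely illustrative, using the counts reported in the log above.

```python
import struct

def read_gguf_header(path):
    """Read the fixed-size GGUF header and return its fields."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file: magic={magic!r}")
        version, = struct.unpack("<I", f.read(4))    # uint32 format version
        n_tensors, = struct.unpack("<Q", f.read(8))  # uint64 tensor count
        n_kv, = struct.unpack("<Q", f.read(8))       # uint64 metadata KV count
    return {"version": version, "n_tensors": n_tensors, "n_kv": n_kv}

if __name__ == "__main__":
    # Illustrative only: write a fake header matching the log above
    # (GGUF v3, 403 tensors, 34 key-value pairs) and read it back.
    import os
    import tempfile
    fd, path = tempfile.mkstemp(suffix=".gguf")
    with os.fdopen(fd, "wb") as f:
        f.write(b"GGUF" + struct.pack("<IQQ", 3, 403, 34))
    print(read_gguf_header(path))  # {'version': 3, 'n_tensors': 403, 'n_kv': 34}
    os.remove(path)
```

If the header parses cleanly here (as the llama_model_loader output above suggests it does), the crash is more likely to be in the metadata/tensor processing after the header, not in reading the file itself.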
First Bad Commit
b5298 (perhaps earlier)
Relevant log output