Eval bug: nomic-embed-text-v2-moe GGML_ASSERT(pc_type == ...) failed #13534

Closed
danclaytondev opened this issue May 14, 2025 · 3 comments
Comments

danclaytondev commented May 14, 2025

Name and Version

Tested on MacBook Pro:

$ ./llama-server --version
version: 5372 (ab3971f2)
built with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin23.6.0

and

Docker Container with an A100

$ ./llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA A100-SXM4-80GB, compute capability 8.0, VMM: yes
load_backend: loaded CUDA backend from /app/libggml-cuda.so
load_backend: loaded CPU backend from /app/libggml-cpu-haswell.so
version: 5361 (cf0a43bb)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

Operating systems

Mac, Linux

GGML backends

CUDA, Metal

Hardware

Apple M3 Pro
+
server-cuda container with an A100

Models

nomic-ai/nomic-embed-text-v2-moe-GGUF

Problem description & steps to reproduce

Running the command in nomic's model card:

llama-server -m nomic-embed-text-v2-moe.bf16.gguf --embeddings

causes the following error:

src/llama-vocab.cpp:1472: GGML_ASSERT(pc_type == GGUF_TYPE_INT8 || pc_type == GGUF_TYPE_UINT8) failed

I get the same error with the f32 GGUF, and the same error when running in the server-cuda container.

First Bad Commit

No response

Relevant log output

./llama-server -m ~/Downloads/nomic-embed-text-v2-moe.bf16.gguf --embeddings --port 8090
build: 5372 (ab3971f2) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin23.6.0
system info: n_threads = 5, n_threads_batch = 5, total_threads = 11

system_info: n_threads = 5 (n_threads_batch = 5) / 11 | Metal : EMBED_LIBRARY = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | DOTPROD = 1 | ACCELERATE = 1 | AARCH64_REPACK = 1 |

main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 8090, http threads: 10
main: loading model
srv    load_model: loading model '/Users/daniel.clayton/Downloads/nomic-embed-text-v2-moe.bf16.gguf'
llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 12287 MiB free
llama_model_loader: loaded meta data with 45 key-value pairs and 142 tensors from /Users/daniel.clayton/Downloads/nomic-embed-text-v2-moe.bf16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = nomic-bert-moe
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = nomic-embed-text-v2-moe
llama_model_loader: - kv   3:                            general.version str              = 2048
llama_model_loader: - kv   4:                       general.organization str              = Nomic Ai
llama_model_loader: - kv   5:                           general.basename str              = nomic-xlm
llama_model_loader: - kv   6:                         general.size_label str              = 8x277M
llama_model_loader: - kv   7:                            general.license str              = apache-2.0
llama_model_loader: - kv   8:                   general.base_model.count u32              = 1
llama_model_loader: - kv   9:                  general.base_model.0.name str              = Nomic Embed Text v2 Moe Unsupervised
llama_model_loader: - kv  10:          general.base_model.0.organization str              = Nomic Ai
llama_model_loader: - kv  11:              general.base_model.0.repo_url str              = https://huggingface.co/nomic-ai/nomic...
llama_model_loader: - kv  12:                               general.tags arr[str,4]       = ["sentence-transformers", "sentence-s...
llama_model_loader: - kv  13:                          general.languages arr[str,101]     = ["en", "es", "fr", "de", "it", "pt", ...
llama_model_loader: - kv  14:                 nomic-bert-moe.block_count u32              = 12
llama_model_loader: - kv  15:              nomic-bert-moe.context_length u32              = 512
llama_model_loader: - kv  16:            nomic-bert-moe.embedding_length u32              = 768
llama_model_loader: - kv  17:         nomic-bert-moe.feed_forward_length u32              = 3072
llama_model_loader: - kv  18:        nomic-bert-moe.attention.head_count u32              = 12
llama_model_loader: - kv  19: nomic-bert-moe.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv  20:            nomic-bert-moe.attention.causal bool             = false
llama_model_loader: - kv  21:                nomic-bert-moe.pooling_type u32              = 1
llama_model_loader: - kv  22:              nomic-bert-moe.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  23:          nomic-bert-moe.moe_every_n_layers u32              = 2
llama_model_loader: - kv  24:                nomic-bert-moe.expert_count u32              = 8
llama_model_loader: - kv  25:           nomic-bert-moe.expert_used_count u32              = 2
llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = t5
llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,250048]  = ["<s>", "<pad>", "</s>", "<unk>", ","...
llama_model_loader: - kv  29:                      tokenizer.ggml.scores arr[f32,250048]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  30:                  tokenizer.ggml.token_type arr[i32,250048]  = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  31:            tokenizer.ggml.add_space_prefix bool             = true
llama_model_loader: - kv  32:            tokenizer.ggml.token_type_count u32              = 1
llama_model_loader: - kv  33:    tokenizer.ggml.remove_extra_whitespaces bool             = true
llama_model_loader: - kv  34:        tokenizer.ggml.precompiled_charsmap arr[i32,237539]  = [0, 180, 2, 0, 0, 132, 0, 0, 0, 0, 0,...
llama_model_loader: - kv  35:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  36:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  37:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  38:          tokenizer.ggml.seperator_token_id u32              = 2
llama_model_loader: - kv  39:            tokenizer.ggml.padding_token_id u32              = 1
llama_model_loader: - kv  40:               tokenizer.ggml.mask_token_id u32              = 250001
llama_model_loader: - kv  41:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  42:               tokenizer.ggml.add_eos_token bool             = true
llama_model_loader: - kv  43:               general.quantization_version u32              = 2
llama_model_loader: - kv  44:                          general.file_type u32              = 32
llama_model_loader: - type  f32:   93 tensors
llama_model_loader: - type bf16:   49 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = BF16
print_info: file size   = 906.80 MiB (16.00 BPW)
/Users/daniel.clayton/code/llama.cpp/src/llama-vocab.cpp:1472: GGML_ASSERT(pc_type == GGUF_TYPE_INT8 || pc_type == GGUF_TYPE_UINT8) failed
Collaborator

CISC commented May 14, 2025

This is strange: tokenizer.ggml.precompiled_charsmap should be UINT8, yet the log above reports it as arr[i32,237539], i.e. INT32.

The charsmap array comes from sentencepiece and should be in the form of bytes, which GGUFWriter automatically stores as a UINT8 array, but apparently that did not happen for whoever converted the linked model...
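
For anyone who wants to check a file before loading it, the element type of an array key can be read straight from the GGUF metadata header. The following is a minimal standard-library sketch, assuming a little-endian GGUF v2/v3 file and the value type ids from the GGUF specification; the function name `array_elem_type` and its helpers are illustrative, not part of llama.cpp or gguf-py.

```python
import io
import struct

# GGUF metadata value type ids, as listed in the GGUF specification.
GGUF_TYPES = {0: "UINT8", 1: "INT8", 2: "UINT16", 3: "INT16", 4: "UINT32",
              5: "INT32", 6: "FLOAT32", 7: "BOOL", 8: "STRING", 9: "ARRAY",
              10: "UINT64", 11: "INT64", 12: "FLOAT64"}
# Byte width of each fixed-size scalar type (STRING and ARRAY are variable).
SCALAR_SIZE = {0: 1, 1: 1, 2: 2, 3: 2, 4: 4, 5: 4, 6: 4, 7: 1,
               10: 8, 11: 8, 12: 8}
STRING, ARRAY = 8, 9


def _read(f, fmt):
    """Read one little-endian value described by a struct format string."""
    size = struct.calcsize(fmt)
    data = f.read(size)
    if len(data) != size:
        raise EOFError("unexpected end of GGUF metadata")
    return struct.unpack(fmt, data)[0]


def _read_str(f):
    """Read a GGUF string: uint64 length followed by UTF-8 bytes."""
    length = _read(f, "<Q")
    return f.read(length).decode("utf-8")


def _skip_value(f, vtype):
    """Skip one value. Returns (element type name, count) for arrays, else None."""
    if vtype == STRING:
        f.seek(_read(f, "<Q"), io.SEEK_CUR)
    elif vtype == ARRAY:
        etype = _read(f, "<I")
        count = _read(f, "<Q")
        if etype in SCALAR_SIZE:
            f.seek(SCALAR_SIZE[etype] * count, io.SEEK_CUR)
        else:
            # Variable-size elements (strings, nested arrays): skip one by one.
            for _ in range(count):
                _skip_value(f, etype)
        return GGUF_TYPES[etype], count
    else:
        f.seek(SCALAR_SIZE[vtype], io.SEEK_CUR)
    return None


def array_elem_type(f, wanted_key):
    """Return (element type, count) if wanted_key is an array, else None."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    _read(f, "<I")             # format version
    _read(f, "<Q")             # tensor count (unused here)
    kv_count = _read(f, "<Q")  # number of metadata key-value pairs
    for _ in range(kv_count):
        key = _read_str(f)
        vtype = _read(f, "<I")
        info = _skip_value(f, vtype)
        if key == wanted_key:
            return info
    return None
```

Calling `array_elem_type(open(path, "rb"), "tokenizer.ggml.precompiled_charsmap")` on a file like the one in this report would be expected to return an INT32 element type, whereas a file that passes the assertion should report UINT8 or INT8. The function also returns None when the key is missing or is not an array.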

Collaborator

CISC commented May 14, 2025

Actually, I see the uploader was @cebtenzzre .. any clue? :)

Collaborator

cebtenzzre commented May 14, 2025

I used the gguf-editor-gui contributed in #12930 by @christopherthompson81 to adjust the model name. That must have corrupted the model files. I will see if I can revert to the original GGUFs, or just reconvert them from scratch.

edit: The original, unedited model files have been restored. I would appreciate it if someone could open a bug report against the gguf-editor-gui tool.
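
To illustrate how a byte blob can change type on a round trip, here is a small sketch. It is an assumption about the general mechanism, not a confirmed description of what gguf-editor-gui does: if an editor reads the charsmap into a plain Python list of integers and writes that list back, a writer that infers the element type from the value would no longer see bytes. Both function names below are hypothetical and only mimic that inference.

```python
def infer_array_elem_type(value):
    """Hypothetical mimic of type inference for a GGUF array value.

    bytes-like input maps to UINT8 (as described in this thread);
    a list of Python ints maps to INT32 (matching the arr[i32,...] in the log).
    """
    if isinstance(value, (bytes, bytearray)):
        return "UINT8"
    if value and all(isinstance(v, int) and not isinstance(v, bool) for v in value):
        return "INT32"
    raise TypeError("unsupported array value in this sketch")


def charsmap_to_bytes(values):
    """Convert a charsmap held as a list of ints back to bytes, if lossless."""
    bad = [v for v in values if not 0 <= v <= 255]
    if bad:
        raise ValueError(f"{len(bad)} value(s) do not fit in a byte")
    return bytes(values)
```

Under these assumptions, converting the list back with `charsmap_to_bytes` before writing would restore the UINT8 typing. Reconverting from the original model, as was done here, avoids relying on any such repair.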
