Name and Version
load_backend: loaded RPC backend from D:\AI\app\llama.cpp\ggml-rpc.dll
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6600M (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from D:\AI\app\llama.cpp\ggml-vulkan.dll
load_backend: loaded CPU backend from D:\AI\app\llama.cpp\ggml-cpu-haswell.dll
version: 5361 (cf0a43b)
built with MSVC 19.43.34808.0 for x64
Operating systems
Windows
GGML backends
Vulkan
Hardware
Ryzen 7 5800H + RX 6600M
Models
bge-m3-FP16.gguf
Problem description & steps to reproduce
Failed to add the embedding model using the llama-b5361-bin-win-cuda12.4-x64 build on a workstation with an RTX 4800. The reranking, LLM, and VLM models can all be added. I then tested on my laptop (Ryzen 7 5800H + RX 6600M) with llama-b5361-bin-win-vulkan-x64, and the embedding model that I had previously added in Dify can no longer connect.
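To narrow down whether this is a llama-server problem or a Dify connectivity problem, the embeddings endpoint can be queried directly, bypassing Dify. Below is a minimal Python sketch, not part of the original report: the host/port follow the log further down (the server listens on 0.0.0.0:8081), while the model name, input text, and request fields are illustrative placeholders.

import json
import urllib.request

# Hypothetical direct probe of the llama-server embeddings endpoint.
# Port 8081 is taken from the log output below; adjust to your ${PORT}.
url = "http://localhost:8081/embeddings"
payload = {"input": "connectivity test", "model": "Bge-m3"}  # placeholder fields

req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(req, timeout=30) as resp:
    print(resp.status)              # 200 means llama-server itself is answering
    print(resp.read(300).decode())  # peek at the start of the response body

If this returns 200 (as the POST /embeddings requests in the log below do) while Dify still reports a connection failure, the problem is more likely on the proxy/Dify side than in llama-server.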
First Bad Commit
I upgrade every day; it still worked normally with the May 5th build.
Relevant log output
"11.Bge-m3":
proxy:
aliases:
- Bge-m3
# `useModelName` overrides the model name in the request# and sends a specific name to the upstream server
useModelName: "Bge-m3"
cmd: >
llama-server
--host 0.0.0.0
--port ${PORT}
--model models/gpustack/bge-m3-FP16.gguf
--ctx-size 8192
--batch-size 8192
--rope-scaling yarn
--rope-freq-scale 0.75
--embeddings
-ngl 99
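A note on the 503 that appears in the log below: llama-server's /health endpoint returns 503 while the model is still loading and 200 once it is ready, which is why the proxy's health check first fails and then passes. A client or health check can simply poll /health before sending embedding requests. A rough sketch (not from the report; port 8081 follows the log output, timeout and interval values are arbitrary):

import time
import urllib.error
import urllib.request

def wait_until_ready(url="http://localhost:8081/health", timeout_s=120, interval_s=2.0):
    """Poll /health until llama-server answers 200 (model loaded) or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except urllib.error.URLError:
            pass  # 503 while the model loads, or connection refused before the port opens
        time.sleep(interval_s)
    return False

if __name__ == "__main__":
    print("server ready:", wait_until_ready())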
[INFO] Request ::1 "GET /upstream HTTP/1.1" 200 740 "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/136.0.0.0 Safari/537.36 Edg/136.0.0.0" 0s
load_backend: loaded RPC backend from D:\AI\app\llama.cpp\ggml-rpc.dll
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6600M (AMD proprietary driver) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 32768 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from D:\AI\app\llama.cpp\ggml-vulkan.dll
load_backend: loaded CPU backend from D:\AI\app\llama.cpp\ggml-cpu-haswell.dll
build: 5361 (cf0a43bb) with MSVC 19.43.34808.0 for x64
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16
system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 8081, http threads: 15
main: loading model
srv load_model: loading model 'models/gpustack/bge-m3-FP16.gguf'
llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6600M) - 8176 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 389 tensors from models/gpustack/bge-m3-FP16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = bert
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.size_label str = 567M
llama_model_loader: - kv 3: general.license str = mit
llama_model_loader: - kv 4: general.tags arr[str,4] = ["sentence-transformers", "feature-ex...
llama_model_loader: - kv 5: bert.block_count u32 = 24
llama_model_loader: - kv 6: bert.context_length u32 = 8192
llama_model_loader: - kv 7: bert.embedding_length u32 = 1024
llama_model_loader: - kv 8: bert.feed_forward_length u32 = 4096
llama_model_loader: - kv 9: bert.attention.head_count u32 = 16
llama_model_loader: - kv 10: bert.attention.layer_norm_epsilon f32 = 0.000010
llama_model_loader: - kv 11: general.file_type u32 = 1
llama_model_loader: - kv 12: bert.attention.causal bool = false
llama_model_loader: - kv 13: bert.pooling_type u32 = 2
llama_model_loader: - kv 14: tokenizer.ggml.model str = t5
llama_model_loader: - kv 15: tokenizer.ggml.pre str = default
srv log_server_r: request: GET /health 127.0.0.1 503
[INFO] <11.Bge-m3> Health check error on http://localhost:8081/health, status code: 503
llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,250002] = ["<s>", "<pad>", "</s>", "<unk>", ","...
llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,250002] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,250002] = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.add_space_prefix bool = true
llama_model_loader: - kv 20: tokenizer.ggml.token_type_count u32 = 1
llama_model_loader: - kv 21: tokenizer.ggml.remove_extra_whitespaces bool = true
llama_model_loader: - kv 22: tokenizer.ggml.precompiled_charsmap arr[u8,237539] = [0, 180, 2, 0, 0, 132, 0, 0, 0, 0, 0,...
llama_model_loader: - kv 23: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 24: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 25: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 26: tokenizer.ggml.seperator_token_id u32 = 2
llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 28: tokenizer.ggml.cls_token_id u32 = 0
llama_model_loader: - kv 29: tokenizer.ggml.mask_token_id u32 = 250001
llama_model_loader: - kv 30: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 31: tokenizer.ggml.add_eos_token bool = true
llama_model_loader: - kv 32: general.quantization_version u32 = 2
llama_model_loader: - type f32: 244 tensors
llama_model_loader: - type f16: 145 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = F16
print_info: file size = 1.07 GiB (16.25 BPW)
load: model vocab missing newline token, using special_pad_id instead
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 4
load: token to piece cache size = 2.1668 MB
print_info: arch = bert
print_info: vocab_only = 0
print_info: n_ctx_train = 8192
print_info: n_embd = 1024
print_info: n_layer = 24
print_info: n_head = 16
print_info: n_head_kv = 16
print_info: n_rot = 64
print_info: n_swa = 0
print_info: n_swa_pattern = 1
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 1.0e-05
print_info: f_norm_rms_eps = 0.0e+00
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 4096
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 0
print_info: pooling type = 2
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 8192
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 335M
print_info: model params = 566.70 M
print_info: general.name = n/a
print_info: vocab type = UGM
print_info: n_vocab = 250002
print_info: n_merges = 0
print_info: BOS token = 0 '<s>'
print_info: EOS token = 2 '</s>'
print_info: UNK token = 3 '<unk>'
print_info: SEP token = 2 '</s>'
print_info: PAD token = 1 '<pad>'
print_info: MASK token = 250001 '[PAD250000]'
print_info: LF token = 0 '<s>'
print_info: EOG token = 2 '</s>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 24 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 25/25 layers to GPU
load_tensors: Vulkan0 model buffer size = 577.22 MiB
load_tensors: CPU_Mapped model buffer size = 520.30 MiB
.......................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 8192
llama_context: n_ctx_per_seq = 8192
llama_context: n_batch = 8192
llama_context: n_ubatch = 512
llama_context: causal_attn = 0
llama_context: flash_attn = 0
llama_context: freq_base = 10000.0
llama_context: freq_scale = 0.75
llama_context: Vulkan_Host output buffer size = 0.00 MiB
common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting
common_init_from_params: setting dry_penalty_last_n to ctx_size = 8192
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
decode: cannot decode batches with this context (use llama_encode() instead)
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 8192
main: model loaded
main: chat template, chat_template: {%- for message in messages -%} {{- '<|im_start|>' + message.role + '' + message.content + '<|im_end|>' -}}{%- endfor -%}{%- if add_generation_prompt -%} {{- '<|im_start|>assistant' -}}{%- endif -%}, example_format: '<|im_start|>systemYou are a helpful assistant<|im_end|><|im_start|>userHello<|im_end|><|im_start|>assistantHi there<|im_end|><|im_start|>userHow are you?<|im_end|><|im_start|>assistant'
main: server is listening on http://0.0.0.0:8081 - starting the main loop
srv update_slots: all slots are idle
[INFO] <11.Bge-m3> Health check passed on http://localhost:8081/health
srv log_server_r: request: POST /embeddings 172.31.137.1 200
slot launch_slot_: id 0 | task 2 | processing task
slot update_slots: id 0 | task 2 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 26
slot update_slots: id 0 | task 2 | kv cache rm [0, end)
slot update_slots: id 0 | task 2 | prompt processing progress, n_past = 26, n_tokens = 26, progress = 1.000000
slot update_slots: id 0 | task 2 | prompt done, n_past = 26, n_tokens = 26
slot release: id 0 | task 2 | stop processing: n_past = 26, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /embeddings 172.31.137.1 200
slot launch_slot_: id 0 | task 4 | processing task
slot update_slots: id 0 | task 4 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 5
slot update_slots: id 0 | task 4 | kv cache rm [0, end)
slot update_slots: id 0 | task 4 | prompt processing progress, n_past = 5, n_tokens = 5, progress = 1.000000
slot update_slots: id 0 | task 4 | prompt done, n_past = 5, n_tokens = 5
slot release: id 0 | task 4 | stop processing: n_past = 5, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /embeddings 172.31.137.1 200
slot launch_slot_: id 0 | task 6 | processing task
slot update_slots: id 0 | task 6 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 8
slot update_slots: id 0 | task 6 | kv cache rm [0, end)
slot update_slots: id 0 | task 6 | prompt processing progress, n_past = 8, n_tokens = 8, progress = 1.000000
slot update_slots: id 0 | task 6 | prompt done, n_past = 8, n_tokens = 8
slot release: id 0 | task 6 | stop processing: n_past = 8, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /embeddings 172.31.137.1 200
slot launch_slot_: id 0 | task 8 | processing task
slot update_slots: id 0 | task 8 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 7
slot update_slots: id 0 | task 8 | kv cache rm [0, end)
slot update_slots: id 0 | task 8 | prompt processing progress, n_past = 7, n_tokens = 7, progress = 1.000000
slot update_slots: id 0 | task 8 | prompt done, n_past = 7, n_tokens = 7
slot release: id 0 | task 8 | stop processing: n_past = 7, truncated = 0
slot launch_slot_: id 0 | task 10 | processing task
slot update_slots: id 0 | task 10 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 9
slot update_slots: id 0 | task 10 | kv cache rm [0, end)
slot update_slots: id 0 | task 10 | prompt processing progress, n_past = 9, n_tokens = 9, progress = 1.000000
slot update_slots: id 0 | task 10 | prompt done, n_past = 9, n_tokens = 9
srv log_server_r: request: POST /embeddings 172.31.137.1 200
slot release: id 0 | task 10 | stop processing: n_past = 9, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /embeddings 172.31.137.1 200
slot launch_slot_: id 0 | task 12 | processing task
slot update_slots: id 0 | task 12 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 9
slot update_slots: id 0 | task 12 | kv cache rm [0, end)
slot update_slots: id 0 | task 12 | prompt processing progress, n_past = 9, n_tokens = 9, progress = 1.000000
slot update_slots: id 0 | task 12 | prompt done, n_past = 9, n_tokens = 9
slot release: id 0 | task 12 | stop processing: n_past = 9, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /embeddings 172.31.137.1 200
slot launch_slot_: id 0 | task 14 | processing task
slot update_slots: id 0 | task 14 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 8
slot update_slots: id 0 | task 14 | kv cache rm [0, end)
slot update_slots: id 0 | task 14 | prompt processing progress, n_past = 8, n_tokens = 8, progress = 1.000000
slot update_slots: id 0 | task 14 | prompt done, n_past = 8, n_tokens = 8
slot release: id 0 | task 14 | stop processing: n_past = 8, truncated = 0
slot launch_slot_: id 0 | task 16 | processing task
slot update_slots: id 0 | task 16 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 7
slot update_slots: id 0 | task 16 | kv cache rm [0, end)
slot update_slots: id 0 | task 16 | prompt processing progress, n_past = 7, n_tokens = 7, progress = 1.000000
slot update_slots: id 0 | task 16 | prompt done, n_past = 7, n_tokens = 7
srv log_server_r: request: POST /embeddings 172.31.137.1 200
slot release: id 0 | task 16 | stop processing: n_past = 7, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /embeddings 172.31.137.1 200
slot launch_slot_: id 0 | task 18 | processing task
slot update_slots: id 0 | task 18 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 9
slot update_slots: id 0 | task 18 | kv cache rm [0, end)
slot update_slots: id 0 | task 18 | prompt processing progress, n_past = 9, n_tokens = 9, progress = 1.000000
slot update_slots: id 0 | task 18 | prompt done, n_past = 9, n_tokens = 9
slot release: id 0 | task 18 | stop processing: n_past = 9, truncated = 0
srv update_slots: all slots are idle
srv log_server_r: request: POST /embeddings 172.31.137.1 200