Multi-node VLM inference with RPC #14386
-
What is the exact problem you are running into?
-
TLDR: the university WiFi probably blocked multi-device communication; the problem was resolved after moving both machines to a personal hotspot. New problem: I can't load larger-than-memory models across multiple devices even though the model shards could fit on each device (Qwen2.5-VL-72B-Instruct-UD-Q6_K_XL.gguf -> 2 shards of ~25 GB each, while the maximum RAM cutoff per 64 GB device seems to be around 32-37 GB).

Hey, thanks for the quick reply. I don't have the error logs at hand, but:

./build/bin/llama-mtmd-cli -m $QWEN2_VL --mmproj $QWEN2_VL_mmproj -p "You are an expert maze solver. Reason how to get from the green to the red dot step by step." --image ../maze.png -ngl 37 -t 12 -fa --rpc 172.22.160.24:50052,172.22.160.133:50052 --repeat-penalty 1.0 -n 128 --tensor-split 0.5,0.5 -sm layer
build: 5656 (b7cc7745) with Apple clang version 17.0.0 (clang-1700.0.13.5) for arm64-apple-darwin24.5.0
llama_model_load_from_file_impl: using device RPC[172.22.160.24:50052] (RPC[172.22.160.24:50052]) - 0 MiB free

For some reason the university WiFi network must have blocked the connection, because I kept getting that the RPC device on the remote PC had 0 MiB free. Once I connected both devices to my personal hotspot, the issue was resolved.
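For anyone hitting the same symptom: before blaming memory, it is worth checking that the rpc-server TCP port is actually reachable from the client machine, since a plain ping can succeed while the port is still blocked (which seems to be what the university network was doing). A minimal check, assuming macOS's bundled netcat and the IP/port from the failing run above:

# run from the client Mac; succeeds only if the rpc-server port accepts TCP connections
nc -vz 172.22.160.24 50052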
The environment variables used in the commands below:

export QWEN72_VL_3bit=~/.lmstudio/models/unsloth/Qwen2.5-VL-72B-Instruct-GGUF/Qwen2.5-VL-72B-Instruct-Q3_K_S.gguf
export QWEN72_VL_3bit_mmproj=~/.lmstudio/models/unsloth/Qwen2.5-VL-72B-Instruct-GGUF/mmproj-F32.gguf
export QWEN72_VL=~/.lmstudio/models/unsloth/Qwen2.5-VL-72B-Instruct-GGUF/Qwen2.5-VL-72B-Instruct-UD-Q6_K_XL.gguf
export QWEN72_VL_mmproj=~/.lmstudio/models/unsloth/Qwen2.5-VL-72B-Instruct-GGUF/mmproj-F32.gguf

I have managed to run inference with models that fit on a single machine using RPC, such as Qwen2.5-VL-72B_Instruct_Q3_K_S (it also fits on a single machine):

(mlx-project) kios@kiosmacmini3 llama.cpp % ./build/bin/llama-mtmd-cli -m $QWEN72_VL_3bit --mmproj $QWEN72_VL_3bit_mmproj --image ../maze.png -p "You are an expert maze solver. Reason how to get from the green to the red dot step by step." -ngl 81 -t 12 -fa
build: 5656 (b7cc7745) with Apple clang version 17.0.0 (clang-1700.0.13.5) for arm64-apple-darwin24.5.0
llama_model_load_from_file_impl: using device Metal (Apple M4 Pro) - 49151 MiB free
llama_model_loader: loaded meta data with 41 key-value pairs and 963 tensors from /Users/kios/.lmstudio/models/unsloth/Qwen2.5-VL-72B-Instruct-GGUF/Qwen2.5-VL-72B-Instruct-Q3_K_S.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2vl
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen2.5-Vl-72B-Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Qwen2.5-Vl-72B-Instruct
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 72B
llama_model_loader: - kv 7: general.license str = other
llama_model_loader: - kv 8: general.license.name str = qwen
llama_model_loader: - kv 9: general.license.link str = https://huggingface.co/Qwen/Qwen2.5-V...
llama_model_loader: - kv 10: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 11: general.base_model.count u32 = 1
llama_model_loader: - kv 12: general.base_model.0.name str = Qwen2.5 VL 72B Instruct
llama_model_loader: - kv 13: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 14: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen2.5-V...
llama_model_loader: - kv 15: general.tags arr[str,3] = ["multimodal", "unsloth", "image-text...
llama_model_loader: - kv 16: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 17: qwen2vl.block_count u32 = 80
llama_model_loader: - kv 18: qwen2vl.context_length u32 = 128000
llama_model_loader: - kv 19: qwen2vl.embedding_length u32 = 8192
llama_model_loader: - kv 20: qwen2vl.feed_forward_length u32 = 29568
llama_model_loader: - kv 21: qwen2vl.attention.head_count u32 = 64
llama_model_loader: - kv 22: qwen2vl.attention.head_count_kv u32 = 8
llama_model_loader: - kv 23: qwen2vl.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 24: qwen2vl.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 25: qwen2vl.rope.dimension_sections arr[i32,4] = [16, 24, 24, 0]
llama_model_loader: - kv 26: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 27: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 28: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 30: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 31: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 33: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 34: tokenizer.chat_template str = {% set image_count = namespace(value=...
llama_model_loader: - kv 35: general.quantization_version u32 = 2
llama_model_loader: - kv 36: general.file_type u32 = 11
llama_model_loader: - kv 37: quantize.imatrix.file str = Qwen2.5-VL-72B-Instruct-GGUF/imatrix_...
llama_model_loader: - kv 38: quantize.imatrix.dataset str = unsloth_calibration_Qwen2.5-VL-72B-In...
llama_model_loader: - kv 39: quantize.imatrix.entries_count i32 = 560
llama_model_loader: - kv 40: quantize.imatrix.chunks_count i32 = 691
llama_model_loader: - type f32: 401 tensors
llama_model_loader: - type q3_K: 401 tensors
llama_model_loader: - type q5_K: 80 tensors
llama_model_loader: - type q6_K: 1 tensors
llama_model_loader: - type iq4_nl: 80 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q3_K - Small
print_info: file size = 32.11 GiB (3.79 BPW)
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2vl
print_info: vocab_only = 0
print_info: n_ctx_train = 128000
print_info: n_embd = 8192
print_info: n_layer = 80
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 29568
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 8
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 128000
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 70B
print_info: model params = 72.71 B
print_info: general.name = Qwen2.5-Vl-72B-Instruct
print_info: vocab type = BPE
print_info: n_vocab = 152064
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 80 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 81/81 layers to GPU
load_tensors: Metal_Mapped model buffer size = 32884.41 MiB
load_tensors: CPU_Mapped model buffer size = 510.47 MiB
..................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (128000) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M4 Pro
ggml_metal_init: picking default device: Apple M4 Pro
ggml_metal_load_library: using embedded metal library
ggml_metal_init: GPU name: Apple M4 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets = true
ggml_metal_init: has bfloat = true
ggml_metal_init: use bfloat = false
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 51539.61 MB
ggml_metal_init: skipping kernel_get_rows_bf16 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f16 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h192 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk576_hv512 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h64 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h96 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h192 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk576_hv512 (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16 (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16 (not supported)
llama_context: CPU output buffer size = 0.58 MiB
llama_kv_cache_unified: Metal KV buffer size = 1280.00 MiB
llama_kv_cache_unified: size = 1280.00 MiB ( 4096 cells, 80 layers, 1 seqs), K (f16): 640.00 MiB, V (f16): 640.00 MiB
llama_context: Metal compute buffer size = 313.00 MiB
llama_context: CPU compute buffer size = 24.01 MiB
llama_context: graph nodes = 2807
llama_context: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
mtmd_cli_context: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
clip_model_loader: model name: Qwen2.5-Vl-72B-Instruct
clip_model_loader: description:
clip_model_loader: GGUF version: 3
clip_model_loader: alignment: 32
clip_model_loader: n_tensors: 519
clip_model_loader: n_kv: 33
clip_model_loader: has vision encoder
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M4 Pro
ggml_metal_init: picking default device: Apple M4 Pro
ggml_metal_init: GPU name: Apple M4 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets = true
ggml_metal_init: has bfloat = true
ggml_metal_init: use bfloat = false
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 51539.61 MB
ggml_metal_init: skipping kernel_get_rows_bf16 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f16 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h192 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk576_hv512 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h64 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h96 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h192 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk576_hv512 (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16 (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16 (not supported)
clip_ctx: CLIP using Metal backend
load_hparams: projector: qwen2.5vl_merger
load_hparams: n_embd: 1280
load_hparams: n_head: 16
load_hparams: n_ff: 3456
load_hparams: n_layer: 32
load_hparams: ffn_op: silu
load_hparams: projection_dim: 8192
--- vision hparams ---
load_hparams: image_size: 1024
load_hparams: patch_size: 14
load_hparams: has_llava_proj: 0
load_hparams: minicpmv_version: 0
load_hparams: proj_scale_factor: 0
load_hparams: n_wa_pattern: 8
load_hparams: model size: 2684.86 MiB
load_hparams: metadata size: 0.18 MiB
alloc_compute_meta: Metal compute buffer size = 2.79 MiB
alloc_compute_meta: CPU compute buffer size = 0.16 MiB
main: loading model: /Users/kios/.lmstudio/models/unsloth/Qwen2.5-VL-72B-Instruct-GGUF/Qwen2.5-VL-72B-Instruct-Q3_K_S.gguf
encoding image slice...
image slice encoded in 399 ms
decoding image batch 1/1, n_tokens_batch = 121
image decoded (batch 1/1) in 3891 ms
To solve this maze and get from the green dot to the red dot, follow these steps:
1. Start at the green dot in the top-left corner.
2. Move right one square.
3. Move down one square.
4. Move right one square.
5. Move down one square.
6. Move right one square.
7. Move down one square.
8. Move right one square.
9. Move down one square.
10. Move right one square.
11. Move down one square.
12. Move right one square.
13. Move down one square.
14. Move right one square.
15. Move down one square.
16. Move right one square.
17. Move down one square.
18. Move right one square.
19. Move down one square.
20. Move right one square.
21. Move down one^C
llama_perf_context_print: load time = 13417.18 ms
llama_perf_context_print: prompt eval time = 6367.43 ms / 153 tokens ( 41.62 ms per token, 24.03 tokens per second)
llama_perf_context_print: eval time = 42245.17 ms / 181 runs ( 233.40 ms per token, 4.28 tokens per second)
llama_perf_context_print: total time = 49836.64 ms / 334 tokens

Qwen2.5-VL-72B_Instruct_Q6_K_M (does not fit on a single machine):

(mlx-project) kios@kiosmacmini3 llama.cpp % ./build/bin/llama-mtmd-cli -m $QWEN72_VL --mmproj $QWEN72_VL_mmproj --image ../maze.png -p "You are an expert maze solver. Reason how to get from the green to the red dot step by step." -ngl 71 -t 12 -fa --rpc 172.22.160.133:50052,172.22.160.24:50052 --repeat-penalty 1.0 -n 128 --tensor-split 0.5,0.5
build: 5656 (b7cc7745) with Apple clang version 17.0.0 (clang-1700.0.13.5) for arm64-apple-darwin24.5.0
llama_model_load_from_file_impl: using device RPC[172.22.160.133:50052] (RPC[172.22.160.133:50052]) - 0 MiB free
llama_model_load_from_file_impl: using device RPC[172.22.160.24:50052] (RPC[172.22.160.24:50052]) - 0 MiB free
llama_model_load_from_file_impl: using device Metal (Apple M4 Pro) - 49151 MiB free
llama_model_load: error loading model: invalid split file name: /Users/kios/.lmstudio/models/unsloth/Qwen2.5-VL-72B-Instruct-GGUF/Qwen2.5-VL-72B-Instruct-UD-Q6_K_XL.gguf
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/Users/kios/.lmstudio/models/unsloth/Qwen2.5-VL-72B-Instruct-GGUF/Qwen2.5-VL-72B-Instruct-UD-Q6_K_XL.gguf'
-
llama.cpp tries to load the whole model on the device, so distributed inference with RPC was only possible after I offloaded some of the layers (7/81) to the CPU so it wouldn't crash. I'll keep looking into whether I can load different shards on each Mac, or whether there is a tensor-parallelism implementation in a different branch (see the gguf-split sketch after the log below).

./build/bin/llama-mtmd-cli -m ~/.lmstudio/models/unsloth/Qwen2.5-VL-72B-Instruct-GGUF/Qwen2.5-VL-72B-Instruct-Q5_K_M.gguf --mmproj ~/.lmstudio/models/unsloth/Qwen2.5-VL-72B-Instruct-GGUF/mmproj-F32.gguf --image ../maze.png -p "You are an expert maze solver. Reason how to get from the green to the red dot step by step." -ngl 74 --rpc 192.168.100.3:50052,192.168.100.2:50052 --repeat-penalty 1.0 -n 128
build: 5656 (b7cc7745) with Apple clang version 17.0.0 (clang-1700.0.13.5) for arm64-apple-darwin24.5.0
llama_model_load_from_file_impl: using device RPC[192.168.100.3:50052] (RPC[192.168.100.3:50052]) - 49146 MiB free
llama_model_load_from_file_impl: using device RPC[192.168.100.2:50052] (RPC[192.168.100.2:50052]) - 49146 MiB free
llama_model_load_from_file_impl: using device Metal (Apple M4 Pro) - 49151 MiB free
llama_model_loader: loaded meta data with 44 key-value pairs and 963 tensors from /Users/kios/.lmstudio/models/unsloth/Qwen2.5-VL-72B-Instruct-GGUF/Qwen2.5-VL-72B-Instruct-Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2vl
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen2.5-Vl-72B-Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Qwen2.5-Vl-72B-Instruct
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 72B
llama_model_loader: - kv 7: general.license str = other
llama_model_loader: - kv 8: general.license.name str = qwen
llama_model_loader: - kv 9: general.license.link str = https://huggingface.co/Qwen/Qwen2.5-V...
llama_model_loader: - kv 10: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 11: general.base_model.count u32 = 1
llama_model_loader: - kv 12: general.base_model.0.name str = Qwen2.5 VL 72B Instruct
llama_model_loader: - kv 13: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 14: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen2.5-V...
llama_model_loader: - kv 15: general.tags arr[str,3] = ["multimodal", "unsloth", "image-text...
llama_model_loader: - kv 16: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 17: qwen2vl.block_count u32 = 80
llama_model_loader: - kv 18: qwen2vl.context_length u32 = 128000
llama_model_loader: - kv 19: qwen2vl.embedding_length u32 = 8192
llama_model_loader: - kv 20: qwen2vl.feed_forward_length u32 = 29568
llama_model_loader: - kv 21: qwen2vl.attention.head_count u32 = 64
llama_model_loader: - kv 22: qwen2vl.attention.head_count_kv u32 = 8
llama_model_loader: - kv 23: qwen2vl.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 24: qwen2vl.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 25: qwen2vl.rope.dimension_sections arr[i32,4] = [16, 24, 24, 0]
llama_model_loader: - kv 26: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 27: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 28: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 30: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 31: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 33: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 34: tokenizer.chat_template str = {% set image_count = namespace(value=...
llama_model_loader: - kv 35: general.quantization_version u32 = 2
llama_model_loader: - kv 36: general.file_type u32 = 17
llama_model_loader: - kv 37: quantize.imatrix.file str = Qwen2.5-VL-72B-Instruct-GGUF/imatrix_...
llama_model_loader: - kv 38: quantize.imatrix.dataset str = unsloth_calibration_Qwen2.5-VL-72B-In...
llama_model_loader: - kv 39: quantize.imatrix.entries_count i32 = 560
llama_model_loader: - kv 40: quantize.imatrix.chunks_count i32 = 691
llama_model_loader: - kv 41: split.no u16 = 0
llama_model_loader: - kv 42: split.tensors.count i32 = 963
llama_model_loader: - kv 43: split.count u16 = 0
llama_model_loader: - type f32: 401 tensors
llama_model_loader: - type q5_1: 40 tensors
llama_model_loader: - type q8_0: 40 tensors
llama_model_loader: - type q5_K: 441 tensors
llama_model_loader: - type q6_K: 41 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q5_K - Medium
print_info: file size = 50.70 GiB (5.99 BPW)
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2vl
print_info: vocab_only = 0
print_info: n_ctx_train = 128000
print_info: n_embd = 8192
print_info: n_layer = 80
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 29568
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 8
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 128000
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 70B
print_info: model params = 72.71 B
print_info: general.name = Qwen2.5-Vl-72B-Instruct
print_info: vocab type = BPE
print_info: n_vocab = 152064
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 74 repeating layers to GPU
load_tensors: offloaded 74/81 layers to GPU
load_tensors: Metal_Mapped model buffer size = 15258.20 MiB
load_tensors: CPU_Mapped model buffer size = 5770.67 MiB
load_tensors: RPC[192.168.100.3:50052] model buffer size = 15555.16 MiB
load_tensors: RPC[192.168.100.2:50052] model buffer size = 15335.41 MiB
...................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (128000) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M4 Pro
ggml_metal_init: picking default device: Apple M4 Pro
ggml_metal_load_library: using embedded metal library
ggml_metal_init: GPU name: Apple M4 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets = true
ggml_metal_init: has bfloat = true
ggml_metal_init: use bfloat = false
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 51539.61 MB
ggml_metal_init: skipping kernel_get_rows_bf16 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f16 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h192 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk576_hv512 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h64 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h96 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h192 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk576_hv512 (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16 (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16 (not supported)
llama_context: CPU output buffer size = 0.58 MiB
llama_kv_cache_unified: Metal KV buffer size = 384.00 MiB
llama_kv_cache_unified: CPU KV buffer size = 96.00 MiB
llama_kv_cache_unified: RPC[192.168.100.3:50052] KV buffer size = 400.00 MiB
llama_kv_cache_unified: RPC[192.168.100.2:50052] KV buffer size = 400.00 MiB
llama_kv_cache_unified: size = 1280.00 MiB ( 4096 cells, 80 layers, 1 seqs), K (f16): 640.00 MiB, V (f16): 640.00 MiB
llama_context: RPC[192.168.100.3:50052] compute buffer size = 584.01 MiB
llama_context: RPC[192.168.100.2:50052] compute buffer size = 584.01 MiB
llama_context: Metal compute buffer size = 584.01 MiB
llama_context: CPU compute buffer size = 584.01 MiB
llama_context: graph nodes = 3126
llama_context: graph splits = 101 (with bs=512), 5 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
mtmd_cli_context: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
clip_model_loader: model name: Qwen2.5-Vl-72B-Instruct
clip_model_loader: description:
clip_model_loader: GGUF version: 3
clip_model_loader: alignment: 32
clip_model_loader: n_tensors: 519
clip_model_loader: n_kv: 33
clip_model_loader: has vision encoder
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M4 Pro
ggml_metal_init: picking default device: Apple M4 Pro
ggml_metal_init: GPU name: Apple M4 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets = true
ggml_metal_init: has bfloat = true
ggml_metal_init: use bfloat = false
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 51539.61 MB
ggml_metal_init: skipping kernel_get_rows_bf16 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4 (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16 (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f16 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h192 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk576_hv512 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h64 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h96 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h192 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk576_hv512 (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16 (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32 (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16 (not supported)
clip_ctx: CLIP using Metal backend
load_hparams: projector: qwen2.5vl_merger
load_hparams: n_embd: 1280
load_hparams: n_head: 16
load_hparams: n_ff: 3456
load_hparams: n_layer: 32
load_hparams: ffn_op: silu
load_hparams: projection_dim: 8192
--- vision hparams ---
load_hparams: image_size: 1024
load_hparams: patch_size: 14
load_hparams: has_llava_proj: 0
load_hparams: minicpmv_version: 0
load_hparams: proj_scale_factor: 0
load_hparams: n_wa_pattern: 8
load_hparams: model size: 2684.86 MiB
load_hparams: metadata size: 0.18 MiB
alloc_compute_meta: Metal compute buffer size = 2.79 MiB
alloc_compute_meta: CPU compute buffer size = 0.16 MiB
main: loading model: /Users/kios/.lmstudio/models/unsloth/Qwen2.5-VL-72B-Instruct-GGUF/Qwen2.5-VL-72B-Instruct-Q5_K_M.gguf
encoding image slice...
image slice encoded in 398 ms
decoding image batch 1/1, n_tokens_batch = 121
image decoded (batch 1/1) in 5459 ms
To solve the maze and get from the green dot to the red dot, follow these steps:
1. Start at the green dot in the top-left corner.
2. Move right one square.
3. Move down one square.
4. Move right two squares.
5. Move down one square.
6. Move right one square.
7. Move down one square.
8. Move right one square.
9. Move down one square.
10. Move right one square.
11. Move down one square.
12. Move right one square.
13. Move down one square.
14. Move right one square.
llama_perf_context_print: load time = 69619.52 ms
llama_perf_context_print: prompt eval time = 13406.89 ms / 153 tokens ( 87.63 ms per token, 11.41 tokens per second)
llama_perf_context_print: eval time = 39589.15 ms / 127 runs ( 311.73 ms per token, 3.21 tokens per second)
llama_perf_context_print: total time = 54582.39 ms / 280 tokens
ggml_metal_free: deallocating
ggml_metal_free: deallocating
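One thing I still want to try for the larger-than-memory case: pre-splitting the GGUF into shards with llama-gguf-split (the tool ships with llama.cpp), so that no single file exceeds what one 64 GB Mac will map. This is only a sketch and I haven't verified that it changes how the RPC scheduler places the weights; the 25G cap and the output prefix are illustrative:

# split the Q6_K_XL file into <= ~25 GB shards; output files get -00001-of-0000N.gguf suffixes
./build/bin/llama-gguf-split --split --split-max-size 25G \
  $QWEN72_VL \
  ~/.lmstudio/models/unsloth/Qwen2.5-VL-72B-Instruct-GGUF/Qwen2.5-VL-72B-Instruct-UD-Q6_K_XL-split
# when loading, -m points at the first shard and llama.cpp discovers the rest by the naming scheme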
-
Hey guys,
I am trying to run distributed inference of a VLM such as Qwen2.5-VL-72B-Instruct-GGUF on my Mac cluster at work. Currently I am experimenting with two Mac Mini M4 Pro 64GB machines. I have read through:
I have managed to:
- build llama.cpp with the RPC and METAL backends and launch the server/clients
- run Qwen3-8B, Qwen3-14B etc. using the rpc backend and llama-cli
However, I have been unable to run Qwen2.5-VL-3B-Instruct using llama-mtmd-cli and rpc.
Steps to reproduce:
1. Build the same llama.cpp release with rpc and METAL backends on both Macs
brew install grpc protobuf git-lfs
cmake -B build -DGGML_METAL=1 -DGGML_RPC=ON
cmake --build . --config Release -j 32
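A quick sanity check after the build (assuming the default build directory) is that both binaries used in the later steps are actually present:

ls build/bin/rpc-server build/bin/llama-mtmd-cli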
2. Set manual IP addresses in the same network
3. Install the same model at the same location on both Macs
4. Start rpc servers on both machines, assign maximum memory and enable cache if you want:
./build/bin/rpc-server -H $STATIC_IP_2 -p 50052 --mem 32000 -c
./build/bin/rpc-server -H $STATIC_IP_3 -p 50052 --mem 32000 -c
5. Ping each Mac from the other to ensure they can reach each other
From MacMini number 2
ping $STATIC_IP_3
From MacMini number 3
ping $STATIC_IP_2
6. Set environment variables to shorten commands:
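For example, the variables used elsewhere in this thread (adjust the paths to wherever the GGUFs live on your machines; the same pattern applies to whichever model you are testing, e.g. the 3B one):

export QWEN72_VL=~/.lmstudio/models/unsloth/Qwen2.5-VL-72B-Instruct-GGUF/Qwen2.5-VL-72B-Instruct-UD-Q6_K_XL.gguf
export QWEN72_VL_mmproj=~/.lmstudio/models/unsloth/Qwen2.5-VL-72B-Instruct-GGUF/mmproj-F32.gguf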
7. Run mtmd-cli with rpc flags
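A sketch of the invocation, using the variables from step 6 and the static IPs from step 4 (the flags mirror the runs elsewhere in this thread; the 0.5,0.5 split assumes two equally sized RPC hosts, and the 3B model takes the same shape of command):

./build/bin/llama-mtmd-cli -m $QWEN72_VL --mmproj $QWEN72_VL_mmproj \
  --image ../maze.png \
  -p "You are an expert maze solver. Reason how to get from the green to the red dot step by step." \
  -ngl 81 -t 12 -fa \
  --rpc $STATIC_IP_2:50052,$STATIC_IP_3:50052 \
  --tensor-split 0.5,0.5 --repeat-penalty 1.0 -n 128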