Multi-node VLM inference with RPC #14386

HarrySoteriou · 2025-06-26T09:17:47Z

HarrySoteriou
Jun 26, 2025

Hey guys,

I am trying to run distributed inference of a VLM such as Qwen2.5-VL-72B-Instruct-GGUF on my Mac cluster at work. Currently I am experimenting with 2 Mac Mini's M4 Pro 64GB.

I have read through:

RPC server https://github.com/ggml-org/llama.cpp/tree/master/tools/rpc
llama-mtmd-cli https://github.com/ggml-org/llama.cpp/tree/master/tools/mtmd
Relevant issues such as server : refactor #5882 and server: Bring back multimodal support #8010

I have managed to:

Build llama.cpp with RPC and METAL backend and launch the server/clients
Run distributed LLM inference with models such as Qwen3-8B, Qwen3-14B etc.. using rpc backend and llama-cli

I have been unable to run Qwen2.5-VL-3B-Instruct using llama-mtmd-cli and rpc however.

Steps to reproduce:

1. Build the same llama.cpp release with rpc and METAL backends on both Macs

brew install grpc protobuf git-lfs
cmake -B build -DGGML_METAL=1 -DGGML_RPC=ON
cmake --build . --config Release -j 32

2. Set manual IP addresses in the same network

export STATIC_IP_2 = 192.168.100.2 # Macmini2 static IP
export STATIC_IP_3 = 192.168.100.3 # Macmini3 static IP

3. Install the same model at the same location on both Macs

./build/bin/llama-server -hf unsloth/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M

4. Start rpc servers on both machines, assign maximum memory and enable cache if you want:

./build/bin/rpc-server -H $STATIC_IP_2 -p 50052 --mem 32000 -c

./build/bin/rpc-server -H $STATIC_IP_3 -p 50052 --mem 32000 -c

5. Ping each pc from the other to ensure they can reach each other

From MacMini number 2

ping $STATIC_IP_3

From MacMini number 3

ping $STATIC_IP_2

6. Set environment variables to shorten commands:

export MODELS_ROOT=~/Library/Caches/llama.cpp
export QWEN2_VL_ROOT=Qwen/Qwen2.5-VL-3B-Instruct-GGUF
export QWEN3_DS_ROOT=unsloth/DeepSeek-R1-0528-Qwen3-8B

export QWEN2_VL=$MODELS_ROOT/$QWEN2_VL_ROOT/Qwen2.5-VL-3B-Instruct-Q4_K_M.gguf
export QWEN2_VL_mmproj=$MODELS_ROOT/$QWEN2_VL_ROOT/mmproj-F16.gguf

export QWEN3_DS=$MODELS_ROOT/$QWEN3_DS_ROOT/DeepSeek-R1-0528-Qwen3-8B-Q4_K_M.gguf

7. Run mtmd-cli with rpc flags

# Load and run a VLM:
./build/bin/llama-mtmd-cli -m $QWEN2_VL --mmproj $QWEN2_VL_mmproj \
--image "../maze.png" \
-p "You are an expert maze solver. Reason how to get from the green to the red dot step by step." \
-ngl 37 -t 12 -fa \
--rpc $STATIC_IP_3:50052,$STATIC_IP_2:50052 \
--repeat-penalty 1.0 \
-n 128

rgerganov · 2025-06-26T11:08:55Z

rgerganov
Jun 26, 2025
Collaborator

What is the exact problem you are running into?

2 replies

HarrySoteriou Jun 30, 2025
Author

TLDR;

University WIFI probably blocked multi-device communication. Problem resolved when moving to hotspot. New problem, can't load larger than memory models on multiple devices even though the model shards can fit on each device (Qwen2.5-VL-72B-Instruct-UD-Q6_K_XL.gguf -> 2 shards of 25GB each, maximum RAM cutoff per 64GB device seems to be around 32-37 GB).

Hey , thanks for the quick reply. I don't have the error logs at hand but:

./build/bin/llama-mtmd-cli -m $QWEN2_VL−−mmproj QWEN2_VL_mmproj -p "You are an expert maze solver. Reason how to get from the green to the red dot step by step." --image ../maze.png -ngl 37 -t 12 -fa --rpc 172.22.160.24:50052,172.22.160.133:50052 --repeat-penalty 1.0 -n 128 --tensor-split 0.5,0.5 -sm layer
build: 5656 (b7cc7745) with Apple clang version 17.0.0 (clang-1700.0.13.5) for arm64-apple-darwin24.5.0
llama_model_load_from_file_impl: using device RPC[172.22.160.24:50052] (RPC[172.22.160.24:50052]) - 0 MiB free

Although:

both pcs were on the same network with IP such as 172.22.160.24 and 172.22.160.133
I could ping each pc from the other
the firewall was disabled on both devices

For some reason the university WiFi network must have blocked the connection because I kept getting that the RPC of the remote pc had 0 MiB free. Once I connected both of the devices to my personal hotspot the issue was resolved.

export QWEN72_VL_3bit=~/.lmstudio/models/unsloth/Qwen2.5-VL-72B-Instruct-GGUF/Qwen2.5-VL-72B-Instruct-Q3_K_S.gguf
export QWEN72_VL_3bit_mmproj=~/.lmstudio/models/unsloth/Qwen2.5-VL-72B-Instruct-GGUF/mmproj-F32.gguf


export QWEN72_VL=~/.lmstudio/models/unsloth/Qwen2.5-VL-72B-Instruct-GGUF/Qwen2.5-VL-72B-Instruct-UD-Q6_K_XL.gguf
export QWEN72_VL_mmproj=~/.lmstudio/models/unsloth/Qwen2.5-VL-72B-Instruct-GGUF/mmproj-F32.gguf

I have managed to run inference with networks that fit on a single machine using RPC such as Qwen2.5-VL-72B-Instruct-Q3_K_S.gguf but not larger than single machine RAM networks such as Qwen2.5-VL-72B-Instruct-UD-Q6_K_XL.gguf

Qwen2.5-VL-72B_Instruct_Q3_K_S (fits on a single manchine as well)

(mlx-project) kios@kiosmacmini3 llama.cpp % ./build/bin/llama-mtmd-cli -m $QWEN72_VL_3bit --mmproj $QWEN72_VL_3bit_mmproj --image ../maze.png -p "You are an expert maze solver. Reason how to get from the green to the red dot step by step." -ngl 81 -t 12 -fa
build: 5656 (b7cc7745) with Apple clang version 17.0.0 (clang-1700.0.13.5) for arm64-apple-darwin24.5.0
llama_model_load_from_file_impl: using device Metal (Apple M4 Pro) - 49151 MiB free
llama_model_loader: loaded meta data with 41 key-value pairs and 963 tensors from /Users/kios/.lmstudio/models/unsloth/Qwen2.5-VL-72B-Instruct-GGUF/Qwen2.5-VL-72B-Instruct-Q3_K_S.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2vl
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2.5-Vl-72B-Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen2.5-Vl-72B-Instruct
llama_model_loader: - kv   5:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   6:                         general.size_label str              = 72B
llama_model_loader: - kv   7:                            general.license str              = other
llama_model_loader: - kv   8:                       general.license.name str              = qwen
llama_model_loader: - kv   9:                       general.license.link str              = https://huggingface.co/Qwen/Qwen2.5-V...
llama_model_loader: - kv  10:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv  11:                   general.base_model.count u32              = 1
llama_model_loader: - kv  12:                  general.base_model.0.name str              = Qwen2.5 VL 72B Instruct
llama_model_loader: - kv  13:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  14:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2.5-V...
llama_model_loader: - kv  15:                               general.tags arr[str,3]       = ["multimodal", "unsloth", "image-text...
llama_model_loader: - kv  16:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  17:                        qwen2vl.block_count u32              = 80
llama_model_loader: - kv  18:                     qwen2vl.context_length u32              = 128000
llama_model_loader: - kv  19:                   qwen2vl.embedding_length u32              = 8192
llama_model_loader: - kv  20:                qwen2vl.feed_forward_length u32              = 29568
llama_model_loader: - kv  21:               qwen2vl.attention.head_count u32              = 64
llama_model_loader: - kv  22:            qwen2vl.attention.head_count_kv u32              = 8
llama_model_loader: - kv  23:                     qwen2vl.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  24:   qwen2vl.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  25:            qwen2vl.rope.dimension_sections arr[i32,4]       = [16, 24, 24, 0]
llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  30:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  31:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  32:            tokenizer.ggml.padding_token_id u32              = 151654
llama_model_loader: - kv  33:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  34:                    tokenizer.chat_template str              = {% set image_count = namespace(value=...
llama_model_loader: - kv  35:               general.quantization_version u32              = 2
llama_model_loader: - kv  36:                          general.file_type u32              = 11
llama_model_loader: - kv  37:                      quantize.imatrix.file str              = Qwen2.5-VL-72B-Instruct-GGUF/imatrix_...
llama_model_loader: - kv  38:                   quantize.imatrix.dataset str              = unsloth_calibration_Qwen2.5-VL-72B-In...
llama_model_loader: - kv  39:             quantize.imatrix.entries_count i32              = 560
llama_model_loader: - kv  40:              quantize.imatrix.chunks_count i32              = 691
llama_model_loader: - type  f32:  401 tensors
llama_model_loader: - type q3_K:  401 tensors
llama_model_loader: - type q5_K:   80 tensors
llama_model_loader: - type q6_K:    1 tensors
llama_model_loader: - type iq4_nl:   80 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q3_K - Small
print_info: file size   = 32.11 GiB (3.79 BPW)
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2vl
print_info: vocab_only       = 0
print_info: n_ctx_train      = 128000
print_info: n_embd           = 8192
print_info: n_layer          = 80
print_info: n_head           = 64
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 29568
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = -1
print_info: rope type        = 8
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 128000
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 70B
print_info: model params     = 72.71 B
print_info: general.name     = Qwen2.5-Vl-72B-Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 152064
print_info: n_merges         = 151387
print_info: BOS token        = 11 ','
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151654 '<|vision_pad|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 80 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 81/81 layers to GPU
load_tensors: Metal_Mapped model buffer size = 32884.41 MiB
load_tensors:   CPU_Mapped model buffer size =   510.47 MiB
..................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (128000) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M4 Pro
ggml_metal_init: picking default device: Apple M4 Pro
ggml_metal_load_library: using embedded metal library
ggml_metal_init: GPU name:   Apple M4 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction   = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets    = true
ggml_metal_init: has bfloat            = true
ggml_metal_init: use bfloat            = false
ggml_metal_init: hasUnifiedMemory      = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 51539.61 MB
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f16                (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h192          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk192_hv128   (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk576_hv512   (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h64       (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h96       (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h192      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk576_hv512 (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16                     (not supported)
llama_context:        CPU  output buffer size =     0.58 MiB
llama_kv_cache_unified:      Metal KV buffer size =  1280.00 MiB
llama_kv_cache_unified: size = 1280.00 MiB (  4096 cells,  80 layers,  1 seqs), K (f16):  640.00 MiB, V (f16):  640.00 MiB
llama_context:      Metal compute buffer size =   313.00 MiB
llama_context:        CPU compute buffer size =    24.01 MiB
llama_context: graph nodes  = 2807
llama_context: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
mtmd_cli_context: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant

clip_model_loader: model name:   Qwen2.5-Vl-72B-Instruct
clip_model_loader: description:
clip_model_loader: GGUF version: 3
clip_model_loader: alignment:    32
clip_model_loader: n_tensors:    519
clip_model_loader: n_kv:         33

clip_model_loader: has vision encoder
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M4 Pro
ggml_metal_init: picking default device: Apple M4 Pro
ggml_metal_init: GPU name:   Apple M4 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction   = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets    = true
ggml_metal_init: has bfloat            = true
ggml_metal_init: use bfloat            = false
ggml_metal_init: hasUnifiedMemory      = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 51539.61 MB
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f16                (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h192          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk192_hv128   (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk576_hv512   (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h64       (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h96       (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h192      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk576_hv512 (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16                     (not supported)
clip_ctx: CLIP using Metal backend
load_hparams: projector:          qwen2.5vl_merger
load_hparams: n_embd:             1280
load_hparams: n_head:             16
load_hparams: n_ff:               3456
load_hparams: n_layer:            32
load_hparams: ffn_op:             silu
load_hparams: projection_dim:     8192

--- vision hparams ---
load_hparams: image_size:         1024
load_hparams: patch_size:         14
load_hparams: has_llava_proj:     0
load_hparams: minicpmv_version:   0
load_hparams: proj_scale_factor:  0
load_hparams: n_wa_pattern:       8

load_hparams: model size:         2684.86 MiB
load_hparams: metadata size:      0.18 MiB
alloc_compute_meta:      Metal compute buffer size =     2.79 MiB
alloc_compute_meta:        CPU compute buffer size =     0.16 MiB
main: loading model: /Users/kios/.lmstudio/models/unsloth/Qwen2.5-VL-72B-Instruct-GGUF/Qwen2.5-VL-72B-Instruct-Q3_K_S.gguf
encoding image slice...
image slice encoded in 399 ms
decoding image batch 1/1, n_tokens_batch = 121
image decoded (batch 1/1) in 3891 ms

To solve this maze and get from the green dot to the red dot, follow these steps:

1. Start at the green dot in the top-left corner.
2. Move right one square.
3. Move down one square.
4. Move right one square.
5. Move down one square.
6. Move right one square.
7. Move down one square.
8. Move right one square.
9. Move down one square.
10. Move right one square.
11. Move down one square.
12. Move right one square.
13. Move down one square.
14. Move right one square.
15. Move down one square.
16. Move right one square.
17. Move down one square.
18. Move right one square.
19. Move down one square.
20. Move right one square.
21. Move down one^C


llama_perf_context_print:        load time =   13417.18 ms
llama_perf_context_print: prompt eval time =    6367.43 ms /   153 tokens (   41.62 ms per token,    24.03 tokens per second)
llama_perf_context_print:        eval time =   42245.17 ms /   181 runs   (  233.40 ms per token,     4.28 tokens per second)
llama_perf_context_print:       total time =   49836.64 ms /   334 tokens

Qwen2.5-VL-72B_Instruct_Q6_K_M (does not fit on a single manchine)

(mlx-project) kios@kiosmacmini3 llama.cpp % ./build/bin/llama-mtmd-cli -m $QWEN72_VL --mmproj $QWEN72_VL_mmproj --image ../maze.png -p "You are an expert maze solver. Reason how to get from the green to the red dot step by step." -ngl 71 -t 12 -fa --rpc 172.22.160.133:50052,172.22.160.24:50052 --repeat-penalty 1.0 -n 128 --tensor-split 0.5,0.5
build: 5656 (b7cc7745) with Apple clang version 17.0.0 (clang-1700.0.13.5) for arm64-apple-darwin24.5.0
llama_model_load_from_file_impl: using device RPC[172.22.160.133:50052] (RPC[172.22.160.133:50052]) - 0 MiB free
llama_model_load_from_file_impl: using device RPC[172.22.160.24:50052] (RPC[172.22.160.24:50052]) - 0 MiB free
llama_model_load_from_file_impl: using device Metal (Apple M4 Pro) - 49151 MiB free
llama_model_load: error loading model: invalid split file name: /Users/kios/.lmstudio/models/unsloth/Qwen2.5-VL-72B-Instruct-GGUF/Qwen2.5-VL-72B-Instruct-UD-Q6_K_XL.gguf
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/Users/kios/.lmstudio/models/unsloth/Qwen2.5-VL-72B-Instruct-GGUF/Qwen2.5-VL-72B-Instruct-UD-Q6_K_XL.gguf'

HarrySoteriou Jul 1, 2025
Author

llama.cpp tries to load the whole model on the device, so distributed inference with RPC was only possible only after I offloaded some of the layers (7/81) to the CPU so it wouldn't crash. I ll keep on looking if I can load different shards on each Mac or if there is any tensor-prallelism implementation in a different branch.

./build/bin/llama-mtmd-cli -m  ~/.lmstudio/models/unsloth/Qwen2.5-VL-72B-Instruct-GGUF/Qwen2.5-VL-72B-Instruct-Q5_K_M.gguf --mmproj  ~/.lmstudio/models/unsloth/Qwen2.5-VL-72B-Instruct-GGUF/mmproj-F32.gguf --image ../maze.png -p "You are an expert maze solver. Reason how to get from the green to the red dot step by step." -ngl 74 --rpc 192.168.100.3:50052,192.168.100.2:50052 --repeat-penalty 1.0 -n 128
build: 5656 (b7cc7745) with Apple clang version 17.0.0 (clang-1700.0.13.5) for arm64-apple-darwin24.5.0
llama_model_load_from_file_impl: using device RPC[192.168.100.3:50052] (RPC[192.168.100.3:50052]) - 49146 MiB free
llama_model_load_from_file_impl: using device RPC[192.168.100.2:50052] (RPC[192.168.100.2:50052]) - 49146 MiB free
llama_model_load_from_file_impl: using device Metal (Apple M4 Pro) - 49151 MiB free
llama_model_loader: loaded meta data with 44 key-value pairs and 963 tensors from /Users/kios/.lmstudio/models/unsloth/Qwen2.5-VL-72B-Instruct-GGUF/Qwen2.5-VL-72B-Instruct-Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2vl
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2.5-Vl-72B-Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen2.5-Vl-72B-Instruct
llama_model_loader: - kv   5:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   6:                         general.size_label str              = 72B
llama_model_loader: - kv   7:                            general.license str              = other
llama_model_loader: - kv   8:                       general.license.name str              = qwen
llama_model_loader: - kv   9:                       general.license.link str              = https://huggingface.co/Qwen/Qwen2.5-V...
llama_model_loader: - kv  10:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv  11:                   general.base_model.count u32              = 1
llama_model_loader: - kv  12:                  general.base_model.0.name str              = Qwen2.5 VL 72B Instruct
llama_model_loader: - kv  13:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  14:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2.5-V...
llama_model_loader: - kv  15:                               general.tags arr[str,3]       = ["multimodal", "unsloth", "image-text...
llama_model_loader: - kv  16:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  17:                        qwen2vl.block_count u32              = 80
llama_model_loader: - kv  18:                     qwen2vl.context_length u32              = 128000
llama_model_loader: - kv  19:                   qwen2vl.embedding_length u32              = 8192
llama_model_loader: - kv  20:                qwen2vl.feed_forward_length u32              = 29568
llama_model_loader: - kv  21:               qwen2vl.attention.head_count u32              = 64
llama_model_loader: - kv  22:            qwen2vl.attention.head_count_kv u32              = 8
llama_model_loader: - kv  23:                     qwen2vl.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  24:   qwen2vl.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  25:            qwen2vl.rope.dimension_sections arr[i32,4]       = [16, 24, 24, 0]
llama_model_loader: - kv  26:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  27:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  28:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  30:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  31:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  32:            tokenizer.ggml.padding_token_id u32              = 151654
llama_model_loader: - kv  33:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  34:                    tokenizer.chat_template str              = {% set image_count = namespace(value=...
llama_model_loader: - kv  35:               general.quantization_version u32              = 2
llama_model_loader: - kv  36:                          general.file_type u32              = 17
llama_model_loader: - kv  37:                      quantize.imatrix.file str              = Qwen2.5-VL-72B-Instruct-GGUF/imatrix_...
llama_model_loader: - kv  38:                   quantize.imatrix.dataset str              = unsloth_calibration_Qwen2.5-VL-72B-In...
llama_model_loader: - kv  39:             quantize.imatrix.entries_count i32              = 560
llama_model_loader: - kv  40:              quantize.imatrix.chunks_count i32              = 691
llama_model_loader: - kv  41:                                   split.no u16              = 0
llama_model_loader: - kv  42:                        split.tensors.count i32              = 963
llama_model_loader: - kv  43:                                split.count u16              = 0
llama_model_loader: - type  f32:  401 tensors
llama_model_loader: - type q5_1:   40 tensors
llama_model_loader: - type q8_0:   40 tensors
llama_model_loader: - type q5_K:  441 tensors
llama_model_loader: - type q6_K:   41 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q5_K - Medium
print_info: file size   = 50.70 GiB (5.99 BPW)
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2vl
print_info: vocab_only       = 0
print_info: n_ctx_train      = 128000
print_info: n_embd           = 8192
print_info: n_layer          = 80
print_info: n_head           = 64
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 29568
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = -1
print_info: rope type        = 8
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 128000
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 70B
print_info: model params     = 72.71 B
print_info: general.name     = Qwen2.5-Vl-72B-Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 152064
print_info: n_merges         = 151387
print_info: BOS token        = 11 ','
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151654 '<|vision_pad|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 74 repeating layers to GPU
load_tensors: offloaded 74/81 layers to GPU
load_tensors: Metal_Mapped model buffer size = 15258.20 MiB
load_tensors:   CPU_Mapped model buffer size =  5770.67 MiB
load_tensors: RPC[192.168.100.3:50052] model buffer size = 15555.16 MiB
load_tensors: RPC[192.168.100.2:50052] model buffer size = 15335.41 MiB
...................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (128000) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M4 Pro
ggml_metal_init: picking default device: Apple M4 Pro
ggml_metal_load_library: using embedded metal library
ggml_metal_init: GPU name:   Apple M4 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction   = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets    = true
ggml_metal_init: has bfloat            = true
ggml_metal_init: use bfloat            = false
ggml_metal_init: hasUnifiedMemory      = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 51539.61 MB
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f16                (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h192          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk192_hv128   (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk576_hv512   (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h64       (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h96       (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h192      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk576_hv512 (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16                     (not supported)
llama_context:        CPU  output buffer size =     0.58 MiB
llama_kv_cache_unified:      Metal KV buffer size =   384.00 MiB
llama_kv_cache_unified:        CPU KV buffer size =    96.00 MiB
llama_kv_cache_unified: RPC[192.168.100.3:50052] KV buffer size =   400.00 MiB
llama_kv_cache_unified: RPC[192.168.100.2:50052] KV buffer size =   400.00 MiB
llama_kv_cache_unified: size = 1280.00 MiB (  4096 cells,  80 layers,  1 seqs), K (f16):  640.00 MiB, V (f16):  640.00 MiB
llama_context: RPC[192.168.100.3:50052] compute buffer size =   584.01 MiB
llama_context: RPC[192.168.100.2:50052] compute buffer size =   584.01 MiB
llama_context:      Metal compute buffer size =   584.01 MiB
llama_context:        CPU compute buffer size =   584.01 MiB
llama_context: graph nodes  = 3126
llama_context: graph splits = 101 (with bs=512), 5 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
mtmd_cli_context: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant

clip_model_loader: model name:   Qwen2.5-Vl-72B-Instruct
clip_model_loader: description:
clip_model_loader: GGUF version: 3
clip_model_loader: alignment:    32
clip_model_loader: n_tensors:    519
clip_model_loader: n_kv:         33

clip_model_loader: has vision encoder
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M4 Pro
ggml_metal_init: picking default device: Apple M4 Pro
ggml_metal_init: GPU name:   Apple M4 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction   = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets    = true
ggml_metal_init: has bfloat            = true
ggml_metal_init: use bfloat            = false
ggml_metal_init: hasUnifiedMemory      = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 51539.61 MB
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f16                (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h192          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk192_hv128   (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_hk576_hv512   (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h64       (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h96       (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h192      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk192_hv128 (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_hk576_hv512 (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16                     (not supported)
clip_ctx: CLIP using Metal backend
load_hparams: projector:          qwen2.5vl_merger
load_hparams: n_embd:             1280
load_hparams: n_head:             16
load_hparams: n_ff:               3456
load_hparams: n_layer:            32
load_hparams: ffn_op:             silu
load_hparams: projection_dim:     8192

--- vision hparams ---
load_hparams: image_size:         1024
load_hparams: patch_size:         14
load_hparams: has_llava_proj:     0
load_hparams: minicpmv_version:   0
load_hparams: proj_scale_factor:  0
load_hparams: n_wa_pattern:       8

load_hparams: model size:         2684.86 MiB
load_hparams: metadata size:      0.18 MiB
alloc_compute_meta:      Metal compute buffer size =     2.79 MiB
alloc_compute_meta:        CPU compute buffer size =     0.16 MiB
main: loading model: /Users/kios/.lmstudio/models/unsloth/Qwen2.5-VL-72B-Instruct-GGUF/Qwen2.5-VL-72B-Instruct-Q5_K_M.gguf
encoding image slice...
image slice encoded in 398 ms
decoding image batch 1/1, n_tokens_batch = 121
image decoded (batch 1/1) in 5459 ms

To solve the maze and get from the green dot to the red dot, follow these steps:

1. Start at the green dot in the top-left corner.
2. Move right one square.
3. Move down one square.
4. Move right two squares.
5. Move down one square.
6. Move right one square.
7. Move down one square.
8. Move right one square.
9. Move down one square.
10. Move right one square.
11. Move down one square.
12. Move right one square.
13. Move down one square.
14. Move right one square.


llama_perf_context_print:        load time =   69619.52 ms
llama_perf_context_print: prompt eval time =   13406.89 ms /   153 tokens (   87.63 ms per token,    11.41 tokens per second)
llama_perf_context_print:        eval time =   39589.15 ms /   127 runs   (  311.73 ms per token,     3.21 tokens per second)
llama_perf_context_print:       total time =   54582.39 ms /   280 tokens
ggml_metal_free: deallocating
ggml_metal_free: deallocating

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Multi-node VLM inference with RPC #14386

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment 2 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Multi-node VLM inference with RPC #14386

Uh oh!

HarrySoteriou Jun 26, 2025

1. Build the same llama.cpp release with rpc and METAL backends on both Macs

2. Set manual IP addresses in the same network

3. Install the same model at the same location on both Macs

4. Start rpc servers on both machines, assign maximum memory and enable cache if you want:

5. Ping each pc from the other to ensure they can reach each other

6. Set environment variables to shorten commands:

7. Run mtmd-cli with rpc flags

Replies: 1 comment · 2 replies

Uh oh!

rgerganov Jun 26, 2025 Collaborator

Uh oh!

HarrySoteriou Jun 30, 2025 Author

TLDR;

Qwen2.5-VL-72B_Instruct_Q3_K_S (fits on a single manchine as well)

Qwen2.5-VL-72B_Instruct_Q6_K_M (does not fit on a single manchine)

Uh oh!

HarrySoteriou Jul 1, 2025 Author

HarrySoteriou
Jun 26, 2025

Replies: 1 comment 2 replies

rgerganov
Jun 26, 2025
Collaborator

HarrySoteriou Jun 30, 2025
Author

HarrySoteriou Jul 1, 2025
Author